Data & Analytics Archives - AB Tasty
https://www.abtasty.com/topics/data-analytics/

Transaction Testing With AB Tasty's Report Copilot
https://www.abtasty.com/blog/transaction-testing-report-copilot/

Transaction testing, which focuses on increasing the rate of purchases, is a crucial strategy for boosting your website’s revenue. 

To begin, it’s essential to differentiate between conversion rate (CR) and average order value (AOV), as they provide distinct insights into customer behavior. Understanding these metrics helps you implement meaningful changes to improve transactions.

In this article, we’ll delve into the complexities of transaction metrics analysis and introduce our new tool, the “Report Copilot,” designed to simplify report analysis. Read on to learn more.

Transaction Testing

To understand how test variations impact total revenue, focus on two key metrics:

  • Conversion Rate (CR): This metric indicates whether sales are increasing or decreasing. Tactics to improve CR include simplifying the buying process, adding a “one-click checkout” feature, using social proof, or creating urgency through limited inventory.
  • Average Order Value (AOV): This measures how much each customer is buying. Strategies to enhance AOV include cross-selling or promoting higher-priced products.

By analyzing CR and AOV separately, you can pinpoint which metrics your variations impact and make informed decisions before implementation. For example, creating urgency through low inventory may boost CR but could reduce AOV by limiting the time users spend browsing additional products. After analyzing these metrics individually, evaluate their combined effect on your overall revenue.

Revenue Calculation

The following formula illustrates how CR and AOV influence revenue:

Revenue = Number of Visitors × Conversion Rate × AOV

In the first part of the equation (Number of Visitors × Conversion Rate), you determine how many visitors become customers. The second part (× AOV) calculates the total revenue generated by these customers.
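To make this concrete, here is a small illustration with invented numbers showing how the two components combine, including the mixed case where CR rises while AOV falls:

```python
# Illustrative numbers only: how CR and AOV combine into revenue.
def revenue(visitors, conversion_rate, aov):
    return visitors * conversion_rate * aov

baseline = revenue(10_000, 0.020, 80.0)   # 2.0% CR, 80 AOV      -> 16,000
cr_up    = revenue(10_000, 0.022, 80.0)   # CR +10%, AOV stable  -> 17,600
mixed    = revenue(10_000, 0.022, 72.0)   # CR +10%, AOV -10%    -> 15,840

print(f"baseline:          {baseline:,.0f}")
print(f"CR up, AOV stable: {cr_up:,.0f} ({cr_up / baseline - 1:+.1%})")
print(f"CR up, AOV down:   {mixed:,.0f} ({mixed / baseline - 1:+.1%})")
```

In this made-up example, a 10% lift in CR combined with a 10% drop in AOV leaves revenue slightly lower, a mixed case we'll come back to below.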

Consider these scenarios:

  • If both CR and AOV increase, revenue will rise.
  • If both CR and AOV decrease, revenue will fall.
  • If either CR or AOV increases while the other remains stable, revenue will increase.
  • If either CR or AOV decreases while the other remains stable, revenue will decrease.
  • Mixed changes in CR and AOV result in unpredictable revenue outcomes.

The last scenario, where CR and AOV move in opposite directions, is particularly complex due to the variability of AOV. Current statistical tools struggle to provide precise insights on AOV’s overall impact, as it can experience significant random fluctuations. For more on this, read our article “Beyond Conversion Rate.”

While these concepts may seem intricate, our goal is to simplify them for you. Recognizing that this analysis can be challenging, we’ve created the “Report Copilot” to automatically gather and interpret data from variations, offering valuable insights.

Report Copilot

The “Report Copilot” from AB Tasty automates data processing, eliminating the need for manual calculations. This tool empowers you to decide which tests are most beneficial for increasing revenue.

Here are a few examples from real use cases.

Winning Variation:

The left screenshot provides a detailed analysis, helping users draw conclusions about their experiment results. Experienced users may prefer the summarized view on the right, also available through the Report Copilot.

Complex Use Case:


The screenshot above demonstrates a case where CR and AOV move in opposite directions, which requires a deeper understanding of the context.

It’s important to note that the Report Copilot doesn’t make decisions for you; it highlights the most critical parts of your analysis, allowing you to make informed choices.

Conclusion

Transaction analysis is complex, requiring a breakdown of components like conversion rate and average order value to better understand their overall effect on revenue. 

We’ve developed the Report Copilot to assist AB Tasty users in this process. This feature leverages AB Tasty’s extensive experimentation dashboard to provide comprehensive, summarized analyses, simplifying decision-making and enhancing revenue strategies.

What is One-to-One Personalization in Marketing? (With 8 Examples)
https://www.abtasty.com/blog/1-1-personalization-and-data/

It’s no secret that today’s digital marketplace is highly competitive. Consumers are exposed to an increasingly high number of messages each day. How can you make your message relevant to your consumers and break through the noise?

To capture consumers’ attention, brands need to focus their attention on crafting unique user experiences to deliver 1:1 personalization based on data.

One of the most important focal points to convert visitors into customers and build customer loyalty is 1:1 personalization. More and more customers feel less motivated to complete a transaction when they’re online shopping if their experience is impersonal. Let’s take a look at some data from Forbes:

  • 80% of consumers are more likely to complete an online purchase with brands that offer personalized customer experiences.
  • 72% of consumers say they only engage with messaging that is personalized.
  • 66% of consumers say that encountering content that isn't personalized would deter them from making a purchase.

Customers want personalization. Think about when you walk into a physical store and an employee really listens to your needs, helps you find exactly what you’re looking for, or goes above and beyond your expectations to help you. That is exactly what customers want in the digital marketplace.

A unique, digital one-to-one personalization experience strategy gives companies the potential to customize messages, offers, and other experiences to each website visitor based on data collected about each user.

Digital one-to-one personalization starts with concrete data. Are you leveraging data to better serve and convert your visitors?

To help you answer "yes" to this question, we'll take a deeper look at what one-to-one personalization in marketing is, the data it requires, and how to put it into practice.

What is one-to-one personalization in marketing?

Delivering a unique (or one-to-one) experience to each online consumer is a technique known as one-to-one personalization in marketing.

By mastering the technique of 1:1 personalization, brands can deliver an exceptional level of customer service by providing personalized messages, product recommendations, offers, and specialized content at the right time based on the user’s needs and expectations.

This type of unique user experience is only made possible thanks to the availability of extensive customer data. If you don’t get to know your customers based on their interactions with your brand and user behavior, you’re missing an opportunity to meet your customers’ expectations.

One goal of personalization is to create a “wow” effect. This means you should be making the customer think, “wow, they really know me.” The more information that a company knows about a certain customer, the more personalized the user experience will be.

Without extensive, personalized data, one-to-one personalization isn’t achievable.

What data to collect to improve your customer experience with personalization?

On a wider scale, it’s important to understand the location of your customer, their demographic information (age, gender, education level), purchasing habits, and website browsing information. However, in the hypercompetitive world of personalization, this surface-level data is not enough.

Brands need to move beyond knowing who the customer is and understand how the customer behaves.

Knowing that your customer is a recent college graduate who lives in New York City and spends a lot of time making Pinterest boards will not be enough information to create a strong buyer persona to achieve a unique and pleasant user experience.

Enhancing your customer’s profile will require you to collect relevant data about how your customer interacts with your brand on all channels, what motivates them to purchase, and what makes them tick on top of knowing who they are.

More specifically, robust personalized data will help you better understand:

  • Location and demographics
  • Interests and hobbies
  • Shopping and purchasing habits
  • Device and channel frequency
  • Where and how they prefer to shop and purchase
  • Satisfaction level
  • Likes and dislikes

All of this information will allow you to create a sophisticated customer profile. Understanding their motivations, preferences, and expectations helps you group users into precise market segments so you can give them the best possible experience.

Ideally, the customer will have a positive experience and feel unique based on the information derived from the robust data collection.

How do you find user data?

Extensive data can be found and refined by cross-indexing information stored on separate databases.

For example, you can harvest personalized data from a customer’s interactions with your business by analyzing and storing comments on social media sites, ratings on review sites, mobile app usage vs. desktop usage, customer service interactions, download requests, and more.

How to leverage one-to-one personalization with personalized data

As you can see, personalization cannot exist without data. To achieve one-to-one personalization on your digital channels, your brand must have the ability to transform the collected data into action.

After monitoring and gathering rich data on your customer’s interactions, history, and behavior on your site, it’s time to convert this personalized data into a refined customer buyer persona to serve your customers better.

By segmenting your profiles, you will be able to better understand your customers' preferences and pain points, which will help you craft personalized messages and display them at the right time. The sketch below illustrates the idea.
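As a minimal illustration of this step, here is a toy rule-based segmentation. The segment names, rules, and messages are invented for the example and are not an AB Tasty feature:

```python
# Toy example: turn collected attributes into a segment, then pick a message.
# Segment names, rules, and messages are invented for illustration only.
def segment(profile: dict) -> str:
    if profile.get("purchases", 0) >= 5 and profile.get("newsletter"):
        return "loyal"
    if profile.get("cart_abandonments", 0) >= 2:
        return "hesitant"
    if profile.get("sessions", 0) <= 1:
        return "new_visitor"
    return "browsing"

MESSAGES = {
    "loyal": "Welcome back! Your usual size is already selected.",
    "hesitant": "Still thinking it over? Your cart is saved for you.",
    "new_visitor": "Welcome! Here are our most popular picks.",
    "browsing": "Based on what you viewed, you might like these.",
}

visitor = {"purchases": 7, "newsletter": True, "sessions": 12}
print(MESSAGES[segment(visitor)])  # -> the "loyal" message
```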

How to personalize interactions with customers:

Once you have substantial personalized data collected about your visitors, you can determine the best way to interact with them. There is a fine line between being helpful by displaying personalized messages and being invasive.

The difference between these two feelings will depend on how much prior engagement the customer has had with you. For example, a customer who is subscribed to every newsletter, holds a company discount card, and frequently completes transactions on your website will expect you to know their preferences fairly well, like a regular walking into a coffee shop. On the other hand, a first-time visitor won't expect you to know much about them, but they will expect to be welcomed.

The best way to understand how to serve your customers is by asking yourself how you would want to be interacted with at their level of engagement with your brand. What would make you feel welcomed and what would make you feel overwhelmed or uneasy?

What messages should you personalize?

The possibilities for personalized messages can stretch as far as your mind (or your software capabilities) will allow.

Think about personalization in a broad sense. Let’s say a company wants to put its logo onto personalized gifts for its employees. The company’s logo can be put onto t-shirts, pens, stickers, coffee mugs, phone cases, backpacks, sunglasses, golf balls, holiday baskets– the possibilities are nearly endless. The same goes for personalized messages for your own customers.

In marketing communication, some of the most common outlets for 1:1 personalization are:

  • Product recommendations
  • Emails (subject lines and content)
  • Intro and exit banners
  • Pop-up messages
  • Conversational marketing (chat boxes)
  • Offers and discounts
  • Language
  • Landing pages
  • Pricing
  • Greetings

To attract and retain your customers' attention in a market saturated with messages, your brand should focus on personalization as much as possible and on as many channels as you can.

What platform to use for one-to-one personalization in marketing?

The journey to a seamless one-to-one personalization, or one-to-one marketing, experience for your customers starts with sophisticated and intuitive software to help transform your ideas into reality.

AB Tasty is the complete platform for experimentation and personalization equipped with the tools you need to create a richer digital experience for your customers — fast. With embedded AI and automation, this platform can help you achieve omnichannel personalization and revolutionize your brand and product experiences.

What is omnichannel personalization?

In marketing, employing one-to-one personalization across multiple channels, platforms, and touchpoints is commonly referred to as omnichannel personalization.

Customers crave personalization wherever they are – on a mobile device, desktop, social media platform, mobile app, or email. When customers receive a personalized experience, they expect this standard of communication across all channels or platforms that they are interacting with.

Achieving omnichannel personalization requires a seamless flow of customer data from one platform or channel to the next. By gathering information on user preferences, behavior, and interests from all virtual touchpoints, your customer’s profile strengthens.

By receiving this consistent level of personalization across all channels, consumers will be inclined to purchase more and to purchase again from the same brand that made them feel seen and heard.

What are the advantages of omnichannel personalization?

  1. Higher conversion rates
  2. Increased average order value (AOV)
  3. Reduced cart abandonment
  4. Improved brand value and customer loyalty
  5. Higher customer lifetime value
  6. Delivering messages at the right time and place

8 Examples of 1-1 Personalization strategies from retail brands

1. ASOS’s Social Connection

Online retailer ASOS prides itself on offering both new and existing customers a range of personalized discounts and deals, which vary depending on whether:

  • It's a new customer
  • It's a returning customer who has demonstrated a particular interest (e.g. shoes)
  • It's a regular customer (who could then be offered premium next-day delivery, for example)

But how does ASOS get this information? One method they might use is encouraging customers to log in to the site using social media platforms, which would allow ASOS to access further details such as age, gender, and location—which can then be used to tailor even more personalized messages.

Why it works: The ability to use a social platform for account creation makes the process simple for shoppers, while giving ASOS more insight into what deals or promotions would be of the most interest to them.

2. Nordstrom Remembers Your Size

Nordstrom gave its online shopping cart a simple yet effective personal touch: remembering returning customers' clothing sizes. This may not seem like a major personalization effort, but it creates a more seamless checkout for the user and brings them one step closer to the purchase. It's a rather clever move from Nordstrom that hasn't gone unnoticed.

Why it works: Remembering the customers’ preferred size (based on previous purchases) instantly shows the brand’s attentiveness while making checkout even more simple.

3. Clarins personalization and gamification

Before the booming holiday season, Clarins, a multinational cosmetics company, saw an 89% increase in their conversion rate and a 145% increase in the add-to-basket metric by implementing 1:1 personalization and gamification with AB Tasty.

On Singles' Day, a few weeks before Black Friday, Clarins saw a perfect opportunity to exercise its test-and-learn culture by implementing a "Wheel of Fortune" concept in certain countries. The gamification gifts were personalized according to each country's local culture. Any visitor arriving at the website could play the digital game, spin the wheel, and automatically receive a gift in their inbox. This automated delivery made for a great user experience, especially for mobile visitors.

Read the full story here: How Clarins Uses AB Tasty for Personalization and Retention

4. Amazon’s ‘Recommended For You’ Approach

Amazon is no stranger to personalization marketing. In fact, it could be argued they were the first major e-commerce retailer to really put personalization into action. The company has become known for its product recommendation emails and personalized homepages for logged-in customers. Using their own algorithm, A9, Amazon goes above and beyond to first understand customers’ buying habits and then deliver an experience that’s been deliberately designed for relevance. 

Why it works: Customers feel valued and understood by the retailer when seeing emails and recommended “picks” that are tailored to their interests. Consistency also plays a part in Amazon’s approach, as they continue to deliver an even more granular personalized approach for customers.

5. Nike and Their Customized Approach

Nike always goes the extra mile to personalize the shopping experience, as we've seen with their SNKRS app, which gives premium shoppers (Nike+ loyalty members) access to a large catalog of products that they can then customize. It's the perfect way to cement customer loyalty by offering them the unique opportunity to tailor items to their exact liking.

Why it works: By giving customers a certain degree of autonomy with design, Nike is giving customers the freedom to express their individuality, even while the company continues to produce the same style of shoe around the world. Despite being a huge brand, Nike has created a great loyalty program that engages customers and stokes their excitement about buying Nike products.

6. Net-A-Porter’s Personalized Touch

Luxury online retailer Net-A-Porter has adopted the ‘recommended for you’ approach but with a unique twist to appeal to its high-end customers who want a more premium service when they shop. The company gives away freebie products to customers based on previous purchases, adding a personal touch to an otherwise standard online shopping experience. This is not unlike Amazon’s recommended emails, except Net-A-Porter customers receive a physical product — and who doesn’t like a gift!

Why it works: These gifts show the appreciation Net-A-Porter has for its customers and help to bring the luxury shopping experience online.

7. Coca-Cola’s Name Campaign

In 2011, Coca-Cola launched its Share a Coke campaign in Australia, printing thousands of names on their diet and original soft drinks. This simple yet effective campaign made sales skyrocket, supporting the notion that consumers engage with brands that address them by their first name (albeit in a rather broad sense!). Personalized bottles became all the rage, with people trying to find their own names along with those of their friends and family members. The campaign was globally recognized and started the ball rolling for other brands such as Marmite, which also saw great success with a naming campaign.

Why it works: Is it the simple notion of vanity that makes these name campaigns so popular? Consumers love to see their own names on popular products, making them almost ‘gimmicky’ with a collectible edge that makes people feel special!

8. Target’s Guest ID

The US retail giant Target decided to up its personalized campaign game by assigning each customer a guest identification number on their first interaction with the brand. Target then used the data to obtain customer details like buying behavior and even job history! Target used personalized data to understand the consumer habits of its customers and to create a view of their individual lifestyles. Target focused particularly on customers who also had a baby registry with them and even used their marketing data to make ‘pregnancy predictability scores’ for customers who were browsing particular items!

Why it works: Arguably, delivering a personalized experience for every customer visiting a physical store is a tough job for any retailer. By assigning a ‘guest ID’, Target was able to understand buying behaviors and patterns from their customers in-store and use the information to make suggestions on products they may be interested in.

Everyone wins with one-to-one personalization

The data you collect equally benefits your brand and your customers. By understanding what your customers are looking for, you save them time by providing them with informed recommendations, personalized messages, and unique experiences to solve their pain points.

Without proper data collection or genuine segmentation, it’s nearly impossible to provide users with a 1:1 personalized experience. Loyal customers want to feel like their brand really knows them and what they’re looking for. Achieve one-to-one personalized experiences by correctly analyzing and leveraging personalized data. If you’re looking to serve your customers, increase sales, and build brand loyalty at the same time, you’ve found your blueprint with personalization.

Mutually Exclusive Experiments: Preventing the Interaction Effect
https://www.abtasty.com/blog/mutually-exclusive-experiments/

What is the interaction effect?

If you’re running multiple experiments at the same time, you may find their interpretation to be more difficult because you’re not sure which variation caused the observed effect. Worse still, you may fear that the combination of multiple variations could lead to a bad user experience.

It’s easy to imagine a negative cumulative effect of two visual variations. For example, if one variation changes the background color, and another modifies the font color, it may lead to illegibility. While this result seems quite obvious, there may be other negative combinations that are harder to spot.

Imagine launching an experiment that offers a price reduction for loyal customers, whilst in parallel running another that aims to test a promotion on a given product. This may seem like a non-issue until you realize that there’s a general rule applied to all visitors, which prohibits cumulative price reductions – leading to a glitch in the purchase process. When the visitor expects two promotional offers but only receives one, they may feel frustrated, which could negatively impact their behavior.

What is the level of risk?

With the previous examples in mind, you may think that such issues could be easily avoided. But it’s not that simple. Building several experiments on the same page becomes trickier when you consider code interaction, as well as interactions across different pages. So, if you’re interested in running 10 experiments simultaneously, you may need to plan ahead.

A simple solution would be to run these tests one after the other. However, this strategy is very time-consuming, as a typical experiment needs to run for two weeks in order to sample each day of the week twice.

It's not uncommon for a large company to have 10 experiments in the pipeline, and running them sequentially would take at least 20 weeks. A better solution is to handle the traffic allocated to each test in a way that makes the experiments mutually exclusive.

This may sound similar to a multivariate test (MVT), except the goal of an MVT is almost the opposite: to find the best interaction between unitary variations.

Let's say you want to explore the effect of two variation ideas: text color and background color. The MVT will compose all combinations of the two and expose them simultaneously to isolated chunks of the traffic. The isolation part sounds promising, but "all combinations" is exactly what we're trying to avoid: the combination where the text and background share the same color will inevitably occur. So an MVT is not the solution here.

Instead, we need a specific feature: A Mutually Exclusive Experiment.

What is a Mutually Exclusive Experiment (M2E)?

AB Tasty’s Mutually Exclusive Experiment (M2E) feature enacts an allocation rule that blocks visitors from entering selected experiments depending on the previous experiments already displayed. The goal is to ensure that no interaction effect can occur when a risk is identified.

How and when should we use Mutually Exclusive Experiments?

We don't recommend setting up all experiments to be mutually exclusive because it reduces the number of visitors for each experiment. This means it will take longer to reach significant results, and the tests will have less power to detect real effects.

The best process is to identify the different kinds of interactions you may have and compile them in a list. If we continue with the cumulative promotion example from earlier, we could create two M2E lists: one for user interface experiments and another for customer loyalty programs. This strategy will avoid negative interactions between experiments that are likely to overlap, but doesn’t waste traffic on hypothetical interactions that don’t actually exist between the two lists.
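To make the mechanism concrete, here is one common way to implement mutual exclusion within a group of experiments, using a deterministic hash of the visitor ID. This is an illustrative sketch, not a description of AB Tasty's internal implementation, and the group and experiment names are invented:

```python
# One common way (not necessarily AB Tasty's) to make experiments in the same
# group mutually exclusive: hash the visitor ID once per group, and let that
# hash pick the only experiment the visitor may enter from that group.
import hashlib

M2E_GROUPS = {  # group and experiment names are invented for illustration
    "ui":      ["new_checkout", "social_proof_banner", "sticky_header"],
    "loyalty": ["member_discount", "flash_promo"],
}

def eligible_experiment(visitor_id: str, group: str) -> str:
    experiments = M2E_GROUPS[group]
    digest = hashlib.md5(f"{group}:{visitor_id}".encode()).hexdigest()
    return experiments[int(digest, 16) % len(experiments)]

# The same visitor can enter one "ui" experiment and one "loyalty" experiment,
# but never two experiments from the same group.
print(eligible_experiment("visitor-42", "ui"))
print(eligible_experiment("visitor-42", "loyalty"))
```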

What about data quality?

With the help of an M2E, we have prevented any functional issues that may arise due to interactions, but you might still have concerns that the data could be compromised by subtle interactions between tests.

Would an upstream winning experiment induce false discovery on downstream experiments? Alternatively, would a bad upstream experiment make you miss an otherwise downstream winning experiment? Here are some points to keep in mind:

  • Remember that roughly eight tests out of 10 are neutral (show no effect), so most of the time you can’t expect an interaction effect – if no effect exists in the first place.
  • In the case where an upstream test does have an effect, the affected visitors are still randomly assigned to the downstream variations. This evens out the effect and lets the downstream experiment correctly measure its potential lift, as the sketch below illustrates. It's worth noting that the average conversion rate following an impactful upstream test will be different, but this does not prevent the downstream experiment from correctly measuring its own impact.
  • Remember that the statistical test is there to account for any drift in the random split process. The drift we're referring to here is the possibility that more of the upstream test's impacted visitors end up in a given downstream variation, creating the illusion of an effect. The gain probability estimation and the confidence interval around the measured effect already tell you that there is some randomness in the process. In fact, the upstream test is just one example among a long list of possible interfering factors, such as visitors using different computers, different connection quality, and so on.
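Here is a small Monte Carlo sketch of the point about random downstream assignment: an upstream test with a genuine effect does not create a fake lift in a downstream experiment, because its impacted visitors are split at random across the downstream variations (the rates used are arbitrary):

```python
# Quick simulation: a real upstream effect does not bias a downstream A/A test,
# because impacted visitors are randomly split across downstream variations.
import random

random.seed(0)
N = 200_000
downstream = {"A": [0, 0], "B": [0, 0]}  # arm -> [conversions, visitors]

for _ in range(N):
    upstream_rate = 0.06 if random.random() < 0.5 else 0.05  # real upstream effect
    arm = "A" if random.random() < 0.5 else "B"              # downstream split, no effect
    downstream[arm][1] += 1
    if random.random() < upstream_rate:
        downstream[arm][0] += 1

for arm, (conversions, visitors) in downstream.items():
    print(f"downstream {arm}: {conversions / visitors:.4f}")
# Both rates land around 0.055; any residual gap is ordinary sampling noise,
# which the downstream statistical test already accounts for.
```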

All of these theoretical explanations are supported by an empirical study from the Microsoft Experiment Platform team. This study reviewed hundreds of tests on millions of visitors and saw no significant difference between effects measured on visitors that saw just one test and visitors that saw an additional upstream test.

Conclusion

While experiment interaction is possible in a specific context, there are preventative measures that you may take to avoid functional loss. The most efficient solution is the Mutually Exclusive Experiment, allowing you to eliminate the functional risks of simultaneous experiments, make the most of your traffic and expedite your experimentation process.

References:

https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/a-b-interactions-a-call-to-relax/

 

The Truth Behind the 14-Day A/B Test Period
https://www.abtasty.com/blog/truth-behind-the-14-day-ab-test-period/

The A/B testing method involves a simple process: create two variations, expose them to your customer, collect data, and analyze the results with a statistical formula. 

But, how long should you wait before collecting data? With 14 days being standard practice, let’s find out why as well as any exceptions to this rule.

Why 14 days?

To answer this question, we need to understand what we are fundamentally doing: collecting current data within a short window in order to forecast what could happen over a more extended period in the future. To keep this article simple, we will focus only on the rules that relate to this principle. Other rules do exist, mostly relating to the number of visitors, but those can be addressed in a future article.

The forecasting strategy relies on the collected data containing samples of all event types that may be encountered in the future. This is impossible to fulfill in practice, as periods like Christmas or Black Friday are exceptional events relative to the rest of the year. So let’s focus on the most common period and set aside these special events that merit their own testing strategies.

If the future we are considering relates to “normal” times, our constraint is to sample each day of the week uniformly, since people do not behave the same on different days. Simply look at how your mood and needs shift between weekdays and weekends. This is why a data sampling period must include entire weeks, to account for fluctuations between the days of the week. Likewise, if you sample eight days for example, one day of the week will have a doubled impact, which doesn’t realistically represent the future either.

This partially explains the two-week sampling rule, but why not a longer or shorter period? Since one week covers all the days of the week, why isn’t it enough? To understand, let’s dig a little deeper into the nature of conversion data, which has two dimensions: visits and conversions.

  • Visits: as soon as an experiment is live, every new visitor increments the number of visits.
  • Conversions: as soon as an experiment is live, every new conversion increments the number of conversions.

It sounds pretty straightforward, but there is a twist: statistical formulas work with the concept of success and failure. The definition is quite easy at first: 

  • Success: the number of visitors that did convert.
  • Failures: the number of visitors that didn’t convert.

At any given time a visitor may be counted as a failure, but this could change a few days later if they convert, or the visit may remain a failure if the conversion didn’t occur. 

So consider these two opposing scenarios: 

  • A visitor begins their buying journey before the experiment starts. During the first days of the experiment, they come back and convert. This is counted as a "success", but in fact they may not have had time to be influenced by the variation, because the buying decision was made before they saw it. The problem is that we are potentially counting a false success: a conversion that could have happened without the variation.
  • A visitor begins their buying journey during the experiment, so they see the variation from the beginning, but they don't make a final decision before the end of the experiment – finally converting after it finishes. We miss this conversion from a visitor who saw the variation and was potentially influenced by it.

These two scenarios may cancel each other out since they have opposite effects, but that is only true if the sample period exceeds the usual buying journey time. Consider a naturally long conversion journey, like buying a house, measured within a very short experiment period of one week. Clearly, no visitor beginning their buying journey during the experiment period would have time to convert. The conversion rates of these visitors would be artificially close to zero, so no proper measurement could be done in this context. In fact, the only conversions you would see are the ones from visitors who began their journey before the variation even existed. Therefore, the experiment would not be measuring the impact of the variation.

In other words, the delay between exposure to the variation and the conversion artificially deflates the measured conversion rate. To mitigate this problem, the experiment period has to be twice as long as the standard conversion journey. Doing so ensures that visitors entering the experiment during the first half will have time to convert. You can expect that people who began their journey before the experiment and people entering during the second half of the experiment period will cancel each other out: the first group will contain conversions that should not be counted, and some of the second group's conversions will be missing. However, a majority of genuine conversions will be counted.

That’s why a typical buying journey of one week results in a two-week experiment, offering the right balance in terms of speed and accuracy of the measurements.

Exceptions to this rule

A 14-day experiment period doesn’t apply to all cases. If the delay between the exposed variation and the conversion is 1.5 weeks for instance, then your experiment period should be three weeks, in order to cover the usual conversion delay twice. 

On the other hand, if you know that the delay is close to zero, such as on a media website where you are trying to optimize the placement of an advertisement frame on a page visitors only stay on for a few minutes, you may think that one day would be enough based on this logic, but it's not.

The reason being that you would not sample every day of the week, and we know from experience that people do not behave the same way throughout the week. So even in a zero-delay context, you still need to conduct the experiment for an entire week.
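This rule of thumb can be written as a tiny helper. The function below is ours, not an AB Tasty API; it simply encodes "cover the typical conversion delay twice and round up to whole weeks, with a minimum of one full week":

```python
import math

def recommended_test_weeks(conversion_delay_days: float) -> int:
    """Cover the typical conversion delay twice, rounded up to whole weeks,
    with a minimum of one full week to sample every day of the week."""
    return max(1, math.ceil(2 * conversion_delay_days / 7))

print(recommended_test_weeks(7))     # 1-week buying journey   -> 2 weeks
print(recommended_test_weeks(10.5))  # 1.5-week buying journey -> 3 weeks
print(recommended_test_weeks(0))     # near-instant conversion -> still 1 full week
```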

Takeaways: 

  1. Your test period should mirror the conditions of your expected implementation period.
  2. Sample each day of the week in the same way.
  3. Wait an integer number of weeks before closing an A/B test.

Respecting these rules will ensure that you’ll have clean measures. The accuracy of the measure is defined by another parameter of the experiment: the total number of visitors. We’ll address this topic in another article – stay tuned.

Analytics Reach New Heights With Google BigQuery + AB Tasty
https://www.abtasty.com/blog/analytics-new-heights-google-bigquery-ab-tasty/

AB Tasty and Google BigQuery have joined forces to provide seamless integration, enabling customers with extensive datasets to access insights, automate, and make data-driven decisions to push their experimentation efforts forward.

We have often discussed the complexity of understanding data to power your experimentation program. When companies are dealing with massive datasets they need to find an agile and effective way to allow that information to enrich their testing performance and to identify patterns, trends, and insights.

Go further with data analytics

Google BigQuery is a fully managed cloud data warehouse solution, which enables quick storage and analysis of vast amounts of data. This serverless platform is highly scalable and cost-effective, tailored to support businesses in analyzing extensive datasets for making well-informed decisions. 

With Google BigQuery, users can effortlessly execute complex analytical SQL queries, leveraging its integrated machine-learning capabilities.

This integration with AB Tasty’s experience optimization platform means customers with large datasets can use BigQuery to store and analyze large volumes of testing data. By leveraging BigQuery’s capabilities, you can streamline data analysis processes, accelerate experimentation cycles, and drive innovation more effectively.

Here are some of the many benefits of Google BigQuery’s integration with AB Tasty to help you trial better:

  • BigQuery as a data source

With AB Tasty's integration, specific data from AB Tasty can be sent regularly to your BigQuery dataset. Each Data Ingestion Task has a name, an SQL query to select what you need, and a scheduled frequency for data retrieval. This information helps you build highly targeted ads and messages, making it easier to reach the right people.

  • Centralized storage of data from AB Tasty

The AB Tasty and BigQuery integration also simplifies campaign analysis by eliminating the need for separate SQL or BI tools. The AB Tasty dashboard displays a clear comparison of metrics on a single page, enhancing efficiency. You can also leverage BigQuery for experiment analysis without duplicating reporting in AB Tasty, getting the best of both platforms: incorporate complex metrics and segments by querying the enriched events dataset, and link event data with critical business data from other platforms (see the sketch after this list). Whether through web or feature experimentation, this means more accurate experiments at scale to drive business growth and success.

  • Machine learning

BigQuery can also be used for machine learning on experimentation programs, helping you to predict outcomes and better understand your specific goals. BigQuery gives you AI-driven predictive analytics for scaling personalized multichannel campaigns, free from attribution complexities or uncertainties. Access segments that dynamically adjust to real-time customer behavior, unlocking flexible, personalized, and data-driven marketing strategies to feed into your experiments.

  • Enhanced segmentation and comprehensive insight

BigQuery's ability to capture behavior means that you can segment better. Its data segmentation allows you to categorize users based on various attributes or behaviors. With the data sent to BigQuery from experiments, you can create personalized content or features tailored to specific user groups, optimizing engagement and conversion rates.

Finally, the massive benefit of this integration is to get joined-up reporting – fully automated and actionable reports on experimentation, plus the ability to feed data from other sources to get the full picture.
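As a rough sketch of what querying centralized experiment data can look like, here is a short example using the official google-cloud-bigquery Python client. The dataset, table, and column names are placeholders invented for the example, not the actual schema of the AB Tasty export:

```python
# Hypothetical example: aggregating experiment events stored in BigQuery.
# `my_project.abtasty.enriched_events` and its columns are placeholder names,
# not the real schema of the AB Tasty export.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials

query = """
    SELECT
      campaign_id,
      variation_id,
      COUNT(DISTINCT visitor_id) AS visitors,
      COUNTIF(event_type = 'transaction') AS transactions
    FROM `my_project.abtasty.enriched_events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
    GROUP BY campaign_id, variation_id
"""

for row in client.query(query).result():
    print(row.campaign_id, row.variation_id, row.visitors, row.transactions)
```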

A continued partnership

This integration comes after Google named AB Tasty an official Google Cloud Partner last year, making us available on the Google Cloud Marketplace to streamline marketplace transactions. We are also fully integrated with Google Analytics 4, and we were thrilled to be named one of Google's preferred vendors for experimentation after the Google Optimize sunset.


As we continue to work closely with the tech giant to help our customers continue to grow, you can find out more about this integration here.

How to Better Handle Collateral Effects of Experimentation: Dynamic Allocation vs Sequential Testing
https://www.abtasty.com/blog/dynamic-allocation-sequential-testing/

When talking about web experimentation, the topics that often come up are learning and earning. However, it’s important to remember that a big part of experimentation is encountering risks and losses. Although losses can be a touchy topic, it’s important to talk about and destigmatize failed tests in experimentation because it encourages problem-solving, thinking outside of your comfort zone and finding ways to mitigate risk. 

Therefore, we will take a look at the shortcomings of classic hypothesis testing and look into other options. Basic hypothesis testing follows a rigid protocol: 

  • Creating the variation according to the hypothesis
  • Waiting a given amount of time 
  • Analyzing the result
  • Decision-making (implementing the variant, keeping the original, or proposing a new variant)

This rigid protocol and simple approach to testing says nothing about how to handle losses, which raises the question: what happens if something goes wrong? Additionally, the classic statistical tools used for analysis are not meant to be used before the end of the experiment.

If we consider a very general rule of thumb, let’s say that out of every 10 experiments, 8 will be neutral (show no real difference), one will be positive, and one will be negative. Practicing classic hypothesis testing suggests that you just accept that as a collateral effect of the optimization process hoping to even it out in the long term. It may feel like crossing a street blindfolded.

For many, that may not cut it. Let’s take a look at two approaches that try to better handle this problem: 

  • Dynamic allocation – also known as “Multi Armed Bandit” (MAB). This is where traffic allocation changes for each variation according to their performance, implicitly lowering the losses.
  • Sequential testing – a method that allows you to stop a test as soon as possible, given a risk aversion threshold.

These approaches are statistically sound but they come with their assumptions. We will go through their pros and cons within the context of web optimization.

First, we’ll look into the classic version of these two techniques and their properties and give tips on how to mitigate some of their problems and risks. Then, we’ll finish this article with some general advice on which techniques to use depending on the context of the experiment.

Dynamic allocation (DA)

Dynamic allocation’s main idea is to use statistical formulas that modify the amount of visitors exposed to a variation depending on the variation’s performance. 

This means a poor-performing variation will end up receiving little traffic, which can be seen as a way to save conversions while still searching for the best-performing variation. The formulas ensure the best compromise between avoiding losses and finding the real best-performing variation. However, this implies a lot of assumptions that are not always met, which makes DA a risky option.
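To give a feel for how such formulas work, here is a sketch of Thompson sampling, one classic multi-armed bandit allocation scheme. It is shown purely for illustration and is not a description of AB Tasty's own algorithm; note how it implicitly assumes the conversion outcome is known as soon as the visit is recorded, which is exactly the assumption discussed below:

```python
# Thompson sampling: one classic dynamic-allocation scheme, shown only to
# illustrate the idea (not a description of AB Tasty's algorithm).
import random

successes = {"original": 0, "variation": 0}
failures  = {"original": 0, "variation": 0}

def pick_arm() -> str:
    # Draw a plausible conversion rate for each arm from its Beta posterior
    # and send the next visitor to the arm with the highest draw.
    draws = {
        arm: random.betavariate(successes[arm] + 1, failures[arm] + 1)
        for arm in successes
    }
    return max(draws, key=draws.get)

def record(arm: str, converted: bool) -> None:
    if converted:
        successes[arm] += 1
    else:
        failures[arm] += 1  # counted as a failure the moment the visit is over

# Simulated traffic: the variation truly converts better, so it gradually
# receives a larger share of visitors.
random.seed(0)
for _ in range(5_000):
    arm = pick_arm()
    true_rate = 0.06 if arm == "variation" else 0.05
    record(arm, random.random() < true_rate)

print({arm: successes[arm] + failures[arm] for arm in successes})
```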

There are two main concerns, both of which are linked to the time aspect of the experimentation process: 

  • The DA formula does not take time into account 

If there is a noticeable delay between exposure to the variation and the conversion, the algorithm can go wrong: a visitor is considered a 'failure' until they convert, so during the time between their visit and their eventual conversion they are falsely counted as a failure.

The DA formula will therefore work with the wrong conversion information: any variation gaining traffic will automatically show a (false) performance drop, because the algorithm detects a growing number of not-yet-converted visitors. As a result, traffic to that variation will be reduced.

The reverse may also be true: a variation with decreasing traffic will no longer have any new visitors while existing visitors of this variation could eventually convert. In that sense, results would indicate a (false) rise in conversions even when there are no new visitors, which would be highly misleading.

DA gained popularity within the advertising industry where the delay between an ad exposure and its potential conversion (a click) is short. That’s why it works perfectly well in this context. The use of Dynamic Allocation in CRO must be done in a low conversion delay context only.

In other words, DA should only be used in scenarios where visitors convert quickly. It’s not recommended for e-commerce except for short-term campaigns such as flash sales or when there’s not enough traffic for a classic AB test. It can also be used if the conversion goal is clicking on an ad on a media website.

  • DA and the different days of the week 

It’s very common to see different visitor behavior depending on the day of the week. Typically, customers may behave differently on weekends than during weekdays.  

With DA, you may be sampling days unevenly, implicitly giving more weight to some days for some variations. However, each day should carry the same weight because, in reality, each weekday occurs equally often. You should only use Dynamic Allocation if you know that the optimized KPI is not sensitive to fluctuations across the week.

The conclusion is that DA should be considered only when you expect too few total visitors for classic A/B testing. Another requirement is that the KPI under experimentation needs a very short conversion time and no dependence on the day of the week. Taking all this into account: Dynamic Allocation should not be used as a way to secure conversions.

Sequential Testing (ST)

Sequential Testing uses a specific statistical formula that allows you to stop an experiment early, based on the performance of the variations, with guarantees on the risk of false positives.

The Sequential Testing approach is designed to secure conversions by stopping a variation as soon as its underperformance is statistically proven. 
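For intuition, here is a textbook sequential probability ratio test (SPRT) sketch that monitors a single conversion stream against a fixed baseline rate. Production sequential tests compare two live variations and are more elaborate, but the early-stopping logic is of the same flavor; all the numbers below are arbitrary:

```python
# Textbook SPRT sketch: monitor one conversion stream and stop early if the
# data strongly favors a degraded rate (p1) over the baseline (p0).
import math
import random

p0, p1 = 0.050, 0.040        # baseline rate vs. the degradation we want to catch
alpha, beta = 0.05, 0.20     # tolerated false-positive / false-negative rates

upper = math.log((1 - beta) / alpha)  # cross this -> conclude "underperforming"
lower = math.log(beta / (1 - alpha))  # cross this -> conclude "no degradation"

def sprt(conversion_stream) -> str:
    llr = 0.0  # accumulated log-likelihood ratio of H1 (p1) vs H0 (p0)
    for i, converted in enumerate(conversion_stream, start=1):
        llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return f"stop at visitor {i}: variation is underperforming"
        if llr <= lower:
            return f"stop at visitor {i}: no evidence of degradation"
    return "keep collecting data"

random.seed(1)
print(sprt(random.random() < 0.04 for _ in range(50_000)))
```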

However, it still has some limitations. The estimated effect size may be wrong in two senses:

  • Bad variations will be seen as worse than they really are. It’s not a problem in CRO because the false positive risk is still guaranteed. This means that in the worst-case scenario, you will discard not a strictly losing variation but maybe just an even one, which still makes sense in CRO.
  • Good variations will be seen as better than they really are. It may be a problem in CRO since not all winning variations are useful for business. The effect size estimation is key to business decision-making. This can easily be mitigated by using sequential testing to stop losing variations only. Winning variations, for their part, should be continued until the planned end of the experiment, ensuring both correct effect size estimation and an even sampling for each day of the week.
    It’s important to note that not all CRO software use this hybrid approach. Most of them use ST to stop both winning and losing variations, which is wrong as we’ve just seen.

As we’ve seen, by stopping a losing variation in the middle of the week, there’s a risk you may be discarding a possible winning variation. 

However, for a variation to actually end up winning after ST has shown it to be underperforming, it would first need to perform well enough to catch up with the reference, and then well enough to outperform it – all within a few days. This scenario is highly unlikely.

Therefore, it’s safe to stop a losing variation with Sequential Testing, even if all weekdays haven’t been evenly sampled.

The best of both worlds in CRO 

Dynamic Allocation is a better approach than static allocation when you expect a small volume of traffic. It should be used only in the context of a short-delay KPI and with no known weekday effect (for example, flash sales). However, it's not a way to mitigate risk in a CRO strategy.

To be able to run experiments with all the needed guarantees, you need a hybrid system using Sequential Testing to stop losing variations and a classic method to stop a winning variation. This method will allow you to have the best of both worlds.

 

Harmony or Dissonance: Decoding Data Divergence Between AB Tasty and Google Analytics
https://www.abtasty.com/blog/ab-tasty-google-analytics/

The world of data collection has grown exponentially over the years, providing companies with crucial information to make informed decisions. However, within this complex ecosystem, a major challenge arises: data divergence. 

Two analytics tools, even if they seem to be following the same guidelines, can at times produce different results. Why do they differ? How do you leverage both sets of data for your digital strategy?

In this article, we’ll use a concrete example of a user journey to illustrate differences in attribution between AB Tasty and Google Analytics. GA is a powerful tool for gathering and measuring data across the entire user journey. AB Tasty lets you easily make changes to your site and measure the impact on specific goals. 

Navigating these differences in attribution strategies will explain why there can be different figures across different types of reports. Both are important to look at and which one you focus on will depend on your objectives:

  • Specific improvements in cross-session user experiences 
  • Holistic analysis of user behavior

Let’s dive in! 

Breaking it down with a simple use case

We're going to base our analysis on a deliberately simple use case: the journey of a single visitor.

Campaign A is launched before the visitor's first session and remains live until after the visitor's third session.

Here’s an example of the user journey we’ll be looking at in the rest of this article: 

  • Session 1:  first visit, Campaign A is not triggered (the visitor didn’t match all of the targeting conditions)
  • Session 2:  second visit, Campaign A is triggered (the visitor matched all of the targeting conditions)
  • Session 3:  third visit, no re-triggering of Campaign A which is still live, and the user carries out a transaction.

NB: A visitor triggers a campaign as soon as they meet all the targeting conditions: 

  • They meet the segmentation conditions
  • During their session, they visit at least one of the targeted pages 
  • They meet the session trigger condition.

In A/B testing, a visitor exposed to a variation of a specific test will continue to see the same variation in future sessions, as long as the test campaign is live. This guarantees reliable measurement of potential changes in behavior across all sessions.
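One standard way to achieve this kind of sticky assignment is to derive the variation deterministically from the visitor ID stored in the cookie. The sketch below is illustrative only, not AB Tasty's documented mechanism:

```python
# Deterministic, sticky variation assignment: the same visitor ID always maps
# to the same variation, session after session (illustrative sketch only).
import hashlib

def assigned_variation(visitor_id: str, campaign_id: str,
                       variations=("original", "variation_1")) -> str:
    digest = hashlib.sha256(f"{campaign_id}:{visitor_id}".encode()).hexdigest()
    return variations[int(digest, 16) % len(variations)]

print(assigned_variation("visitor-42", "campaign-A"))  # same result...
print(assigned_variation("visitor-42", "campaign-A"))  # ...on every call
```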

We will now describe how this user journey will be taken into account in the various AB Tasty and GA reports. 

Analysis in AB Tasty

In AB Tasty, there is only one report and therefore only one attribution per campaign.

The user journey above will be reported as follows for Campaign A:

  • Total Users (Unique visitors) = 1, based on a unique user ID contained in a cookie; here there is only one user in our example.
  • Total Session = 2, s2 and s3, which are the sessions that took place during and after the display of Campaign A, are taken into account even if s3 didn’t re-trigger campaign A
  • Total Transaction = 1, the s3 transaction will be counted even if s3 has not re-triggered Campaign A.

In short, AB Tasty will collect and display in the Campaign A report all of the visitor's sessions and events from the moment the visitor first triggered the campaign.

Analysis in Google Analytics

The classic way to analyze A/B test results in GA is to create an analysis segment and apply it to your reports. 

However, this segment can be designed using 2 different methods, 2 different scopes, and depending on the scope chosen, the reports will not present the same data. 

Method 1: On a user segment/user scope

Here we detail the user scope, which will include all user data corresponding to the segment settings. 

In our case, the segment is built on a user-scoped condition matching the AB Tasty campaign event.

This segment will therefore include all data from all sessions of all users who, at some point during the analysis date range, have received an event with the parameter event action = Campaign A.

We can then see in the GA report for our user journey example: 

  • Total User = 1, based on a user ID contained in a cookie (like AB Tasty); here there is only one user in our example
  • Total Session = 3, s1, s2 and s3 which are the sessions created by the same user entering the segment and therefore includes all their sessions
  • Total Transaction = 1, transaction s3 will be counted as it took place in session s3 after the triggering of the campaign.

In short, in this scenario, Google Analytics will count and display all the sessions and events linked to this single visitor (over the selected date range), even those prior to the launch of Campaign A.

Method 2: On a session segment/session scope 

The second segment scope detailed below is the session scope. This includes only the sessions that correspond to the settings.

In this second case, the segment is built on a session-scoped condition on the same campaign event.

This segment will include all data from sessions that have, at some point during the analysis date range, received an event with the parameter event action = Campaign A.

As you can see, this setting will include fewer sessions than the previous one. 

In the context of our example:

  • Total User = 1, based on a user ID contained in a cookie (like AB Tasty), here there’s only one user in our example
  • Total Session = 1, only s2 triggers campaign A and therefore sends the campaign event 
  • Total Transaction = 0, the s3 transaction took place in the s3 session, which does not trigger campaign A and therefore does not send an event, so it is not taken into account. 

In short, in this case, Google Analytics will count and display all the sessions – and the events linked to these sessions – that triggered campaign A, and only these.

Attribution model

Counted in the selected timeframe, by tool and scope:

  • AB Tasty: all sessions and events that took place after the visitor first triggered Campaign A
  • Google Analytics – user scope: all sessions and events of a user who triggered Campaign A at least once during one of their sessions
  • Google Analytics – session scope: only the sessions that triggered Campaign A
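
To make these three attribution models concrete, here is a small illustrative sketch in Python that simply replays the s1/s2/s3 journey described above. The data structure and field names are invented for the example; they are not AB Tasty or GA APIs.

```python
# Toy reconstruction of the example journey; names and structure are illustrative only.
sessions = [
    {"id": "s1", "triggered_campaign": False, "transactions": 0},  # before Campaign A
    {"id": "s2", "triggered_campaign": True,  "transactions": 0},  # Campaign A displayed
    {"id": "s3", "triggered_campaign": False, "transactions": 1},  # purchase, no re-trigger
]

# AB Tasty: every session from the first time the visitor triggered the campaign
first_trigger = next(i for i, s in enumerate(sessions) if s["triggered_campaign"])
ab_tasty = sessions[first_trigger:]

# GA, user scope: the user triggered the campaign at least once, so all their sessions count
ga_user = sessions if any(s["triggered_campaign"] for s in sessions) else []

# GA, session scope: only the sessions that triggered the campaign themselves
ga_session = [s for s in sessions if s["triggered_campaign"]]

for name, scope in (("AB Tasty", ab_tasty),
                    ("GA - user scope", ga_user),
                    ("GA - session scope", ga_session)):
    transactions = sum(s["transactions"] for s in scope)
    print(f"{name:20s} sessions={len(scope)} transactions={transactions}")
# AB Tasty             sessions=2 transactions=1
# GA - user scope      sessions=3 transactions=1
# GA - session scope   sessions=1 transactions=0
```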

 

Different attribution for different objectives

Because each report uses a different attribution, we can observe different figures even though the underlying tracking is exactly the same. 

The only metric that remains essentially constant is the number of Users (Unique visitors in AB Tasty), which is calculated in a similar (but not identical) way by the two tools. It’s therefore the benchmark metric, and also the most reliable one for detecting malfunctions between A/B testing tools and analytics tools that use different calculations. 

On the other hand, the attribution of sessions or events (e.g. a transaction) can be very different from one report to another. All the more so as it’s not possible in GA to recreate a report with an attribution model similar to that of AB Tasty. 

Ultimately, A/B test performance analysis relies heavily on data attribution, and our exploration of the differences between AB Tasty and Google Analytics highlighted significant distinctions in the way these tools attribute user interactions. These divergences are the result of different designs and distinct objectives.

From campaign performance to holistic analysis: Which is the right solution for you?

AB Tasty, as a solution dedicated to the experimentation and optimization of user experiences, stands out for its more specialized approach to attribution. It offers a clear and specific view of A/B test performance, by grouping attribution data according to campaign objectives. 

Making a modification to a platform and testing it aims to measure the impact of this modification on the platform’s performance and metrics, both during the current session and during future sessions of the same user. 

On the other hand, Google Analytics focuses on the overall analysis of site activity. It’s a powerful tool for gathering data on the entire user journey, from traffic sources to conversions. However, its approach to attribution is broader, encompassing all session data, which can lead to different data cross-referencing and analysis than AB Tasty, as we have seen in our example.

It’s essential to note that one is not necessarily better than the other, but rather adapted to different needs. 

  • Teams focusing on the targeted improvement of cross-session user experiences will find significant value in the attribution offered by AB Tasty. 
  • On the other hand, Google Analytics remains indispensable for the holistic analysis of user behavior on a site.

The key to effective use of these solutions lies in understanding their differences in attribution, and the ability to exploit them in complementary ways. Ultimately, the choice will depend on the specific objectives of your analysis, and the alignment of these tools with your needs will determine the quality of your insights.

 

The post Harmony or Dissonance: Decoding Data Divergence Between AB Tasty and Google Analytics appeared first on abtasty.

How to Effectively Use Sequential Testing https://www.abtasty.com/blog/sequential-testing/ Tue, 24 Oct 2023 13:38:36 +0000 https://www.abtasty.com/?p=133982 In A/B tests where you can see the data coming in a continuous stream, it’s tempting to stop the experiment before the planned end. It’s so tempting that in fact a lot of practitioners don’t even really know why one […]


In A/B tests where you can see the data coming in a continuous stream, it’s tempting to stop the experiment before the planned end. It’s so tempting that in fact a lot of practitioners don’t even really know why one has to define a testing period beforehand. 

Some platforms have even changed their statistical tools to take this into account and have switched to sequential testing which is designed to handle tests this way. 

Sequential testing enables you to evaluate data as it’s collected to determine whether an early decision can be made, helping you cut down on A/B test duration as you can ‘peek’ at set points.

But, is this an efficient and beneficial type of testing? Spoiler: yes and no, depending on the way you use it.

Why do we need to wait for the predetermined end of the experiment? 

Planning and respecting the data collection period of an experiment is crucial. Historical techniques use “fixed horizon testing”, which establishes these guidelines for all to follow. If you do not respect this condition, then you lose the guarantee provided by the statistical framework, namely that you only have a 5% error risk when using the common decision thresholds.

Sequential testing promises that when using the proper statistical formulas, you can stop an experiment as soon as the decision threshold is crossed and still have the 5% error risk guarantee. The test used here is the sequential Z-test, which is based on the classical Z-test with an added correction to take the sequential usage into account. 
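
As an illustration only, a sequential check on conversion rates might look like the sketch below. The exact correction AB Tasty (or any given platform) applies is not reproduced here; the inflated critical value of about 2.41, a Pocock-style boundary for five equally spaced looks, is just one common textbook way to account for repeated peeking.

```python
from math import sqrt

# Illustrative only: z-test on proportions compared against an inflated critical
# value at each interim look (2.41 ~ Pocock boundary for 5 equally spaced looks).
def sequential_check(conv_a, n_a, conv_b, n_b, z_crit=2.41):
    """Return (z, stop) where stop is True if the corrected boundary is crossed."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, abs(z) > z_crit

# Example interim look on made-up counts:
print(sequential_check(conv_a=260, n_a=5_000, conv_b=330, n_b=5_000))
```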

In the following sections, we will look at two objections that are often raised when it comes to sequential testing that may put it at odds with CRO practices. 

Sequential testing objection 1: “Each day has to be sampled the same”

The first objection is that each day of the week should be sampled the same way, so that the sample represents reality. This is what happens in a classic A/B test, but the rule may be broken with sequential testing, since you can stop the test mid-week. Because behavior differs across the seven days of the week, your sampling unit should be the week, not the day. 

As experiments typically last 2-3 weeks, the promise of sequential testing saving days doesn’t necessarily hold unless a winner appears very early in the process. It’s more likely that the statistical test reaches significance during the last week. In this case, it’s best to complete the data collection until each day is sampled evenly so that the full period is covered.

Let’s consider the following simulation setting:

  • One reference with a 5% conversion rate
  • One variation with a  5.5% conversion rate (a 10% relative improvement)
  • 5,000 visitors as daily traffic
  • 14 days (2 weeks) of data collection
  • We ran thousands of such experiments to get histograms for the different decision indices

In the following histogram, the horizontal axis is the day when the sequential testing crosses the significance threshold. The vertical axis is the ratio of experiments which stopped on this day.

In this setting, day 10 is the most likely day for the sequential testing to reach significance. This means that you will usually need to wait until the planned end of the test anyway to respect the “same sampling each day” rule, and it’s very unlikely that you will get a significant positive result within one week. Thus, in practice, sequential testing rarely lets you declare a winner sooner in CRO.
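
If you want to explore this behaviour yourself, here is a minimal Monte Carlo sketch of the setting above. It makes several simplifying assumptions not stated in the article: the 5,000 daily visitors are split evenly between the two arms, significance is checked once per day with a z-test, and a single inflated critical value stands in for the platform’s actual sequential correction, so the exact shape of the resulting histogram will differ from the one discussed here.

```python
import numpy as np

rng = np.random.default_rng(42)

P_REF, P_VAR = 0.05, 0.055      # 5% reference, 5.5% variation (+10% relative)
DAILY_PER_ARM = 2_500           # assumption: 5,000 daily visitors split 50/50
DAYS = 14                       # 2 weeks of data collection
Z_CRIT = 2.6                    # assumed corrected threshold for daily peeking
N_SIMS = 2_000

def first_significant_day():
    """First day the corrected z-test crosses the threshold, or None if it never does."""
    conv_ref = conv_var = n = 0
    for day in range(1, DAYS + 1):
        n += DAILY_PER_ARM
        conv_ref += rng.binomial(DAILY_PER_ARM, P_REF)
        conv_var += rng.binomial(DAILY_PER_ARM, P_VAR)
        pooled = (conv_ref + conv_var) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (conv_var / n - conv_ref / n) / se
        if abs(z) > Z_CRIT:
            return day
    return None

stops = [first_significant_day() for _ in range(N_SIMS)]
for day in range(1, DAYS + 1):
    print(f"day {day:2d}: {stops.count(day) / N_SIMS:6.1%} of simulations stop here")
print(f"not significant within {DAYS} days: {stops.count(None) / N_SIMS:.1%}")
```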

Sequential testing objection 2: “Yes, (effect) size does matter”

In sequential testing, effect size is a less obvious problem and may need some further clarification to be properly understood.

In CRO, we consider mainly two statistical indices for decision-making: 

  • The pValue, or any other confidence index, which indicates whether a difference exists between the original and the variation. This index is used to validate or invalidate the test hypothesis. But a validated hypothesis is not necessarily a good business decision, so we need more information.
  • The Confidence Interval (CI) around the estimated gain, which indicates the size of the effect. It’s also central to business decisions. For instance, a variation can be a clear winner but with a very small margin that may not cover the implementation or operating costs, such as coupon offerings that need to cover the cost of the coupons.

Confidence intervals can be seen as a best and worst case scenario. For example a CI = [1% ; 12%] means “in the worst case you will only get 1% relative uplift,” which means going from 5% conversion rate to 5.05%. 

If the variation has an implementation or operating cost, the results may not be satisfying. In that case, the solution would be to collect more data in order to have a narrower confidence interval, until you get a more satisfying lower bound, or you may find that the upper bound goes very low showing that the effect, even if it exists, is too low to be worth it from a business perspective.
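
For reference, the sketch below shows one possible way (a simple parametric bootstrap, with made-up counts) to turn raw conversion numbers into a confidence interval on the relative uplift; it is not the formula AB Tasty uses in its reports.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_lift_ci(conv_ref, n_ref, conv_var, n_var, n_boot=10_000, level=0.95):
    """Percentile-bootstrap CI for the relative lift (rate_var - rate_ref) / rate_ref."""
    ref_rates = rng.binomial(n_ref, conv_ref / n_ref, size=n_boot) / n_ref
    var_rates = rng.binomial(n_var, conv_var / n_var, size=n_boot) / n_var
    lifts = (var_rates - ref_rates) / ref_rates
    lo, hi = np.percentile(lifts, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return lo, hi

# Made-up counts: 5.0% vs 5.5% observed conversion on 35,000 visitors per arm
low, high = relative_lift_ci(conv_ref=1_750, n_ref=35_000, conv_var=1_925, n_var=35_000)
print(f"95% CI on relative lift: [{low:.1%} ; {high:.1%}]")
```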

Using the same scenario as above, the lower bound of the confidence interval can be plotted as follows:

  • Horizontal axis – the percentage value of the lower bound
  • Vertical axis – the proportion of experiments with this lower bound value
  • Blue curve – sequential testing CI
  • Orange curve – classical fixed horizon testing

We can see that sequential testing yields a very low value for the lower bound of the confidence interval. Most of the time it is lower than 2% (in relative gain), which is very small. This means that you will get very poor information for business decisions. 

Meanwhile, classic fixed-horizon testing (orange curve) produces a lower bound above 5% in half of the cases, which is a more comfortable margin. Therefore, you can continue the data collection until you have a useful result, which means waiting for more data. Even if by chance the sequential testing found a variant reaching significance in one week, you would still need to collect data for another week to do two things: get a useful estimate of the uplift and sample each day equally.

This makes sense in light of the purpose of sequential testing: quickly detect when a variation produces results that differ from the original, whether for the worse or better.

From that perspective, it makes sense to stop the experiment as soon as the gain confidence interval lies mostly on either the positive or the negative side. But on the positive side, the CI lower bound is then close to 0, which doesn’t allow for efficient business decisions. It’s worth noting that for applications other than CRO this behaviour may be optimal, which is why sequential testing exists.

When does sequential testing in CRO make sense?

As we’ve seen, sequential testing should not be used to quickly determine a winning variation. However, it can be useful in CRO in order to detect losing variations as soon as possible (and prevent loss of conversions, revenue, …).

You may be wondering why it’s acceptable to stop an experiment midway through when your variation is losing rather than when you have a winning variation. This is because of the following reasons:

  • The most obvious one: To put it simply, you’re losing conversions. This is acceptable in the context of searching for a better variation than the original, but it makes little sense when there is a notable loss, indicating that the variation no longer has a chance of being a winner. An alerting system set at a low sensitivity level will help detect such impactful losses.
  • The less obvious one: Sometimes when an experiment is only slightly “losing” for a good period of time, practitioners tend to let this kind of test run in the hopes that it may turn into a “winner”. Thus, they accept this loss because the variation is only “slightly” losing but they often forget that another valuable component is lost in the process: traffic, which is essential for experimentation. For an optimal CRO strategy, one needs to take these factors into account and consider stopping this kind of useless experiment, doomed to have small effects. In such a scenario, an automated alert system will suggest stopping this kind of test and allocate this traffic to other experiments.

Therefore, sequential testing is, in fact, a valuable tool to alert and stop a losing variation.

However, one more objection could still be raised: by stopping the experiment midway,  you are breaking the “sample each day the same” rule. 

In this particular case, stopping a losing variation has very little chance of being a bad move. In order for the detected variation to become a winner, it first needs to gain enough conversions to be comparable to the original version. Then it would need another set of conversions to become a “mild” winner, which still wouldn’t be enough to be considered a business winner. To be considered a winner for your business, the competing variation would need yet another large amount of conversions, with a margin high enough to cover the cost of implementation, localization, and/or operating costs.

All of these events would have to happen in less than a week (i.e., the number of days needed to complete the current week). This is very unlikely, which means it’s safe and smart to stop such experiments.

Conclusion

It may be surprising or disappointing to learn that, contrary to what many believe, there is no business value in stopping winning experiments early. This is because a statistical winner is not a business winner. Stopping a test early takes away the data you need to reach a meaningful effect size estimate, which is what increases your chances of getting a genuine winning variation.

With that in mind, the best way to use this type of testing is as an alert to help spot and stop tests that are either harmful to the business or not worth continuing. 

About the Author:

Hubert Wassner has been working as a Senior Data Scientist at AB Tasty since 2014. With a passion for science, data and technology, his work has focused primarily on all the statistical aspects of the platform, which includes building Bayesian statistical tests adapted to the practice of A/B testing for the web and setting up a data science team for machine learning needs.

After getting his degree in Computer Science with a speciality in Signal Processing at ESIEA, Hubert started his career as a research engineer in the field of voice recognition in Switzerland, followed by research in the field of genomic data mining at a biotech company. He was also a professor at the ESIEA engineering school, where he taught courses in algorithmics and machine learning.

The post How to Effectively Use Sequential Testing appeared first on abtasty.

Inconclusive A/B Test Results – What’s Next? https://www.abtasty.com/blog/inconclusive-ab-test-results/ Thu, 05 Oct 2023 07:56:01 +0000 https://www.abtasty.com/?p=132851 Have you ever had an experiment leave you with an unexpected result and were unsure of what to do next? This is the case for many when receiving neutral, flat, or inconclusive A/B test results and this is a question […]


Have you ever had an experiment leave you with an unexpected result and were unsure of what to do next? This is the case for many when receiving neutral, flat, or inconclusive A/B test results and this is a question we aim to answer.

In this article, we are going to discuss what an inconclusive experimentation result is, what you can learn from it, and what the next step is when you receive this type of result.

What is an inconclusive experiment result?

There are two ways to define an inconclusive experiment: a practitioner’s answer and a more detailed one. The basic practitioner’s answer is numerical and relies on the statistical information shown by the platform you’re using:

  • The probability of a winner is less than 90-95%
  • The pValue is bigger than 0.05
  • The lift confidence interval includes 0

In other words, an inconclusive result happens when the results of an experiment are not statistically significant or the uplift is too small to be measured. 
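
As a rough illustration of these criteria, here is a sketch using a standard two-proportion z-test and a Wald interval on made-up numbers; your platform’s statistics engine (Bayesian or otherwise) will compute these differently.

```python
from math import sqrt
from statistics import NormalDist

def ab_summary(conv_ref, n_ref, conv_var, n_var, level=0.95):
    """p-value, relative-lift CI and a naive 'inconclusive' flag for a two-proportion test."""
    p_ref, p_var = conv_ref / n_ref, conv_var / n_var
    pooled = (conv_ref + conv_var) / (n_ref + n_var)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_var))
    p_value = 2 * (1 - NormalDist().cdf(abs((p_var - p_ref) / se_test)))
    # Wald CI on the absolute lift, expressed relative to the reference rate
    se_ci = sqrt(p_ref * (1 - p_ref) / n_ref + p_var * (1 - p_var) / n_var)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = ((p_var - p_ref - z * se_ci) / p_ref, (p_var - p_ref + z * se_ci) / p_ref)
    inconclusive = p_value > (1 - level) or ci[0] <= 0 <= ci[1]
    return {"p_value": round(p_value, 4), "relative_lift_ci": ci, "inconclusive": inconclusive}

# Made-up data: 3.0% vs 3.2% conversion on 15,000 visitors per variation
print(ab_summary(conv_ref=450, n_ref=15_000, conv_var=480, n_var=15_000))
```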

However, let’s take note of the true meaning of “significance” in this case: the significance level is the threshold one has previously set on a metric or statistic. If this previously set threshold is crossed, then an action will be taken, usually implementing the winning variation.

Setting thresholds for experimentation

It’s important to note that the user sets the threshold and there is no magic formula for calculating a threshold value. The only mandatory rule is that the threshold must be set before the beginning of the experiment. Following this statistical hypothesis protocol provides caution and mitigates the risk of making a poor decision or missing an opportunity during experimentation.

To set a proper threshold, you will need a mix of statistical and business knowledge considering the context.

There is no golden rule, but there is a widespread consensus for using a “95% significance threshold.” However, it’s best to use this generalization cautiously as using the 95% threshold may be a bad choice in some contexts.

To make things simple, let’s consider that you’ve set a significance threshold that fits your experiment context. Then, having a “flat” result may have different meanings – we will dive into this more in the following sections.

The best tool: the confidence interval (CI)

The first thing to do after the planned end of an experiment is to check the confidence interval (CI), which provides useful information without any notion of significance. The common practice is to build these intervals at a 95% confidence level. This means that there is a 95% chance that the real value lies between its boundaries. You can consider the boundaries to be an estimate of the best- and worst-case scenarios.

Let’s say that your experiment is about collaborating with a brand ambassador (or influencer) to attract more attention and sales, and you want to see the impact the brand ambassador has on the conversion rate. There are several possible scenarios depending on the CI values:

Scenario 1:

The confidence interval of the lift is [-1% : +1%]. This means that in the best-case scenario, this ambassador effect is a 1% gain and in the worst-case scenario, the effect is -1%. If this 1% relative gain is less than the cost of the ambassador, then you know that it’s okay to stop this collaboration.

A basic estimation can be done by taking this 1% of your global revenue from an appropriate period. If this is smaller than the cost of the ambassador, then there is no need for “significance“ to validate the decision – you are losing money.
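For example, with purely hypothetical figures: if the site generated €500,000 of revenue over the test period, the best-case 1% uplift is worth about €5,000. If the ambassador contract costs €8,000 for that same period, the collaboration cannot pay for itself even in the best case, significance or not.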

Sometimes neutrality is a piece of actionable information.

Scenario 2: 

The confidence interval of the lift is [-1% : +10%]. Although this sounds promising, it’s important not to make quick assumptions. Since 0 is still in the confidence interval, you’re still unsure whether this collaboration has a real impact on conversion. In this case, it would make sense to extend the experiment period, because the gain is more likely to be positive than negative.

It’s best to extend the experimentation period until the left bound gets to a “comfortable” margin.

Let’s say that the cost of the collaboration is covered if the gain is as small as 3%; then any CI of the form [3%, XXX%] will be okay. With a CI like this, you ensure that you at least break even in the worst-case scenario. And with more data, you will also have a better estimate of the best-case scenario, which will most likely be lower than the initial 10%.

Important notice: do not repeat this too often, otherwise you may be waiting until your variant beats the original just by chance.

When extending a testing period, it’s safer to do it by looking at the CI rather than the “chances to win” or p-value, because the CI provides you with an estimate of the effect size. When the variant wins only by chance (a risk that increases every time you extend the testing period), it will yield a very small effect size.

You will notice the size of the gain by looking at the CI, whereas a p-value (or any other statistical index) will not inform you about the size. Repeatedly extending a test until it wins is a known statistical mistake called p-hacking: p-hacking is basically running an experiment until you get what you expect.

The dangers of P-hacking in experimentation

It’s important to be cautious of p-hacking. Statistical tests are meant to be used once. Splitting the analysis into segments can, to some extent, be seen as running several different experiments. Therefore, if making a single decision at a 95% significance level means accepting a 5% risk of a false positive, then checking 2 segments implicitly leads to roughly doubling this risk to 10%.
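
The arithmetic behind this “roughly 10%” is simply the family-wise error rate, under the simplifying assumption that the segments behave independently:

```python
# Chance of at least one false positive when checking k independent segments at 95%
alpha = 0.05
for k in (1, 2, 3, 5, 10):
    print(f"{k:2d} segment(s): {1 - (1 - alpha) ** k:.1%}")
# 2 segments -> ~9.8%, i.e. the "roughly 10%" mentioned above
```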

The following advice may help to mitigate this risk:

  • Limit the segments you study to those that have a reason to interact differently with the variation. For example: a user interface modification may be affected by the screen size or the browser used, since they change how the modification is displayed, but not by geolocation.
  • Use segments that convey strong information regarding the experiment. For example: changing the wording of a message may have no link to the browser used; it may only affect the emotional needs of visitors, which is something you can capture with new AI technology when using AB Tasty.
  • Don’t check the smallest segments. The smallest segments will not greatly impact your business overall and are often the least statistically significant. Raising the significance threshold may also be useful to mitigate the risk of a false positive.

Should you extend the experiment period often?

If you notice that you often need to extend the experiment period, you might be skipping an important step in the test protocol: estimating the sample size you need for your experiment.

Unfortunately, many people are skipping this part of the experiment thinking that they can fix it later by extending the period. However, this is bad practice for several reasons:

  • This brings you close to P-hacking
  • You may lose time and traffic on tests that will never be significant

Sample size calculators ask a question you can’t know the answer to: what will be the size of the lift? It’s impossible to know, and this is one reason why experimenters don’t often use them. The reason you test and experiment is precisely that you do not know the outcome.

A far more intuitive approach is to use a Minimal Detectable Effect (MDE) calculator. Based on the base conversion rate and the number of visitors you send to a given experiment, an MDE calculator can help you come up with the answer to the question: what is the smallest effect you may be able to detect? (if it exists).

For example, if the total traffic on a given page is 15k for 2 weeks, and the conversion rate is 3% – the calculator will tell you that the MDE is about 25% (relative). This means that what you are about to test must have a quite big impact: going from 3% to 3.75% (25% relative growth).
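
For the curious, here is a hedged sketch of that kind of MDE calculation using statsmodels. The assumptions (a 50/50 split of the 15k visitors, a two-sided test at alpha = 0.05, 80% power) are ours, and AB Tasty’s own calculator may use slightly different conventions, so the result lands near, but not exactly on, the 25% quoted above.

```python
from math import asin, sin, sqrt
from statsmodels.stats.power import NormalIndPower

baseline_cr = 0.03
visitors_per_variation = 15_000 // 2   # assumption: 50/50 split of the page traffic

# Smallest standardized effect (Cohen's h) detectable with this sample size
h = NormalIndPower().solve_power(
    effect_size=None,
    nobs1=visitors_per_variation,
    alpha=0.05,
    power=0.8,
    ratio=1.0,
    alternative="two-sided",
)

# Convert Cohen's h back into a conversion rate for the variation
phi_baseline = 2 * asin(sqrt(baseline_cr))
mde_cr = sin((phi_baseline + h) / 2) ** 2
relative_mde = (mde_cr - baseline_cr) / baseline_cr
print(f"MDE: {baseline_cr:.2%} -> {mde_cr:.2%} (~{relative_mde:.0%} relative)")
```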

If your variant only changes the colors of a small button, developing an entire experiment may not be worth the time. Even if the new colors are better and give you a small uplift, it will not be significant in the classic statistical sense (a “chance to win” above 95% or a p-value below 0.05).

On the other hand, if your variation tests a big change such as offering a coupon or a brand new product page format, then this test has a chance to give usable results in the given period.

Digging deeper into ‘flatness’

Some experiments may appear to be flat or inconclusive when in reality, they need a closer look.

For example, frequent visitors may be puzzled by your changes because they expect your website to remain the same, whereas new visitors may instantly prefer your variation. This combined effect of the two groups may cancel each other out when looking at the overall results instead of further investigating the data. This is why it’s very important to take the time to dig into your visitor segments as it can provide useful insights.

This can lead to very useful personalization, where only the segment that benefits from the variation is exposed to it.
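
To make this cancelling-out effect concrete, here is a tiny numerical illustration; the segment sizes and conversion rates are hypothetical.

```python
# Hypothetical: the variation helps new visitors but hurts frequent ones
segments = {
    "new visitors":      {"visitors": 10_000, "cr_original": 0.040, "cr_variation": 0.046},
    "frequent visitors": {"visitors": 10_000, "cr_original": 0.060, "cr_variation": 0.054},
}

total = sum(s["visitors"] for s in segments.values())
overall_original = sum(s["visitors"] * s["cr_original"] for s in segments.values()) / total
overall_variation = sum(s["visitors"] * s["cr_variation"] for s in segments.values()) / total

for name, s in segments.items():
    lift = (s["cr_variation"] - s["cr_original"]) / s["cr_original"]
    print(f"{name:18s} lift: {lift:+.0%}")
print(f"overall            lift: {(overall_variation - overall_original) / overall_original:+.0%}")
# new visitors       lift: +15%
# frequent visitors  lift: -10%
# overall            lift: +0%
```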

What is the next step after receiving an inconclusive experimentation result?

Let’s consider that your variant has no effect at all, or at least not enough to have a business impact. This still means something. If you reach this point, it means that all previous ideas fell short: you discovered no behavioral difference despite the changes you made in your variation.

What is the next step in this case? The next step is actually to go back to the previous step – the hypothesis. If you are correctly applying the testing protocol, you should have stated a clear hypothesis. It’s time to use it now.

There might be several meta-hypotheses about why your hypothesis has not been validated by your experiment:

  • The signal is too weak. You might have made a change, but perhaps it’s barely noticeable. If you offered free shipping, your visitors might not have seen the message if it’s too low on the page.
  • The change itself is too weak. In this case, try to make the change more significant. If you have increased the size of the product picture on the page by 5%, it’s time to try 10% or 15%.
  • The hypothesis might need revision. Maybe the trend is reversed. For instance, if the confidence interval of the gain is more on the negative side, why not try implementing the opposite idea?
  • Think of your audience. Even if you have a strong belief in your hypothesis, it may simply be time to change your mind about what is important for your visitors and try something different.

It’s important to note that this is something you’ve learned thanks to your experiment. It’s not a waste of time; it’s another step forward in getting to know your audience better.

Yielding an inconclusive experiment

An experiment that doesn’t yield a clear winner (or loser) is often called neutral, inconclusive, or flat. It still produces valuable information if you know how and where to search. It’s not an end; it’s just another step forward in your understanding of who you’re targeting.

In other words, an inconclusive experiment result is always a valuable result.

The post Inconclusive A/B Test Results – What’s Next? appeared first on abtasty.

Mastering Customer Engagement with Tula https://www.abtasty.com/resources/mastering-customer-engagement-tula/ Wed, 13 Sep 2023 14:23:18 +0000 https://www.abtasty.com/?post_type=resources&p=131072 Discover how Tula Skincare tackles data-driven personalization and customer engagement challenges. Join us to explore accessing first-party data, customer value strategies, and seamless data collection.



In this webinar, Tula Skincare, the renowned probiotic skincare brand, outlines the challenges it faces with data-driven personalization and customer engagement.

We explore how to access first-party data, including implicit and explicit data, to enhance the customer experience.

We share strategies on how to make customers feel valued and rewarded for sharing their information, how to do that while creating a seamless customer journey, and fun ways you can engage with your audience. Lastly, get insights into the collaborative teams and processes necessary to test and experiment in these areas!

Webinar Highlights:

  • Accessing first-party data: Implicit and explicit insights
  • Engaging through interactive quizzes: Tula’s success story
  • Seamless data collection: Adding value without interruption

The post Mastering Customer Engagement with Tula appeared first on abtasty.
