Statistical Methodology Archives - abtasty
https://www.abtasty.com/topics/statistical-methodology/

Frequentist vs Bayesian Methods in A/B Testing
https://www.abtasty.com/blog/bayesian-ab-testing/
Tue, 21 May 2024

In A/B testing, there are two main ways of interpreting test results: Frequentist vs Bayesian.

These terms refer to two different inferential statistical methods. Debates over which is ‘better’ are fierce – and at AB Tasty, we know which method we’ve come to prefer.

If you’re shopping for an A/B testing vendor, new to A/B testing or just trying to better interpret your experiment’s results, it’s important to understand the logic behind each method. This will help you make better business decisions and/or choose the best experimentation platform.

In this article, we discuss these two statistical methods under the inferential statistics umbrella, compare and contrast their strong points and explain our preferred method of measurement.

What is inferential statistics?

Both Frequentist and Bayesian methods are under the umbrella of inferential statistics.

As opposed to descriptive statistics (which describes purely past events), inferential statistics tries to infer or forecast future events.

Would version A or version B have a better impact on X KPI?

Side note: If we want to geek out, technically inferential statistics isn’t really forecasting in a temporal sense, but extrapolating what will happen when we apply results to a larger pool of participants. 

What happens if we apply winning version B to my entire website audience? There’s a notion of ‘future’ events in that we need to actually implement version B tomorrow, but in the strictest sense, we’re not using statistics to ‘predict the future.’

For example, let’s say you were really into Olympic sports, and you wanted to learn more about the men’s swimming team. Specifically, how tall are they? Using descriptive statistics, you could determine some interesting facts about ‘the sample’ (aka the team):

  • The average height of the sample
  • The spread of the sample (variance)
  • How many people are below or above the average
  • Etc.

This might fit your immediate needs, but the scope is pretty limited.

What inferential statistics allows you to do is to infer conclusions about populations that are too big to study in a descriptive way.

If you were interested in knowing the average height of all men on the planet, it wouldn’t be possible to go and collect all that data. Instead, you can use inferential statistics to infer that average from different, smaller samples.
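As a rough sketch of the difference, the snippet below first describes a small sample and then infers a range for the wider population's average. The heights are made-up values for illustration, and the 1.96 multiplier is a normal approximation we are assuming for simplicity.

```python
import math
import statistics

# Hypothetical heights in cm for a small random sample (made-up values).
sample_heights = [172.0, 168.5, 181.2, 175.4, 169.9, 178.3, 174.1, 183.0, 170.6, 176.8]

n = len(sample_heights)
mean = statistics.mean(sample_heights)
stdev = statistics.stdev(sample_heights)  # sample standard deviation (n - 1)

# Descriptive: facts about the sample itself.
print(f"sample mean = {mean:.1f} cm, sample stdev = {stdev:.1f} cm")

# Inferential: an approximate 95% interval for the population mean.
# 1.96 comes from the normal approximation; with only 10 data points a
# t-quantile would give a slightly wider interval.
margin = 1.96 * stdev / math.sqrt(n)
low, high = mean - margin, mean + margin
print(f"approx. 95% interval for the population mean: {low:.1f} to {high:.1f} cm")
```

The first print only talks about the ten people measured; the second makes a hedged statement about everyone they were drawn from.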

Two ways of inferring this kind of information through statistical analysis are the Frequentist and Bayesian methods.

What is the Frequentist statistics method in A/B testing? 

The Frequentist approach is perhaps more familiar to you since it’s more frequently used by A/B testing software (pardon the pun). This method also makes an appearance in college-level stats classes.

This approach is designed to make a decision about a unique experiment.

With the Frequentist approach, you start with the hypothesis that there is no difference between test versions A and B. And at the end of your experiment, you’ll end up with something called a P-Value (probability value).

The P-Value is the probability of obtaining results at least as extreme as the observed results assuming that there is no (real) difference between the experiments.

In practice, the P-Value is often loosely read as the probability that there is no difference between your two versions. (That’s why it is often “inverted” with the basic formula p = 1-pValue, in order to express the probability that there is a difference.) Strictly speaking, this is a shortcut: the P-Value is calculated assuming there is no difference, so it isn’t the probability of that assumption being true.

The smaller the P-Value, the stronger the evidence that there is, in fact, a difference, and that your starting hypothesis is wrong.
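To make this concrete, here is a minimal sketch of one common Frequentist test for conversion rates, the two-proportion z-test. The function name and the visitor counts are our own assumptions for illustration; a given A/B testing tool may use a different test.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided P-Value for H0: both versions share the same conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / se
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 5,000 visitors per version.
p_value = two_proportion_p_value(conv_a=500, n_a=5000, conv_b=560, n_b=5000)
print(f"P-Value = {p_value:.4f}")
```

With these made-up numbers the P-Value lands right around the usual 0.05 cut-off, which shows how a seemingly healthy lift can still be a borderline result.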

Frequentist pros:

  • Frequentist models are available in statistics libraries for virtually any programming language.
  • The computation of frequentist tests is blazing fast.

Frequentist cons:

  • You only estimate the P-Value at the end of a test, not during. ‘Data peeking’ before a test has ended generates misleading results because it actually becomes several experiments (one experiment each time you peek at the data), whereas the test is designed for one unique experiment.
  • The P-Value alone doesn’t tell you the actual gain interval of a winning variation – just that it won.
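The ‘data peeking’ problem can be illustrated with a small simulation. The sketch below runs A/A experiments (both versions share the same true conversion rate, so every ‘winner’ is a false positive) and compares checking once at the end with checking at ten interim looks. All names and parameters are assumptions chosen for illustration.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided P-Value of a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_experiments=400, visitors_per_arm=1000, looks=10,
             rate=0.10, alpha=0.05, seed=0):
    rng = random.Random(seed)
    step = visitors_per_arm // looks
    flagged_with_peeking = 0  # at least one look showed p < alpha
    flagged_at_end_only = 0   # only the final look is considered
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        any_significant = False
        for look in range(1, looks + 1):
            # Both arms use the same true rate: there is no real difference.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = p_value(conv_a, look * step, conv_b, look * step)
            if p < alpha:
                any_significant = True
        flagged_with_peeking += any_significant
        flagged_at_end_only += p < alpha  # p now holds the final look
    return flagged_with_peeking / n_experiments, flagged_at_end_only / n_experiments

peek_rate, final_rate = simulate()
print(f"false positives with peeking: {peek_rate:.1%}")
print(f"false positives at end only:  {final_rate:.1%}")
```

Because the final look is one of the ten looks, the peeking rate can never be lower than the end-only rate, and in practice it tends to be noticeably higher.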

What is the Bayesian statistics method in A/B testing?

The Bayesian approach looks at things a little differently.

We can trace it back to a charming British mathematician, Thomas Bayes, and his eponymous Bayes’ Theorem.

The Bayesian approach allows for the inclusion of prior information (‘a prior’) into your current analysis. The method involves three overlapping concepts:

  • Prior – information you have from a previous experiment. At the beginning of the experiment, we use a ‘non-informative’ prior (think ‘empty’)
  • Evidence – the data of the current experiment
  • Posterior – the updated information you have from the prior and the evidence. This is what is produced by the Bayesian analysis.
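For reference, Bayes’ Theorem ties these three concepts together. In the notation below (our own labelling), the left-hand side is the posterior, P(hypothesis) is the prior, and the evidence enters through the other two terms:

```latex
P(\text{hypothesis} \mid \text{evidence}) =
  \frac{P(\text{evidence} \mid \text{hypothesis}) \; P(\text{hypothesis})}{P(\text{evidence})}
```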

By design, this test can be used for an ongoing experiment. When data peeking, the ‘peeked at data’ can be seen as a prior, and the future incoming data will be the evidence, and so on.

This means ‘data peeking’ naturally fits in the test design. So at each ‘data peeking,’ the posterior computed by the Bayesian analysis is valid.
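Here is a minimal sketch of that ‘posterior becomes the next prior’ idea. It assumes a Beta prior on the conversion rate, a common textbook choice for conversion data and not necessarily the model any particular platform uses; the weekly counts are hypothetical.

```python
def update(prior, conversions, visitors):
    """Beta(a, b) prior + observed conversions -> Beta posterior."""
    a, b = prior
    return (a + conversions, b + visitors - conversions)

flat_prior = (1, 1)  # Beta(1, 1): uniform, i.e. 'non-informative'

# Peek after week 1, then carry on with week 2 (hypothetical counts).
after_week_1 = update(flat_prior, conversions=40, visitors=400)
after_week_2 = update(after_week_1, conversions=55, visitors=500)

# The same data analysed in one go at the end.
all_at_once = update(flat_prior, conversions=95, visitors=900)

print(after_week_2, all_at_once)
posterior_mean = after_week_2[0] / sum(after_week_2)
print(f"posterior mean conversion rate: {posterior_mean:.3f}")
```

Updating in two steps gives exactly the same posterior as analysing everything at once, which is why looking at the posterior mid-test doesn’t corrupt it.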

Crucially for A/B testing in a business setting, the Bayesian approach allows the CRO practitioner to estimate the gain of a winning variation – more on that later.

Bayesian pros:

  • Allows you to ‘peek’ at the data during a test, so you can either stop sending traffic if a variation is tanking or switch earlier to a clear winner.
  • Allows you to see the actual gain of a winning test.
  • By its nature, often rules out the implementation of false positives.

Bayesian cons:

  • Needs a sampling loop, which takes a non-negligible CPU load.  This is not a concern at the user level, but could potentially gum things up at scale.
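To give an idea of what such a sampling loop can look like, here is a sketch that estimates the probability that version B beats version A by drawing from each version’s posterior. The Beta posterior, the flat prior and the visitor counts are assumptions for illustration, not a description of any vendor’s engine.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 5,000 visitors per version.
chance = prob_b_beats_a(conv_a=500, n_a=5000, conv_b=560, n_b=5000)
print(f"chance that B is better than A: {chance:.1%}")
```

Each call loops over tens of thousands of random draws, which is where the extra CPU load comes from compared with a closed-form Frequentist test.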

Bayesian vs Frequentist: which is better?

So, which method is the ‘better’ method?

Let’s start with the caveat that both are perfectly legitimate statistical methods. But at AB Tasty, our customer experience optimization and feature management software, we have a clear preference for the Bayesian A/B testing approach. Why?

Gain size

One very strong reason is that with Bayesian statistics, you can estimate a range of the actual gain of a winning variation, instead of only knowing that it was the winner, full stop.

In a business setting, this distinction is crucial. When you’re running your A/B test, you’re really deciding whether to switch from variation A to variation B, not whether you choose A or B from a blank slate. You therefore need to consider:

  • The implementation cost of switching to variation B (time, resources, budget)
  • Additional associated costs of variation B (vendor costs, licenses…)

As an example, let’s say you’re a B2B software vendor, and you ran an A/B test on your pricing page. Variation B included a chatbot, whereas version A didn’t. Variation B outperformed variation A, but to implement variation B, you’ll need 2 weeks of developer time to integrate your chatbot into your lead workflow, plus allocate X dollars of marketing budget to pay for the monthly chatbot license.

You need to be sure the math adds up, and that it’s more cost-effective to switch to version B when these costs are weighed against the size of the test gain. A Bayesian A/B testing approach will let you do that.

Let’s take a look at an example from the AB Tasty reporting dashboard.

In this fictional test, we’re measuring three variations against an original, with ‘CTA clicks’ as our KPI.

[Screenshot: AB Tasty reporting dashboard]

We can see that variation 2 looks like the clear winner, with a conversion rate of 34.5%, compared to the original of 25%. But by looking to the right, we also get the confidence interval of this gain. In other words, a best and worst-case scenario.

The median gain for variation 2 is +36.4%, with the lowest gain marker at +2.25% and the highest at +48.40%.

These are the lowest and the highest gain markers you can achieve in 95% of cases.

If we break it down even further:

  • There’s a 50% chance of the gain percentage lying above 36.4% (the median)
  • There’s a 50% chance of it lying below 36.4%.
  • In 95% of cases, the gain will lie between +2.25% and +48.40%.
  • There remains a 2.5% chance of the gain lying below 2.25% (our famous false positive) and a 2.5% chance of it lying above 48.40%.
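A gain interval like this can be sketched by sampling from each version’s posterior and looking at the spread of the relative gain. The visitor counts below are hypothetical (the dashboard example doesn’t state them), so the output will not reproduce the exact figures above; the Beta posterior is also an assumption.

```python
import random
import statistics

def gain_interval(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Return (low, median, high) of the relative gain in %, with a 95% interval."""
    rng = random.Random(seed)
    gains = []
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        gains.append((rate_b - rate_a) / rate_a * 100)
    gains.sort()
    low = gains[draws * 25 // 1000]         # 2.5th percentile
    high = gains[draws * 975 // 1000 - 1]   # 97.5th percentile
    return low, statistics.median(gains), high

# Hypothetical counts matching a 25% vs 34.5% conversion rate.
low, median, high = gain_interval(conv_a=50, n_a=200, conv_b=69, n_b=200)
print(f"gain: median {median:+.1f}%, 95% interval {low:+.1f}% to {high:+.1f}%")
```

Collecting more visitors narrows the interval, which is the ‘shrinking’ effect that makes waiting worthwhile when implementation costs are high.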

This level of granularity can help you decide whether to roll out a winning test variation across your site.

Are both the lowest and highest ends of your gain markers positive? Great!

Is the interval small, i.e. you’re quite sure of this high positive gain? It’s probably the right decision to implement the winning version.

Is your interval wide but implementation costs are low? No harm in going ahead there, too.

However, if your interval is large and the cost of implementation is significant, it’s probably best to wait until you have more data to shrink that interval. At AB Tasty we generally recommend that you:

  • Wait until you have recorded at least 5,000 unique visitors per variation
  • Let the test run for at least 14 days (two business cycles)
  • Wait until you have reached 300 conversions on the main goal.
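These three rules of thumb are easy to turn into a quick readiness check. The sketch below is our own helper, not an AB Tasty API, and it assumes the 300-conversion guideline applies to each variation.

```python
from dataclasses import dataclass

@dataclass
class VariationStats:
    name: str
    unique_visitors: int
    conversions: int

def unmet_conditions(variations, days_running,
                     min_visitors=5000, min_days=14, min_conversions=300):
    """Return a list of reasons to keep the test running (empty if none)."""
    reasons = []
    if days_running < min_days:
        reasons.append(f"only {days_running} days elapsed (need {min_days})")
    for v in variations:
        if v.unique_visitors < min_visitors:
            reasons.append(f"{v.name}: {v.unique_visitors} visitors (need {min_visitors})")
        if v.conversions < min_conversions:
            # Assumption: the conversion threshold is checked per variation.
            reasons.append(f"{v.name}: {v.conversions} conversions (need {min_conversions})")
    return reasons

test_state = [
    VariationStats("original", unique_visitors=6200, conversions=410),
    VariationStats("variation 1", unique_visitors=6150, conversions=280),
]
for reason in unmet_conditions(test_state, days_running=10):
    print("keep waiting:", reason)
```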

Data peeking

Another advantage of Bayesian statistics is that it’s ok for you to ‘peek’ at your data’s results during a test (but be sure not to overdo it…).

Let’s say you’re working for a giant e-commerce platform and you’re running an A/B test involving a new promotional offer. If you notice that version B is performing abysmally – losing you big money – you can stop it immediately!

Conversely, if your test is outperforming, you can switch all of your website traffic to the winning version earlier than if you were relying on the Frequentist method.

This is precisely the logic behind our Dynamic Traffic Allocation feature – and it wouldn’t be possible without Mr. Thomas Bayes.

Dynamic Traffic Allocation

If we pause quickly on the topic of Dynamic Traffic Allocation, we’ll see that it’s particularly useful in business settings or contexts that are volatile or time-limited.

[Screenshot: Dynamic Traffic Allocation option in the AB Tasty interface.]

Essentially, (automated) Dynamic Traffic Allocation strikes the balance between data exploitation and exploration.

The test data is ‘explored’ rigorously enough to be confident in the conclusion, and ‘exploited’ early enough so as to not lose out on conversions (or whatever your primary KPI is) unnecessarily. Note that this isn’t manual – a real live person is not interpreting these results and deciding to go or not to go.

Instead, an algorithm is going to make the choice for you, automatically.

In practice, for AB Tasty clients, this means checking the associated box and picking your primary KPI. The platform’s algorithm will then make the determination of if or when to send the majority of your traffic to a winning variation, once it’s determined.
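The exact algorithm behind the feature isn’t described here, so the sketch below swaps in Thompson sampling, a well-known Bayesian technique for balancing exploration and exploitation. The true conversion rates are hypothetical and would of course be unknown in a real test.

```python
import random

def thompson_allocate(stats, rng):
    """Pick the variation for the next visitor.

    `stats` maps a variation name to [conversions, visitors].
    """
    best_name, best_draw = None, -1.0
    for name, (conversions, visitors) in stats.items():
        # Draw a plausible conversion rate from this variation's posterior.
        draw = rng.betavariate(1 + conversions, 1 + visitors - conversions)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name

rng = random.Random(0)
true_rates = {"original": 0.10, "variation": 0.15}  # hypothetical
stats = {name: [0, 0] for name in true_rates}

for _ in range(5000):
    chosen = thompson_allocate(stats, rng)
    stats[chosen][1] += 1                                # one more visitor
    stats[chosen][0] += rng.random() < true_rates[chosen]  # maybe a conversion

for name, (conversions, visitors) in stats.items():
    print(f"{name}: {visitors} visitors, {conversions} conversions")
```

Early on, traffic is spread across both versions (exploration); as evidence builds up, more and more visitors are sent to the better performer (exploitation).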

This kind of approach is particularly useful:

  • When optimizing micro-conversions over a short time period
  • When the time span of the test is short (for example, during a holiday sales promotion)
  • When your target page doesn’t get a lot of traffic
  • When you’re testing 6+ variations

Though you’ll want to pick and choose when to go for this option, it’s certainly a handy one to have in your back pocket.

Want to start A/B testing on your website with a platform that leverages the Bayesian method? AB Tasty is a great example of an A/B testing tool that allows you to quickly set up tests with low code implementation of front-end or UX changes on your web pages, gather insights via an ROI dashboard, and determine which route will increase your revenue.

False Positives

In Bayesian statistics, like with Frequentist methods, there is a risk of what’s called a false positive.

A false positive, as you might guess, is when a test result indicates a variation shows an improvement when in reality it doesn’t.

It’s often the case with false positives that version B gives the same results as version A (not that it performs inadequately compared to version A).

While by no means innocuous, false positives certainly aren’t a reason to abandon A/B testing. Instead, you can adjust your confidence threshold to fit the risk associated with a potential false positive.

Gain probability using Bayesian statistics

You’ve probably heard of the 95% gain probability rule of thumb.

In other words, you consider that your test is statistically significant when you’ve reached a 95% certainty level. You’re 95% sure your version B is performing as indicated, but there’s still a 5% risk that it isn’t.

For many marketing campaigns, this 95% threshold is probably sufficient. But if you’re running a particularly important campaign with a lot at stake, you can adjust your gain probability threshold to be even more exact – 97%, 98% or even 99%, practically ruling out the potential for a false positive.

While this seems like a safe bet – and it is the right choice for high-stakes campaigns – it’s not something to apply across the board.

This is because:

  • In order to attain this higher threshold, you’ll have to wait longer for results, therefore leaving you less time to reap the rewards of a positive outcome.
  • You will implicitly only get a winner with a bigger gain (which is rarer), and you will let go of smaller improvements that still could be impactful.
  • If you have a smaller amount of traffic on your web page, you may want to consider a different approach

Bayesian tests limit false positives

Another thing to keep in mind is that because the Bayesian approach provides a gain interval – and because false positives virtually only appear to perform slightly better than in reality – you’re unlikely to implement a false positive in the first place.

A common scenario would be that you run an A/B test to test whether a new promotional banner design increases CTA click-through rates.

Your result says version B performs better with a 95% gain probability but that the gain is minuscule (1% median improvement). Were this to be a false positive, you’re unlikely to deploy the version B promotional banner across your website, since the resources needed to implement it wouldn’t be worth the minimal gain.

But, since a Frequentist approach doesn’t provide the gain interval, you might be more tempted to put in place the false positive. While this wouldn’t be the end of the world – version B likely performs the same as version A – you would be spending time and energy on a modification that won’t bring you any added return.

Bottom line? If you play it too safe and wait for a confidence level that’s too high, you’ll miss out on a series of smaller gains, which is also a mistake.

Wrapping up: Frequentist vs Bayesian

So, which is better, Frequentist or Bayesian?

As we mentioned earlier, both approaches are perfectly sound statistical methods.

But at AB Tasty, we’ve opted for the Bayesian approach, since we think it helps our clients make even better business decisions on their web experiments.

It also allows for more flexibility and maximizing returns (Dynamic Traffic Allocation). As for false positives, these can occur whether you go with a Frequentist or Bayesian approach – though you’re less likely to fall for one with the Bayesian approach.

At the end of the day, if you’re shopping for an A/B testing platform, you’ll want to find one that gives you easily interpretable results that you can rely on.

Doing CRO at Scale? AB Tasty’s Testing Alerts Prevent Revenue Loss
https://www.abtasty.com/blog/sequential-testing-alerts/
Tue, 17 Oct 2023

When talking about CRO at scale, a common fear is, what if some experiments go wrong? And we mean seriously wrong. The more experiments you have running, the higher the chance that one could negatively impact the business. It might be a bug in a variant, or it might just be a bad idea – but conversion can drop dramatically, hurting your revenue goals. 

That’s why it makes sense to monitor experiments daily even if the experiment protocol says that one shouldn’t make decisions before the planned end of the experiment. This leads to two questions: 

  • How do you make a good decision using statistical indices that are not meant to be used that way? Checking it daily is known as “P-hacking” or “data peeking” and generates a lot of poor decisions.
  • How do you check tens of experiments daily without diverting the team’s time and efforts?

The solution is to use another statistical framework, adapted to the sequential nature of checking an experiment daily. This is known as sequential testing.

How do I go about checking my experiments daily?

Good news, you don’t have to do this tedious job yourself. AB Tasty has developed an automatic alerting system, based on a specialized statistical test procedure, to help you launch experiments at scale with confidence.

Sequential Testing Alerts offer a sensitivity threshold to detect underperforming variations early. They trigger alerts and can pause experiments, helping you cut your losses. They enable you to make data-driven decisions swiftly and optimize user experiences effectively. This alerting system is based on the percentage of conversions and/or traffic lost.

When creating a test, you simply check a box, select the sensitivity level and you’re done for the rest of the experiment’s life. Then, if no alert has been received, you can go ahead and analyze your experiment as usual; but in case of a problem, you’ll be alerted through the notification center. Then you head on over to the report page to assess what happened and make a final decision about the experiment. 

Can I also use this to identify winners faster?

No; this may sound weird but it’s not a good idea, even if some vendors are telling you that you should do it. You may be able to spot a winner faster but the estimation of the effect size will be very poor, preventing you from making wise business decisions.

But what about “dynamic allocation”?

First of all, what is dynamic allocation? As its name suggests, it is an experimentation method where the allocation of traffic is not fixed but depends on the variations’ performance. While this may be seen as a way to limit the effect of bad variations, we don’t recommend using it for this purpose for several reasons:

  • As you know, the strict testing protocol strongly suggests not to change the allocation during a test, so dynamic allocation is an edge case of A/B testing. We only suggest using it when you have no other choice. Classic use cases are very low traffic, or very short time frames (flash sales, for instance). So if your only concern is avoiding big losses from a bad variation, alerting is a better and safer option that respects the A/B testing protocol.
  • Since dynamic allocation changes the allocation, in order to avoid the risk of missing out on a winner, it always keeps a minimum traffic amount for the supposed losing variation, even if it’s losing by a lot.

Therefore, dynamic allocation isn’t a way to protect revenue but rather to explore options in the context of limited traffic.

Other benefits

Another benefit of the sequential testing alert feature is better usage of your traffic. Experiments with a lot of traffic with no chance to win are also detected by the alerting system. The system will suggest to the user to stop that experiment and make better use of the traffic for other experiments.

The bottom line is: you can’t do CRO at scale without an automated alert system. Receiving such crucial alerts can protect you from experiments that could otherwise cause revenue loss. To learn more about our sequential testing alert and how it works, refer to our documentation.

Sample Ratio Mismatch: What Is It and How Does It Happen?
https://www.abtasty.com/blog/sample-ratio-mismatch/
Mon, 28 Nov 2022

A/B testing can bring out a few types of experimental flaws.

Yes, you read that right – A/B testing is important for your business, but only if you have trustworthy results. To get reliable results, you must be on the lookout for errors that might occur while testing.

Sample ratio mismatch (SRM) is a term that is thrown around in the A/B testing world. It’s essential to understand its importance during experimentation.

In this article, we will break down the meaning of sample ratio mismatch, when it is and is not a problem, why it can happen and how to detect it.

Sample ratio mismatch overview

Sample ratio mismatch is an experimental flaw where the expected traffic allocation doesn’t fit with the observed visitor number for each testing variation.

In other words, an SRM is evidence that something went wrong.

Sample ratio mismatch is crucial to be aware of in A/B testing.

Now that you have the basic idea, let’s break this concept down piece by piece.

What is a “sample”?

The “sample” portion of SRM refers to the traffic allocation.

Traffic allocation refers to how the traffic is split toward each test variation. Typically, the traffic will be split equally (50/50) during an A/B test. Half of the traffic will be shown the new variation and the other half will go toward the control version.

This is how an equal traffic allocation will look for a basic A/B test with only one variant:

[Figure: equal 50/50 traffic allocation in an A/B test]

If your test has two variants or even three variants, the traffic will still be allocated equally to ensure that each version receives the same amount of traffic. An equal traffic allocation in an A/B/C test will be split into 33/33/33.

For both A/B and A/B/C tests, traffic can be split in other ways, such as 60/40, 30/70, 20/30/50, etc. Although this is possible, it is not recommended if you want accurate and trustworthy results from your experiment.
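For the curious, one common way to implement a traffic split is to hash each user into a bucket, so that the same user always sees the same version. The sketch below is a generic illustration with made-up identifiers, not a description of how any specific platform allocates traffic.

```python
import hashlib

def assign_variation(user_id, experiment_id, weights):
    """Deterministically map a user to a variation according to `weights`.

    `weights` maps variation names to shares that sum to 1, e.g. {"A": 0.5, "B": 0.5}.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # pseudo-random number in [0, 1)
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return name  # guard against floating-point rounding in the weights

split = {"A": 0.5, "B": 0.5}
counts = {"A": 0, "B": 0}
for i in range(10000):
    counts[assign_variation(f"user-{i}", "exp-1", split)] += 1
print(counts)
```

Note that even with a perfect 50/50 rule, the observed counts will rarely be exactly equal, which is the starting point for the SRM discussion.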

Even if you follow this best practice, equally allocated traffic will not eliminate the chance of an SRM. This type of mismatch can still occur and must be checked for no matter the circumstances of the test.

Define sample ratio mismatch (SRM)

Now that we have a clear picture of what the “sample” is, we can build a better understanding of what SRM means:

  • SRM happens when the ratio of the sample does not match the desired 50/50 (or even 33/33/33) traffic allocation
  • SRM occurs when the observed traffic allocation to each variant does not match the  allocation chosen for the test
  • The control version and variation receive undesired mismatched samples

Whichever words you choose to describe SRM, we can now understand our original definition with more confidence:

“Sample ratio mismatch is an experimental flaw where the expected traffic allocation doesn’t fit with the observed visitor number for each testing variation.”

Is SRM always a problem?

To put it simply, SRM occurs when one test version receives a noticeably different amount of visitors than what was originally expected.

Imagine that you have set up a classic A/B test: Two variations with 50/50 traffic allocation. You notice at one point that version A receives 10,000 visitors and version B receives 10,500 visitors.

Is this truly a problem? What exactly happened in this scenario?

The catch is that an A/B test can’t follow the allocation scheme exactly, since visitors must be assigned at random. The small difference in traffic noted in the example above is something we would typically refer to as a “non-problem,” though a quick SRM check is still the way to confirm it.

If you are seeing a similar traffic allocation on your A/B test in the final stages, there is no need to panic.

A randomly generated traffic split has no way of knowing exactly how many visitors will stumble upon the A/B test during the given time frame of the test. This is why toward the end of the test duration, there may be a small difference in the traffic allocation while the majority (95%+) of traffic is correctly allocated.

When is SRM a problem?

Some tests may have SRM due to experimental setup.

When the SRM is a big problem, there will be a noticeable difference in traffic allocation.

If you see 1,000 visitors directed to one variant and 200 directed to the other, this is an issue. Sometimes, spotting SRM does not require a dedicated mathematical formula, as it is evident enough on its own.

However, such an extreme difference in traffic allocation is rare. Therefore, it’s essential to check the visitor counts with an SRM test before each test analysis.
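A common way to run such a check is a chi-square goodness-of-fit test on the visitor counts. The sketch below covers the simple two-version case; the function name and the 0.001 alert threshold are assumptions for illustration (it is a frequently used convention, but your tool may use another).

```python
import math

def srm_p_value(observed_a, observed_b, expected_share_a=0.5):
    """P-Value of a chi-square test comparing observed counts with the planned split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed alert level

for a, b in [(1000, 200), (5000, 5050)]:
    p = srm_p_value(a, b)
    verdict = "possible SRM" if p < SRM_THRESHOLD else "looks fine"
    print(f"{a} vs {b}: p = {p:.3g} -> {verdict}")
```

Keep in mind that with large samples, even a gap that looks small to the eye can be flagged, so it’s worth running the numbers rather than eyeballing them.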

Does SRM occur frequently?

Sample ratio mismatch can happen more often than we think. According to a study done by Microsoft & Booking, about 6% of experiments experience this problem.

Furthermore, if the test includes a redirect to an entirely new page, SRM can be even more likely.

Since we heavily rely on tests and trust their conclusions to make strategic business decisions, it’s important that you are able to detect SRM as early as possible when it happens during your A/B test.

Can SRM still affect tests using Bayesian methods?

The reality is that everyone needs to be on the lookout for SRM, no matter what type of statistical test they are running. This includes experiments using the Bayesian method.

There are no exemptions to the possibility of experiencing a statistically significant mismatch between the observed and expected results of a test. No matter the test, if the expected assumptions are not met, the results will be unreliable.

Sample ratio mismatch: why it happens

Sample ratio mismatch can happen due to a variety of different root causes. Here we will discuss three common examples that cause SRM.

One common example is when the redirection to one variant isn’t working properly for poorly connected visitors.

Another classic example is when the direct link to one variant is spread on social media, which brings all users who click on the link directly to one of the variants. This error does not allow the traffic to be properly distributed among the variants.

In a more complex case, it’s also possible that JS code included in a test crashes a variant for some visitor configurations. In this situation, some visitors sent to the crashing variant won’t be collected and indexed properly, which leads to SRM.

All of these examples have a selection bias: some non-random visitors are excluded. The non-random visitors are arriving directly from a link shared on social media, have a poor connection, or are visiting a crashing variant.

In any case, when these issues occur, the SRM is an indication that something went wrong and you cannot trust the numbers and the test conclusion.

Checking for SRM in your A/B tests

Something important to be aware of when doing an SRM check is that the priority metric needs to be “users” and not “visitors.” Users are the specific people allocated to each variation, whereas the visitors metric counts the number of sessions that each user makes.

It’s important to differentiate between the two because results may be skewed if a user comes back to their variation multiple times. An SRM detected on the “visitors” metric may not be reliable, but an SRM detected on the “users” metric is evidence of a problem.
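The difference between the two metrics is easy to see on a tiny, made-up session log. In this sketch, each row is one session and the user IDs are hypothetical.

```python
from collections import defaultdict

# Hypothetical session log: one (user_id, variation) pair per session.
sessions = [
    ("u1", "A"), ("u1", "A"), ("u1", "A"),
    ("u2", "B"),
    ("u3", "A"),
    ("u4", "B"), ("u4", "B"),
]

sessions_per_variation = defaultdict(int)
users_per_variation = defaultdict(set)
for user_id, variation in sessions:
    sessions_per_variation[variation] += 1
    users_per_variation[variation].add(user_id)  # a set ignores repeat visits

user_counts = {v: len(users) for v, users in users_per_variation.items()}
print("sessions:", dict(sessions_per_variation))
print("users:   ", user_counts)
```

Counting sessions suggests a lopsided split, while counting unique users shows the allocation is actually balanced: one returning user was enough to skew the session totals.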

SRM in A/B testing

Testing for sample ratio mismatch may seem a bit complicated or unnecessary at first glance. In reality, it’s quite the opposite.

Understanding what SRM is, why it happens, and how it can affect your results is crucial in A/B testing. Running an A/B test to help make key decisions is only helpful for your business if you have reliable data from those tests.

Want to get started on A/B testing for your website? AB Tasty is a great example of an A/B testing tool that allows you to quickly set up tests with low code implementation of front-end or UX changes on your web pages, gather insights via an ROI dashboard, and determine which route will increase your revenue.

Statistics: What are Type 1 and Type 2 Errors?
https://www.abtasty.com/blog/type-1-and-type-2-errors/
Mon, 17 Oct 2022

Statistical hypothesis testing implies that no test is ever 100% certain: that’s because we rely on probabilities to experiment.

When online marketers and scientists run hypothesis tests, they’re both looking for statistically significant results. This means that the results of their tests have to hold at a chosen confidence level (typically 95%).

Even though hypothesis tests are meant to be reliable, there are two types of errors that can still occur.

These errors are known as type 1 and type 2 errors (or type I and type II errors).

Let’s dive in and understand what type 1 and type 2 errors are and the difference between the two.

Type 1 and Type 2 Errors explained

Understanding Type I Errors

Type 1 errors – often equated with false positives – happen in hypothesis testing when the null hypothesis is true but rejected. The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena.

Simply put, type 1 errors are “false positives” – they happen when the tester validates a statistically significant difference even though there isn’t one.

Type 1 errors have a probability of “α”, which is set by the level of confidence that you choose (α = 1 – confidence level). A test with a 95% confidence level means that there is a 5% chance of getting a type 1 error when there is no real difference.

Consequences of a Type 1 Error

Why do type 1 errors occur? Type 1 errors can happen due to bad luck (the 5% chance has played against you) or because you didn’t respect the test duration and sample size initially set for your experiment.

Consequently, a type 1 error will bring in a false positive. This means that you will wrongly assume that your variation has worked even though it hasn’t.

In real-life situations, this could potentially mean losing possible sales due to a faulty assumption caused by the test.

Related: Sample Size Calculator for A/B Testing

A Real-Life Example of a Type 1 Error

Let’s say that you want to increase conversions on a banner displayed on your website. For that to work out, you’ve planned on adding an image to see if it increases conversions or not.

You start your A/B test by running a control version (A) against your variation (B) that contains the image. After 5 days, variation (B) outperforms the control version by a staggering 25% increase in conversions with an 85% level of confidence.

You stop the test and implement the image in your banner. However, after a month, you notice that your month-to-month conversions have actually decreased.

That’s because you’ve encountered a type 1 error: your variation didn’t actually beat your control version in the long run.

Related: Frequentist vs Bayesian Methods in A/B Testing

Want to avoid these types of errors during your digital experiments?

AB Tasty is an a/b testing tool embedded with AI and automation that allows you to quickly set up experiments, track insights via our dashboard, and determine which route will increase your revenue.

Understanding Type II Errors

In the same way that type 1 errors are commonly referred to as “false positives”, type 2 errors are referred to as “false negatives”.

Type 2 errors happen when you wrongly conclude that there is no winner between a control version and a variation, although there actually is one.

In more statistically accurate terms, type 2 errors happen when the null hypothesis is false and you subsequently fail to reject it.

If the probability of making a type 1 error is denoted “α”, the probability of a type 2 error is “β”. Beta is tied to the power of the test (i.e. the probability of not committing a type 2 error, which is equal to 1-β).

There are 3 parameters that can affect the power of a test:

  • Your sample size (n)
  • The significance level of your test (α)
  • The “true” value of your tested parameter, i.e. the real size of the effect
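The effect of these three parameters can be sketched with a normal approximation of the power of a two-sided test on two conversion rates. The function name and the rates used (10% for the control, 13% for the variation) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided z-test on two rates."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    # Probability that the observed z-score passes the critical value on the
    # side of the true effect (the opposite tail is negligible and ignored).
    return norm.cdf(abs(p_b - p_a) / se - z_crit)


for n in (500, 1000, 2000, 4000):
    power = power_two_proportions(0.10, 0.13, n)
    print(f"n={n:>5} per version  power={power:.2f}  beta={1 - power:.2f}")
```

Under these assumptions, power rises with the sample size and with the size of the true effect, and falls when you make α stricter.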

Consequences of a Type 2 Error

Similarly to type 1 errors, type 2 errors can lead to false assumptions and poor decision-making that can result in lost sales or decreased profits.

Moreover, getting a false negative (without realizing it) can discredit your conversion optimization efforts even though you could have proven your hypothesis. This can be a discouraging turn of events that could happen to any CRO expert and/or digital marketer.

A Real-Life Example of a Type 2 Error

Let’s say that you run an e-commerce store that sells cosmetic products for consumers. In an attempt to increase conversions, you have the idea to implement social proof messaging on your product pages, like NYX Professional Makeup.

You launch an A/B test to see if the variation (B) could outperform your control version (A).

After a week, you do not notice any difference in conversions: both versions seem to convert at the same rate and you start questioning your assumption. Three days later, you stop the test and keep your product page as it is.

At this point, you assume that adding social proof messaging to your store didn’t have any effect on conversions.

Two weeks later, you hear that a competitor had added social proof messages at the same time and observed tangible gains in conversions. You decide to re-run the test for a month in order to get more statistically relevant results based on an increased level of confidence (say 95%).

After a month – surprise – you discover positive gains in conversions for the variation (B). Adding social proof messages under the purchase buttons on your product pages has indeed brought your company more sales than the control version.

That’s right – your first test encountered a type 2 error!

Why are Type I and Type II Errors Important?

Type one and type two errors are errors that we may encounter on a daily basis. It’s important to understand these errors and the impact that they can have on your daily life.

With type 1 errors you are making an incorrect assumption and can lose time and resources. Type 2 errors can result in a missed opportunity to change, enhance, and innovate a project.

To avoid these errors, it’s important to pay close attention to the sample size and the significance level in each experiment.

]]>
Bayesian vs. Frequentist: How AB Tasty Chose Our Statistical Model https://www.abtasty.com/blog/bayesian-vs-frequentist/ Tue, 04 Jan 2022 15:59:37 +0000 https://www.abtasty.com/?p=89187 The debate over which is the “best” inferential statistical method is fierce. Discover the pros and cons of each one and which is the winner for us.

The post Bayesian vs. Frequentist: How AB Tasty Chose Our Statistical Model appeared first on abtasty.

]]>
The debate about the best way to interpret test results is becoming increasingly relevant in the world of conversion rate optimization.

Torn between two inferential statistical methods (Bayesian vs. frequentist), the debate over which is the “best” is fierce. At AB Tasty, we’ve carefully studied both of these approaches and there is only one winner for us.

There are a lot of discussions regarding the optimal statistical method: Bayesian vs. frequentist (Source)

 

But first, let’s dive in and explore the logic behind each method and the main differences and advantages that each one offers.

 

What is hypothesis testing?

The statistical hypothesis testing framework in digital experimentation can be expressed as two opposite hypotheses:

  • H0 states that there is no difference between the treatment and the original, meaning the treatment has no effect on the measured KPI.
  • H1 states that there is a difference between the treatment and the original, meaning that the treatment has an effect on the measured KPI.

 

The goal is to compute indicators that will help you make the decision of whether to keep or discard the treatment (a variation, in the context of AB Tasty) based on the experimental data. We first determine the number of visitors to test, collect the data, and then check whether the variation performed better than the original.

There are two hypotheses in the statistical hypothesis framework (Source)

 

Essentially, there are two approaches to statistical hypothesis testing:

  1. Frequentist approach: Comparing the data to a model.
  2. Bayesian approach: Comparing two models (that are built from data).

 

From the first moment, AB Tasty chose the Bayesian approach for conducting our current reporting and experimentation efforts.

 

What is the frequentist approach?

In this approach, we will build a model Ma for the original (A) that will give the probability p of seeing some data Da. It is a function of the data:

Ma(Da) = p

Then we can compute a p-value, Pv, from Ma(Db), which is the probability of seeing the data measured on variation B if it had been produced by the original (A).

Intuitively, if Pv is high, this means that the data measured on B could also have been produced by A (supporting hypothesis H0). On the other hand, if Pv is low, this means that there are very few chances that the data measured on B could have been produced by A (supporting hypothesis H1).

A widely used threshold for Pv is 0.05. This is equivalent to considering that, for the variation to have had an effect, there must be less than a 5% chance that the data measured on B could have been produced by A.
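As a rough illustration, the sketch below computes such a p-value with an exact binomial tail. It makes two simplifying assumptions that are not in the text above: the model Ma is reduced to a fixed conversion rate for A (10%), and the test is one-sided (it asks how likely it is to see at least as many conversions as B got). The figures (130 conversions out of 1,000 visitors on B) are hypothetical.

```python
from math import exp, lgamma, log


def binom_log_pmf(k, n, p):
    """Log-probability of exactly k successes in n trials at rate p (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def p_value_upper_tail(successes_b, visitors_b, rate_a):
    """P(at least successes_b conversions among visitors_b visitors | rate_a)."""
    return sum(exp(binom_log_pmf(k, visitors_b, rate_a))
               for k in range(successes_b, visitors_b + 1))


pv = p_value_upper_tail(130, 1000, 0.10)
print(f"Pv = {pv:.4f}, below the 0.05 threshold: {pv < 0.05}")
```

Working in log space (with lgamma) avoids overflow in the binomial coefficients for large visitor counts.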

This approach’s main advantage is that you only need to model A. This is interesting because it is the original variation, and the original exists for a longer time than B. So it would make sense to believe you could collect data on A for a long time in order to build an accurate model from this data. Sadly, the KPI we monitor is rarely stationary: Transactions or click rates are highly variable over time, which is why you need to build the model Ma and collect the data on B during the same period to produce a valid comparison. Clearly, this advantage doesn’t apply to a digital experimentation context.

This approach is called frequentist, as it measures how frequently specific data is likely to occur given a known model.

It is important to note that, as we have seen above, this approach does not compare the two processes.

Note: since p-values are not intuitive, they are often converted into a probability like this:

p = 1 - Pv

This value is then wrongly presented as the probability that H1 is true (meaning a difference between A & B exists). In fact, it is the probability that the data collected on B was not produced by process A.

 

What is the Bayesian approach (used at AB Tasty)?

In this approach, we will build two models, Ma and Mb (one for each variation), and compare them. These models, which are built from experimental data, produce random samples corresponding to each process, A and B. We use these models to produce samples of possible rates and compute the difference between these rates in order to estimate the distribution of the difference between the two processes.

Contrary to the first approach, this one does compare two models. It is referred to as the Bayesian approach or method.

Now, we need to build a model for A and B.

Clicks can be represented as binomial distributions, whose parameters are the number of tries and a success rate. In the digital experimentation field, the number of tries is the number of visitors and the success rate is the click or transaction rate. In this case, it is important to note that the rates we are dealing with are only estimates on a limited number of visitors. To model this limited accuracy, we use beta distributions (which are the conjugate prior of binomial distributions).

These distributions model the likelihood of a success rate measured on a limited number of trials.

Let’s take an example:

  • 1,000 visitors on A with 100 successes
  • 1,000 visitors on B with 130 successes

 

We build the model Ma = beta(1+success_a, 1+failures_a), where success_a = 100 and failures_a = visitors_a - success_a = 900.

You may have noticed a +1 for success and failure parameters. This comes from what is called a “prior” in Bayesian analysis. A prior is something you know before the experiment; for example, something derived from another (previous) experiment. In digital experimentation, however, it is well documented that click rates are not stationary and may change depending on the time of the day or the season. As a consequence, this is not something we can use in practice; and the corresponding prior setting, +1, is simply a flat (or non-informative) prior, as you have no previous usable experiment data to draw from.

For the three following graphs, the horizontal axis is the click rate while the vertical axis is the likelihood of that rate knowing that we had an experiment with 100 successes in 1,000 trials.

(Source: AB Tasty)

 

The curve shows that 10% is the most likely rate, 5% or 15% are very unlikely, and 11% is roughly half as likely as 10%.
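These relative likelihoods can be checked with a short sketch that evaluates the beta density of model Ma at a few rates. The helper name beta_pdf is ours; it simply implements the standard beta density with the parameters given above.

```python
from math import exp, lgamma, log


def beta_pdf(x, a, b):
    """Density of a beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))


successes_a, visitors_a = 100, 1000
a = 1 + successes_a                  # +1 comes from the flat prior
b = 1 + (visitors_a - successes_a)   # +1 comes from the flat prior

peak = beta_pdf(0.10, a, b)
for rate in (0.05, 0.10, 0.11, 0.15):
    ratio = beta_pdf(rate, a, b) / peak
    print(f"rate={rate:.2f}  likelihood relative to 10%: {ratio:.4f}")
```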

The model Mb is built the same way with data from experiment B:

Mb = beta(1+130, 1+870)

 

(Source: AB Tasty)

 

For B, the most likely rate is 13%, and the width of the curve’s shape is close to the previous curve.

Then we compare A and B rate distributions.

Blue is for A and orange is for B (Source: AB Tasty)

 

We see an overlapping area, around a 12% conversion rate, where both models have the same likelihood. To estimate the overlapping region, we need to sample from both models to compare them.

We draw samples from distribution A and B:

  • s_a[i] is the i th sample from A
  • s_b[i] is the i th sample from B

 

Then we apply a comparison function to these samples:

  • the relative gain: g[i] =100* (s_b[i] – s_a[i])/s_a[i] for all i.

 

It is the difference between the possible rates for A and B, relative to A (multiplied by 100 for readability in %).

We can now analyze the samples g[i] with a histogram:

The horizontal axis is the relative gain, and the vertical axis is the likelihood of this gain (Source: AB Tasty)

 

We see that the most likely value for the gain is around 30%.

The yellow line shows where the gain is 0, meaning no difference between A and B. Samples to the left of this line correspond to cases where A > B; samples to the right are cases where A < B.

We then define the gain probability as:

GP = (number of samples > 0) / total number of samples

 

With 1,000,000 (10^6) samples for g, we have 982,296 samples that are >0, making B>A ~98% probable.
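The whole procedure fits in a few lines. This is a minimal sketch using Python's standard library sampler; it draws 100,000 samples rather than 10^6 to keep the run short, and the seed is arbitrary, so the result will match the figure above only approximately.

```python
import random


def gain_probability(successes_a, visitors_a, successes_b, visitors_b,
                     n_samples=100_000, seed=42):
    """Estimate P(B > A) by sampling both beta models (flat prior: the +1)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        s_a = rng.betavariate(1 + successes_a, 1 + visitors_a - successes_a)
        s_b = rng.betavariate(1 + successes_b, 1 + visitors_b - successes_b)
        gain = 100 * (s_b - s_a) / s_a   # relative gain, in %
        wins += gain > 0
    return wins / n_samples


gp = gain_probability(100, 1000, 130, 1000)
print(f"gain probability: {gp:.3f}")
```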

We call this the “chances to win” or the “gain probability” (the probability that you will win something).

The gain probability is shown here (see the red rectangle) in the report:

(Source: AB Tasty)

 

Using the same sampling method, we can compute classic analysis metrics like the mean, the median, percentiles, etc.

Looking back at the previous chart, the vertical red lines indicate where most of the blue area is, intuitively which gain values are the most likely.

We have chosen to expose a best- and worst-case scenario with a 95% confidence interval. It excludes 2.5% of extreme best and worst cases, leaving out a total of 5% of what we consider rare events. This interval is delimited by the red lines on the graph. We consider that the real gain (as if we had an infinite number of visitors to measure it) lies somewhere in this interval 95% of the time.

In our example, this interval is [1.80%; 29.79%; 66.15%], meaning that it is quite unlikely that the real gain is below 1.80%, and it is also quite unlikely that the gain is more than 66.15%. There is an equal chance that the real gain is above or below the median, 29.79%.
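The interval comes straight from the sorted gain samples: the bounds are the 2.5th and 97.5th percentiles and the middle value is the median. The sketch below assumes a simple nearest-rank percentile and an arbitrary seed, so expect values close to, but not exactly equal to, the ones quoted above.

```python
import random


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q between 0 and 100)."""
    index = min(len(sorted_values) - 1, int(q / 100 * len(sorted_values)))
    return sorted_values[index]


def gain_interval(successes_a, visitors_a, successes_b, visitors_b,
                  n_samples=100_000, seed=7):
    """Return (worst case, median, best case) of the relative gain, in %."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_samples):
        s_a = rng.betavariate(1 + successes_a, 1 + visitors_a - successes_a)
        s_b = rng.betavariate(1 + successes_b, 1 + visitors_b - successes_b)
        gains.append(100 * (s_b - s_a) / s_a)
    gains.sort()
    return tuple(percentile(gains, q) for q in (2.5, 50, 97.5))


low, median, high = gain_interval(100, 1000, 130, 1000)
print(f"[{low:.2f}%; {median:.2f}%; {high:.2f}%]")
```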

The confidence interval is shown here (in the red rectangle) in the report (on another experiment):

(Source: AB Tasty)

 

What are “priors” for the Bayesian approach?

Bayesian frameworks use the term “prior” to refer to the information you have before the experiment. For instance, a common piece of knowledge tells us that e-commerce transaction rate is mostly under 10%.

It would have been very interesting to incorporate this, but these assumptions are hard to make in practice due to the seasonality of data having a huge impact on click rates. In fact, it is the main reason why we do data collection on A and B at the same time. Most of the time, we already have data from A before the experiment, but we know that click rates change over time, so we need to collect click rates at the same time on all variations for a valid comparison.

It follows that we have to use a flat prior, meaning that the only thing we know before the experiment is that rates are in [0%, 100%], and that we have no idea what the gain might be. This is the same assumption as the frequentist approach, even if it is not formulated.

 

Challenges in statistics testing

As with any testing approach, the goal is to eliminate errors. There are two types of errors that you should avoid:

  • False positive (FP): When you pick a winning variation that is not actually the best-performing variation.
  • False negative (FN): When you miss a winner. Either you declare no winner or declare the wrong winner at the end of the experiment.

Performance on both these measures depends on the threshold used (p-value or gain probability), which depends, in turn, on the context of the experiment. It’s up to the user to decide.

Another important parameter is the number of visitors used in the experiment, since this has a strong impact on the false negative errors.

From a business perspective, the false negative is an opportunity missed. Mitigating false negative errors is all about the size of the population allocated to the test: basically, throwing more visitors at the problem.

The main problem then is false positives, which mainly occur in two situations:

  • Very early in the experiment: Before reaching the targeted sample size, when the gain probability goes higher than 95%. Impatient users may draw conclusions too quickly, without enough data, and end up declaring a false positive.
  • Late in the experiment: When the targeted sample size is reached, but no significant winner is found. Some users believe in their hypothesis too much and want to give it another chance.

 

Both of these problems can be eliminated by strictly respecting the testing protocol: Setting a test period with a sample size calculator and sticking with it.
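For readers who want to see what a sample size calculator does under the hood, here is a minimal sketch based on the textbook normal-approximation formula for comparing two conversion rates. It is an assumption on our part that this formula is a reasonable stand-in; a given vendor's calculator may use a different method. The function name and default values (α = 0.05, 80% power) are illustrative.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(base_rate, expected_rate, alpha=0.05, power=0.80):
    """Visitors needed on each variation to detect base_rate -> expected_rate."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = norm.inv_cdf(power)           # protection against false negatives
    variance = (base_rate * (1 - base_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - base_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


for target in (0.11, 0.12, 0.13):
    n = visitors_per_variation(0.10, target)
    print(f"10% -> {target:.0%}: {n} visitors per variation")
```

The smaller the gain you want to detect, the more visitors you need, which is why the planned sample size has to be fixed before the test starts.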

At AB Tasty, we provide a visual checkmark called “readiness” that tells you whether you respect the protocol (a period that lasts a minimum of 2 weeks and has at least 5,000 visitors). Any decision outside these guidelines should respect the rules outlined in the next section to limit the risk of false positive results.

This screenshot shows how the user is informed as to whether they can take action.

(Source: AB Tasty)

 

Looking at the report during the data collection period (without the “reliability” checkmark) should be limited to checking that the collection is correct and checking for extreme cases that require emergency action, not to making a business decision.

 

When should you finalize your experiment?

Early stopping

“Early stopping” is when a user wants to stop a test before reaching the allocated number of visitors.

A user should wait for the campaign to reach at least 1,000 visitors and only stop if a very big loss is observed.

If a user wants to stop early for a supposed winner, they should wait at least two weeks, and only use full weeks of data. This tactic is interesting if and when the business cost of a false positive is okay, since it is more likely that the performance of the supposed winner would be close to the original, rather than a loss.

Again, if this risk is acceptable from a business strategy perspective, then this tactic makes sense.

If a user sees a winner (with a high gain probability) at the beginning of a test, they should ensure a margin for the worst-case scenario. A lower bound on the gain that is near or below 0% has the potential to evolve and end up below or far below zero by the end of a test, undermining the perceived high gain probability at its beginning. Avoiding stopping early with a low left confidence bound will help rule out false positives at the beginning of a test.

For instance, a situation with a gain probability of 95% and a confidence interval like [-5.16%; 36.48%; 98.02%] is characteristic of early stopping. The gain probability is above the accepted standard, so one might be willing to push 100% of the traffic to the winning variation. However, the worst-case scenario (-5.16%) is relatively far below 0%. This indicates a possible false positive and, at any rate, a risky bet with a worst-case scenario that loses 5% of conversions. It is better to wait until the lower bound of the confidence interval is above 0%, and a little margin on top would be even safer.
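The guidelines in this section can be condensed into a small checklist function. This is only a sketch of the rules as described here (at least two weeks, full weeks only, gain probability at or above 95%, lower bound above 0% plus an optional margin); the function and parameter names are hypothetical and do not correspond to any AB Tasty API.

```python
def safe_to_stop_early(gain_probability, lower_bound_pct, days_running,
                       min_probability=0.95, min_days=14, margin_pct=0.0):
    """Return True only if every early-stopping guideline is satisfied."""
    full_weeks_only = days_running % 7 == 0
    return (gain_probability >= min_probability
            and lower_bound_pct > margin_pct      # worst case must stay positive
            and days_running >= min_days
            and full_weeks_only)


# The risky example from the text: high gain probability, negative worst case.
print(safe_to_stop_early(0.95, -5.16, 14))
# A hypothetical safer case: worst-case gain still above zero after 3 full weeks.
print(safe_to_stop_early(0.98, 1.80, 21))
```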

 

Late stopping

“Late stopping” is when, at the end of a test, without finding a significant winner, a user decides to let the test run longer than planned. Their hypothesis is that the gain is smaller than expected and needs more visitors to reach significance.

When deciding whether to extend the life of a test, not following the protocol, one should consider the confidence interval more than the gain probability.

If the user wants to test longer than planned, we advise extending only very promising tests. This means having a high best-scenario value (the right bound of the gain confidence interval should be high).

For instance, this scenario: gain probability at 99% and confidence interval at [0.42 %; 3.91%] is typical of a test that shouldn’t be extended past its planned duration: A great gain probability, but not a high best-case scenario (only 3.91%).

Consider that with more samples, the confidence interval will shrink. This means that if there is indeed a winner at the end, its best-case scenario will probably be smaller than 3.91%. So is it really worth it? Our advice is to go back to the sample size calculator and see how many visitors will be needed to achieve such accuracy.

Note: These numerical examples come from a simulation of A/A tests, selecting the failed ones.

 

Confidence intervals are the solution

Using the confidence interval instead of only looking at the gain probability will strongly improve decision-making. Even outside of the problem of false positives, it matters for the business: every variation needs to cover the cost of its implementation in production. One should keep in mind that the original is already there and has no additional cost, so there is always an implicit and practical bias toward the original.

Any optimization strategy should have a minimal threshold on the size of the gain.

Another type of problem may arise when testing more than two variations, known as the multiple comparison problem. In this case, a Holm-Bonferroni correction is applied.
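For reference, the classic Holm-Bonferroni procedure is usually stated in terms of p-values: sort them, then compare the smallest to α/m, the next to α/(m-1), and so on, stopping at the first one that fails. The sketch below implements that textbook version; how the correction is wired into a Bayesian report is not described here, so treat this as an illustration of the general technique only.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    rejected = [False] * m
    for rank, i in enumerate(order):
        # Thresholds grow from alpha/m (strictest) up to alpha (last one).
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return rejected


# Three variations compared with the original (hypothetical p-values).
print(holm_bonferroni([0.01, 0.04, 0.03]))
```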

 

Why AB Tasty chose the Bayesian approach

Wrapping up, which is better: the Bayesian vs. frequentist method?

As we’ve seen in the article, both are perfectly sound statistical methods. AB Tasty chose the Bayesian statistical model for the following reasons:

  • Using a probability index that corresponds better to what the users think, and not a p-value or a disguised one;
  • Providing confidence intervals for more informed business decisions (not all winners are really interesting to push in production). It’s also a means to mitigate false positive errors.

 

At the end of the day, it makes sense that the frequentist method was originally adopted by so many companies when it first came into play. After all, it’s an off-the-shelf solution that’s easy to code and can be easily found in any statistics library (this is a particularly relevant benefit, seeing as how most developers aren’t statisticians).

Nonetheless, even though it was a great resource when it was introduced into the experimentation field, there are better options now — namely, the Bayesian method. It all boils down to what each option offers you: While the frequentist method shows whether there’s a difference between A and B, the Bayesian one actually takes this a step further by calculating what the difference is.

To sum up, when you’re conducting an experiment, you already have the values for A and B. Now, you’re looking to find what you will gain if you change from A to B, something which is best answered by a Bayesian test.

 

]]>
Building the Ideal E-Commerce Tech Stack for Growth https://www.abtasty.com/resources/ecommerce-tech-stack-growth/ Fri, 05 Nov 2021 17:30:43 +0000 https://www.abtasty.com/?post_type=resources&p=86210 The e-commerce world is saturated, crowded, and highly competitive. Businesses today must differentiate through the experiences they provide to their customers, and these days, those experiences are mostly digital. Partners who can help you maximize your customer experience will also […]

The post Building the Ideal E-Commerce Tech Stack for Growth appeared first on abtasty.

]]>
The e-commerce world is saturated, crowded, and highly competitive. Businesses today must differentiate through the experiences they provide to their customers, and these days, those experiences are mostly digital. Partners who can help you maximize your customer experience will also help drive your growth.

If you don’t yet have your tech stack nailed down, if you’re looking to expand your stack, or if you’re looking to get the most out of what you have, it can be difficult to know where to start. Having partners that can support your agility and enable scaling growth is a critical consideration to make.

Challenge

So much of the experience you provide to your customers is predicated on the stack you’ve chosen to support your business. Can your partners help you stay agile to pivot with changing customer behavior? Can your stack enable you to validate impactful ideas before spending time and money on development? Will you be more able to engage and understand your customers?

What you’ll learn

  • Understanding personalization and privacy
  • Refining your experience based on data
  • Client-side or server-side: choosing the right approach for continuous optimization

]]>
The Role of Statistical Significance in A/B Testing https://www.abtasty.com/blog/statistical-significance/ Mon, 21 Jun 2021 12:20:31 +0000 https://www.abtasty.com/?p=19237 Statistical significance is a widely-used concept in statistical hypothesis testing. It indicates the probability that the difference or observed relationship between a variation and a control isn’t due to chance.

The post The Role of Statistical Significance in A/B Testing appeared first on abtasty.

]]>
Statistical significance is a powerful yet often underutilized digital marketing tool. 

A concept that is theoretical and practical in equal measures, you can use statistical significance models to optimize many of your business’s core marketing activities (A/B testing included).

A/B testing is integral to improving the user experience (UX) of a consumer-facing touchpoint (a landing page, checkout process, mobile application, etc.) and increasing its performance while encouraging conversions.

By creating two versions of a particular marketing asset, both with slightly different functions or elements, and analyzing their performance, it’s possible to develop an optimized landing page, email, web app, etc. that yields the best results. This methodology is also referred to as two-sample hypothesis testing.

When it comes to success in A/B testing, statistical significance plays an important role. In this article, we will explore the concept in more detail and consider how statistical significance can enhance the A/B testing process.

But before we do that, let’s look at the meaning of statistical significance.

What is statistical significance and why does it matter?

According to Investopedia, statistical significance is defined as:

“The claim that a result from data generated by testing or experimentation is not likely to occur randomly or by chance but is instead likely to be attributable to a specific cause.”

In that sense, statistical significance will bestow you with the tools to drill down into a specific cause, thereby making informed decisions that are likely to benefit the business. In essence, it’s the opposite of shooting in the dark.

Statistical significance
Make informed decisions with testing and experimentation

Calculating statistical significance

To calculate statistical significance accurately, most people use Pearson’s chi-squared test or distribution.

Developed by Karl Pearson, the chi-squared test (chi is the Greek letter χ) compares the counts you observed with the counts you would expect if the null hypothesis were true. It squares each difference, divides it by the expected count and sums the results.

This methodology is based on counts (whole numbers). For instance, chi-squared is often used to test marketing conversions, a clear-cut scenario where users either take the desired action or they don’t.

In a digital marketing context, people apply Pearson’s chi-squared method using the following formula:

Statistically significant = Probability (p) < Threshold (α)

Based on this notion, a test or experiment is viewed as statistically significant if the probability (p) turns out lower than the chosen threshold (α), also referred to as the alpha. In plainer terms, a test will prove statistically significant if there is a low probability that a result has happened by chance.

Statistical significance is important because applying it to your marketing efforts will give you confidence that the adjustments you make to a campaign, website, or application will have a positive impact on engagement, conversion rates, and other key metrics.

Essentially, statistically significant results are unlikely to be due to chance, and whether you reach significance depends on two primary variables: sample size and effect size.

Statistical significance and digital marketing

At this point, it’s likely that you have a grasp of the role that statistical significance plays in digital marketing.

Without validating your data or giving your discoveries credibility, you will probably end up taking promotional actions that offer very little value or return on investment (ROI), particularly when it comes to A/B testing.

Despite the wealth of data available in the digital age, many marketers are still making decisions based on their gut.

While the shooting-in-the-dark approach may yield positive results on occasion, to create campaigns or assets that resonate with your audience on a meaningful level, making intelligent decisions based on watertight insights is crucial.

That said, when conducting tests or experiments based on key elements of your digital marketing activities, taking a methodical approach will ensure that every move you make offers genuine value, and statistical significance will help you do so.

Using statistical significance for A/B testing

Now we move on to A/B testing, or more specifically, how you can use statistical significance techniques to enhance your A/B testing efforts.

Testing uses

Before we consider its practical applications, let’s consider what A/B tests you can run using statistical significance:

  • Email clicks, open rates, and engagements
  • Landing page conversion rates
  • Notification responses
  • Push notification conversions
  • Customer reactions and browsing behaviors
  • Product launch reactions
  • Website calls to action (CTAs)

The statistical steps

To conduct successful A/B tests using statistical significance (the chi-squared test), you should follow these definitive steps:

1. Set a null hypothesis

The null hypothesis states that there is no real effect, so any difference you observe is due to chance. For example, a null hypothesis might be that your audience does not prefer your new checkout journey to the original checkout journey. Such a hypothesis or statement will be used as an anchor or a benchmark.

2. Create an alternative theory or hypothesis

Once you’ve set your null hypothesis, you should create an alternative theory, one that you’re looking to prove, definitively. In this context, the alternative statement could be: our audience does favor our new checkout journey.

3. Set your testing threshold

With your hypotheses in place, you should set a percentage threshold (α, or alpha) that will dictate the validity of your theory. The lower you set the threshold, the stricter the test will be. If your test is based on a wider asset such as an entire landing page, then you might set a higher threshold than if you’re analyzing a very specific metric or element like a CTA button, for instance.

For conclusive results, it’s imperative to set your threshold prior to running your A/B test or experiment.

4. Run your A/B test

With your theories and threshold in place, it’s time to run the A/B test. In this example, you would run two versions (A and B) of your checkout journey and document the results.

Here you might compare cart abandonment and conversion rates to see which version has performed better. If checkout journey B (the newer version) has outperformed the original (version A), the data point towards your alternative hypothesis, but you still need to check that the difference is statistically significant.

5. Apply the chi-squared method

Armed with your discoveries, you will be able to apply the chi-squared test to determine whether the actual results differ from the expected results.

By applying chi-squared calculations to your results, you will be able to determine whether the outcome is statistically significant (that is, whether your p-value is lower than your α), and so gain confidence in your decisions, activities, or initiatives.
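
As a rough illustration of this step, here is a minimal Python sketch of a Pearson chi-squared test on a 2x2 table of conversions. The function name, the visitor counts and the 0.05 threshold are all made up for the example, and the sketch assumes exactly two versions, a binary outcome and no continuity correction. It is not a description of how any particular testing platform computes its results.

```python
import math

def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared test (no continuity correction) for two versions.

    Hypothetical helper for illustration. Assumes every expected count is
    greater than zero, otherwise the division below fails.
    """
    # Observed counts: one row per version, columns = converted / not converted.
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, (total_a - conv_a) + (total_b - conv_b)]
    grand_total = total_a + total_b

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if the null hypothesis were true.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom; in that case the chi-squared
    # tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

ALPHA = 0.05  # threshold chosen before the test (assumed value)

# Made-up data: 200 of 2,000 visitors converted on A, 250 of 2,000 on B.
chi2, p_value = chi_squared_2x2(conv_a=200, total_a=2000, conv_b=250, total_b=2000)
is_significant = p_value < ALPHA
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, significant: {is_significant}")
```

With these illustrative numbers the p-value falls below the 0.05 threshold, so the null hypothesis would be rejected.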

6. Put theory into action

If you’ve arrived at a statistically significant result, then you should feel confident transforming theory into practice.

In this particular example, if our checkout journey theory shows a statistically significant relationship, then you would make the informed decision to launch the new version (version B) to your entire consumer base or population, rather than certain segments of your audience.

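If your results are not statistically significant, you haven’t shown that the two versions perform the same, only that this test couldn’t detect a difference. In that case, you would run another A/B test using a bigger sample.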
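
To get a feel for how big a “bigger sample” might need to be, the sketch below uses a standard normal-approximation formula for comparing two conversion rates. For two versions, this closely mirrors a two-sided chi-squared test. The function name, the baseline and target rates, and the 80% power setting are assumptions made for this example. Treat the result as a rough planning figure, not an exact requirement.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per version to detect the given difference.

    Hypothetical helper: uses the normal approximation for two proportions
    with a two-sided threshold alpha and the requested statistical power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Made-up scenario: 10% baseline conversion, hoping to detect a lift to 12.5%.
n_large_lift = sample_size_per_variant(0.10, 0.125)
# A smaller expected lift (10% to 11%) needs a much larger sample.
n_small_lift = sample_size_per_variant(0.10, 0.11)
print(n_large_lift, n_small_lift)
```

The smaller the difference you expect between versions, the more visitors each version needs before the test can reliably detect it.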

At first, running statistical significance experiments can prove challenging, but there are free online calculation tools that can help to simplify your efforts.

Statistical significance and A/B testing: what to avoid

While it’s important to understand how to apply statistical significance to your A/B tests effectively, knowing what to avoid is equally vital.

Here is a rundown of common A/B testing mistakes to ensure that you run your experiments and calculations successfully:

  • Unnecessary usage: If your marketing initiatives or activities are low cost or reversible, then you needn’t apply statistical significance to your A/B tests, as this will ultimately cost you time. If you’re testing something irreversible or something that requires a definitive answer, then you should apply chi-squared testing.
  • Lack of adjustments for multiple comparisons: If your test includes multiple variations or multiple comparisons, you should adjust your significance threshold to account for them. Failing to do so raises the chance of a false positive, which can throw off your results and render them unusable in some instances.
  • Creating biases: When conducting A/B tests of this type, it’s common to introduce biases into your experiments unwittingly, for example by using a sample that doesn’t represent the population or consumer base as a whole.

To avoid doing this, you must examine your test with a fine-tooth comb before launch to ensure that there aren’t any variables that could push or pull your results in the wrong direction. For example, is your test skewed towards a specific geographical region or narrow user demographic? If so, it might be time to make adjustments.
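
Returning to the multiple-comparisons point in the list above, one simple and deliberately conservative adjustment is the Bonferroni correction, which divides the threshold by the number of comparisons. The article doesn’t prescribe a specific correction, so this is just one possible approach. The function name, variant names and p-values are invented for the sketch.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons stay significant after a Bonferroni correction.

    Hypothetical helper: p_values maps a comparison name to its p-value.
    """
    # Each comparison must clear a stricter threshold: alpha / number of tests.
    adjusted_alpha = alpha / len(p_values)
    return {name: p < adjusted_alpha for name, p in p_values.items()}

# Made-up p-values from comparing three variations against the original.
p_values = {"variant_b": 0.012, "variant_c": 0.030, "variant_d": 0.200}
results = bonferroni_significant(p_values)
print(results)
```

In this example, variant_c would have passed an unadjusted 0.05 threshold but does not pass the adjusted one (0.05 / 3), which is the kind of false positive the adjustment is meant to guard against.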

Statistical significance plays a pivotal role in A/B testing and, if handled correctly, will offer a level of insight that can help catalyze business success across industries.

While you shouldn’t rely on statistical significance alone for insight or validation, it’s certainly a tool that you should have in your digital marketing toolkit.

We hope that this guide has given you all you need to get started with statistical significance. If you have any wisdom to share, please do so by leaving a comment.

The post The Role of Statistical Significance in A/B Testing appeared first on abtasty.

]]>