The A/B Test You're Running Is Wrong: A Guide to Statistical Power
Most A/B tests are stopped too early, run with too little traffic, or declare winning variants based on data that cannot support the conclusion. Statistical power — the probability that your test will detect a real effect if one exists — is the variable almost nobody calculates before running. The result is an industry built on false positives.
How A/B Tests Actually Fail
The most common A/B testing failure is not a statistical error in the classical sense. It is a behavioral one. A test is started, traffic is split, and the team checks results daily. When the numbers look favorable — when the conversion rate for the test variant is running higher than the control — the test is called. The variant is declared a winner. The change goes live.
The problem is that conversion rate data is noisy, especially early in a test. In the first few days, random fluctuations will frequently produce apparent leads for one variant or the other. If you watch your test and stop it when the results look favorable, you will systematically overestimate your win rate. Simulations of this stopping behavior, where a test is monitored daily and stopped as soon as p < 0.05 is reached, show false positive rates of 25 to 40 percent, compared to the 5 percent the test was designed to produce. You are declaring winners at five to eight times the appropriate rate.
This is called optional stopping or peeking bias. The mathematical reason is that your significance threshold was calculated assuming you would look at the data exactly once, at the end of a pre-specified sample size. Each additional look is another chance for random noise to cross the threshold, so looking repeatedly and stopping at the first significant result inflates your Type I error rate. The inflation grows with the number of looks, though not in strict proportion, because successive looks share most of their data. Running a test until it reaches significance is not a completion criterion. It is a method for generating false positives.
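The inflation described above can be reproduced with a small simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same true rate, so every "winner" is a false positive) and compares daily peeking against a single read on the final day. The traffic level, baseline rate, test length, and function names are assumptions chosen for the example, not figures from any real program.

```python
import random
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa_tests(n_tests=400, days=14, daily_visitors=200,
                      rate=0.10, alpha=0.05, seed=1):
    """Run A/A tests (no true effect) and count false positives two ways."""
    rng = random.Random(seed)
    peeking_hits = 0  # p < alpha at ANY daily look (a peeker would have stopped)
    fixed_hits = 0    # p < alpha at the single, final read
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        p = 1.0
        for _day in range(days):
            # Each visitor converts independently with the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += p < alpha  # p now holds the final-day value
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"peeking: {peeking_rate:.1%}  fixed horizon: {fixed_rate:.1%}")
```

The exact rates depend on the seed and on how many looks are taken, but the peeking rate should sit well above the nominal 5 percent while the fixed-horizon rate stays near it.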
Statistical Power: The Number You Should Calculate First
Statistical power is the probability that your test will correctly detect a real effect, given that one exists. The industry default is 80% power — meaning that if your variant truly improves conversion by the amount you specified, your test has an 80% chance of returning a statistically significant result.
The implication that is almost never stated: at 80% power, if your variant has a real effect, you will miss it 20% of the time. Those missed effects are called Type II errors, or false negatives. A test that fails to reach significance is not proof that the variant did not work. It is entirely consistent with the variant working but your test being too small to detect it.
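To make the false-negative risk concrete, the following sketch estimates the power a test of a given size actually has. It assumes a two-sided two-proportion z-test under the normal approximation; the function name and the example inputs are hypothetical.

```python
from statistics import NormalDist

def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Standard error of the difference when each arm has n_per_variant visitors.
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se  # true effect measured in standard errors
    # Probability the observed z lands beyond either critical value.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Hypothetical scenario: 2% baseline, a true 10% relative lift, 10,000 visitors per arm.
power = achieved_power(0.02, 0.10, 10_000)
print(f"chance of detecting the lift: {power:.0%}, chance of missing it: {1 - power:.0%}")
```

With zero true lift the function returns alpha, which is a quick sanity check: a test of a variant that does nothing should come up significant at the false positive rate and no more.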
Power is determined by four inputs: your sample size per variant, your significance threshold (usually α = 0.05), the baseline conversion rate of your control, and the Minimum Detectable Effect (MDE), the smallest improvement you care to detect. In practice you fix a power target and solve for the sample size. If you want to detect a 5% relative improvement on a 2% conversion rate, you need far more traffic than if you are only trying to detect a 20% improvement. The relationship between these variables is not intuitive, because required traffic scales roughly with the inverse square of the effect size. At 80% power and a two-sided α of 0.05, detecting a 5% relative improvement on a 2% conversion rate requires approximately 315,000 visitors per variant under the standard two-proportion normal approximation. Detecting a 20% relative improvement on the same rate requires approximately 21,000.
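A minimal sketch of the sample size calculation, assuming a two-sided two-proportion z-test and the common normal-approximation formula n = (z₁₋α/₂ + z₁₋β)² · [p₁(1 − p₁) + p₂(1 − p₂)] / (p₂ − p₁)². The function name is hypothetical, and exact figures vary with the formula and sidedness a given calculator uses.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Compare traffic needs across MDEs on a 2% baseline.
for lift in (0.05, 0.10, 0.20):
    n = visitors_per_variant(0.02, lift)
    print(f"{lift:.0%} relative lift: {n:,} visitors per variant")
```

Halving the MDE roughly quadruples the required traffic, which is why the choice of MDE dominates every other input.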
The Three Failure Modes
Nearly every broken A/B test fails in one of three ways. Understanding which failure mode you are in determines what you can actually conclude from your data.
Underpowered tests. The test is not run long enough or with enough traffic to detect the effect size that was expected. The result is either a false positive — if the test is peeked at until significance is reached — or a null result that cannot be interpreted. An underpowered null result, where the confidence interval includes both large improvements and large reductions, tells you nothing about whether the variant worked. It tells you only that your test was too small to find out.
Peeking (optional stopping). The test is monitored while running and stopped when results look significant. This inflates Type I error rates to levels that invalidate the stated significance level. A test read once at a pre-planned sample size against a threshold of p < 0.05 has a 5% false positive rate. The same threshold applied under daily monitoring may have a 30% false positive rate. The displayed p-value is correct for a fixed-horizon test. It is not correct for the optional-stopping procedure that was actually used.
Multiple comparison inflation. A test is run with multiple variants, multiple metrics, or both, without adjusting the significance threshold for the number of comparisons. If you test ten variants against a control with a per-comparison significance threshold of p < 0.05, you expect approximately 0.5 false positives even if none of the variants work at all. Multiply this across the number of tests a CRO team runs in a year and the expected number of false positives becomes substantial. Standard corrections — Bonferroni, Benjamini-Hochberg — address this but are rarely applied in practice.
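The two corrections named above are short enough to sketch directly. The p-values below are invented for illustration, and the family-wise figure assumes independent comparisons, which is only approximately true when every variant shares one control.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate with the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is at or below alpha * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values for ten variants tested against one control.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.48, 0.62, 0.77, 0.91]
uncorrected = [p < 0.05 for p in p_values]

expected_false_positives = len(p_values) * 0.05
chance_of_at_least_one = 1 - (1 - 0.05) ** len(p_values)  # assumes independence

print("uncorrected winners:", sum(uncorrected))
print("Bonferroni winners: ", sum(bonferroni(p_values)))
print("BH winners:         ", sum(benjamini_hochberg(p_values)))
print(f"chance of at least one false positive with no real effects: {chance_of_at_least_one:.0%}")
```

Bonferroni is the stricter of the two because it controls the chance of any false positive at all. Benjamini-Hochberg controls the expected share of false positives among declared winners, which is often the more useful guarantee for a program that runs many tests.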
How to Size a Test Before Running It
Correct test sizing requires specifying three things before you begin: your significance threshold (α), your desired power (1 − β), and your minimum detectable effect. The MDE is the most important and least intuitive input. It is not the improvement you hope to achieve — it is the smallest improvement that would be meaningful enough to implement.
A reasonable heuristic for direct-to-consumer (DTC) conversion optimization: if the implementation cost of a change is low, such as a headline rewrite or a button color, an MDE of 5 to 10% relative improvement is sensible. If the implementation cost is high, such as a checkout flow rebuild or a new landing page architecture, you need to detect larger effects to justify the resource investment. Setting a higher MDE means you need less traffic and a shorter run time. Setting a low MDE means you need more traffic and more time, which has real costs of its own.
Most A/B testing platforms include sample size calculators. The output is a minimum number of visitors per variant needed before the test can be read. Treat this as a hard floor, not a guideline. The test cannot be read, even if the numbers look significant, until this minimum has been reached. If you reach your sample size without reaching significance, the correct conclusion is that your variant likely does not produce an improvement above your MDE. That is a valid and actionable result.
Minimum Detectable Effect: The Decision You Have to Make First
The MDE is where most teams make their first mistake. They either skip specifying it — in which case the test runs until someone decides to stop it — or they specify it too small, in which case the required sample size is impractically large.
The right question when setting an MDE is: what is the smallest lift that would change a business decision? If you are testing a checkout page change and your current checkout conversion rate is 3.2%, a 5% relative improvement would take you to 3.36%. Is that meaningful enough to build a roadmap around? If yes, power your test to detect 5%. If the answer is that you only care about lifts above 15%, then power your test for 15% and accept the risk of missing smaller improvements.
There is no objectively correct MDE. There is only the MDE that is consistent with your cost structure, your decision process, and the traffic you have available to run the test. Choosing it explicitly — before running — is what separates a test you can interpret from one you cannot.
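When traffic is the binding constraint, the calculation can be run in reverse: given the visitors you can realistically collect, what is the smallest lift the test could detect? The sketch below does this by bisection over the same normal-approximation sample size formula. The function names, the 50,000-visitor budget, and the search bounds are assumptions for illustration; the 3.2% baseline is the checkout example above.

```python
from statistics import NormalDist

def required_n(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided two-proportion z-test (approximate)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_sum ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

def smallest_detectable_lift(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Bisection for the smallest relative lift that fits the traffic budget."""
    lo = 1e-4
    hi = min(5.0, 0.99 / baseline - 1)  # keep the variant rate below 100%
    if required_n(baseline, hi, alpha, power) > n_per_variant:
        return None  # even a very large lift is undetectable at this traffic
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(baseline, mid, alpha, power) <= n_per_variant:
            hi = mid  # mid is detectable, try a smaller lift
        else:
            lo = mid  # mid needs more traffic than we have
    return hi

# Hypothetical budget: 50,000 visitors per variant on a 3.2% checkout rate.
lift = smallest_detectable_lift(0.032, 50_000)
print(f"smallest detectable relative lift: {lift:.1%}")
```

If the answer is larger than any lift you could plausibly expect from the change, the test is not worth running in its current form.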
Sequential Testing: When You Cannot Wait
If you genuinely cannot wait until a pre-planned sample size is reached — because the cost of running a losing variant is too high — sequential testing methods exist that allow ongoing monitoring without inflating error rates. The correct approach is to use a sequential testing framework from the start, not to run a fixed-horizon test and check it early.
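As a rough illustration of what "from the start" means, the sketch below pre-commits to a fixed number of looks and splits α evenly across them. This is the crudest valid correction, not what any particular platform implements; group sequential boundaries and always-valid p-values are less conservative. The simulation models the z-statistic under no true effect as a running sum of equal-sized, independent batches, which is an assumption of the example.

```python
import random
from statistics import NormalDist

def false_positive_rate(z_threshold, looks=10, n_sims=4000, seed=7):
    """Share of no-effect tests that cross |z| > z_threshold at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one equal-sized batch of data
            z = running_sum / look ** 0.5       # z-statistic on all data so far
            if abs(z) > z_threshold:
                hits += 1                       # the test would be stopped here
                break
    return hits / n_sims

alpha, looks = 0.05, 10
naive_z = NormalDist().inv_cdf(1 - alpha / 2)            # same threshold at every look
split_z = NormalDist().inv_cdf(1 - alpha / (2 * looks))  # alpha shared across looks

naive_rate = false_positive_rate(naive_z, looks)
split_rate = false_positive_rate(split_z, looks)
print(f"naive peeking: {naive_rate:.1%}  alpha split over {looks} looks: {split_rate:.1%}")
```

The price of the stricter per-look threshold is lower power at any single look, so a sequential design generally needs a larger maximum sample size than a fixed-horizon test with the same MDE.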
Several platforms — including Optimizely's Stats Engine and VWO's SmartStats — implement Bayesian or sequential testing frameworks that allow continuous monitoring. These are appropriate tools if you understand their tradeoffs. Bayesian testing shifts from asking whether an effect is statistically significant to asking what the probability distribution over the true effect looks like. This framing is often more useful for business decisions but requires specifying a prior belief about the likely effect size, which introduces a subjective element that fixed-horizon frequentist testing avoids.
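To show what the Bayesian framing looks like in its simplest form, the sketch below uses a Beta-Binomial model and Monte Carlo draws to estimate the probability that the variant's true rate exceeds the control's. It is a textbook illustration, not a description of how Stats Engine or SmartStats work internally. The uniform Beta(1, 1) prior, the function name, and the example counts are assumptions.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0, draws=20_000, seed=3):
    """Estimate P(rate_B > rate_A) from Beta posteriors by simulation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(prior + conversions, prior + non-conversions).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 200 of 10,000 convert on control, 230 of 10,000 on the variant.
probability = prob_b_beats_a(200, 10_000, 230, 10_000)
print(f"probability the variant beats control: {probability:.1%}")
```

Changing `prior_alpha` and `prior_beta` shows the subjective element directly: a strong prior centered on "no difference" pulls the probability toward 50 percent until enough data accumulates to override it.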
What to Do With a Null Result
A properly powered test that fails to reach significance most likely means one of two things: the variant's true effect is smaller than your MDE, or the variant has no effect. (At 80% power there is also a residual 20% chance that an effect as large as your MDE was present and the test missed it.) These scenarios are not the same, and which one applies matters for your next decision.
If the confidence interval on your measured difference is narrow and centered near zero, the evidence is good that the variant is essentially equivalent to control. File the result, update your hypothesis about what works in this context, and move to the next test. If the confidence interval is wide — spanning meaningful improvements and meaningful reductions — the test was underpowered and the result is uninformative. The correct response is to re-run with an adequate sample size, not to conclude the variant failed.
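The distinction above can be turned into a simple reading rule. The sketch below computes a 95% confidence interval for the difference in conversion rates (unpooled normal approximation) and compares it with the MDE expressed in absolute terms. The function names, labels, and counts are hypothetical, and treating "interval inside ±MDE" as equivalence is a heuristic rather than a formal equivalence test.

```python
from statistics import NormalDist

def difference_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for rate_B - rate_A (unpooled normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

def read_result(conv_a, n_a, conv_b, n_b, relative_mde):
    """Classify a result by where its interval sits relative to zero and the MDE."""
    low, high = difference_ci(conv_a, n_a, conv_b, n_b)
    mde_abs = (conv_a / n_a) * relative_mde  # MDE in absolute percentage points
    if low > 0 or high < 0:
        return "significant difference"
    if -mde_abs < low and high < mde_abs:
        return "informative null: effect is smaller than the MDE"
    return "uninformative: interval too wide, re-run with more traffic"

# Two hypothetical tests with a similar observed lift and a 10% relative MDE.
print(read_result(6_400, 200_000, 6_420, 200_000, 0.10))  # large test
print(read_result(64, 2_000, 66, 2_000, 0.10))            # small test
```

Both tests show roughly the same observed rates, yet only the larger one supports a conclusion. The difference is entirely in the width of the interval.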
Null results have value that is systematically underweighted in most experimentation programs. A well-powered null result rules out the hypothesis that a change generates a lift above your MDE. That is information. Over time, a program that accurately records null results builds a meaningful map of what does not move the needle in your specific conversion context — which is often more durable and more valuable than a list of positive results that may not replicate.
Source
Johari, Ramesh, et al. 'Peeking at A/B Tests: Why It Matters and What to Do About It.' ACM SIGKDD (2017).
Deng, Alex, et al. 'Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.' ACM WSDM (2013).
Kohavi, Ron, et al. 'Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.' Cambridge University Press (2020).