A/B Test Sample Size Calculator

Use this calculator to estimate how many users you need in each variant before you start your experiment.

Assumes a two-sided test with equal split between control and variant.

Why sample size matters in A/B testing

If your A/B test does not have enough users, your result is mostly noise. You may “see” a winner that disappears later, or you may miss a meaningful improvement because the test was underpowered. A proper sample size calculation gives you a realistic target before launch.

In practice, sample size planning protects you from two expensive mistakes: shipping false wins and rejecting real improvements. This is especially important for conversion rate tests on checkout pages, pricing pages, and signup flows where every decision has revenue impact.

How this A/B test sample size calculator works

This calculator performs a power analysis for two proportions (control vs. treatment). You provide your current conversion rate, the minimum effect worth detecting, and your preferred confidence and power settings.

Inputs explained

  • Baseline conversion rate: Your current conversion probability (for example, 8.5%).
  • MDE (Minimum Detectable Effect): The smallest improvement (or change) that is meaningful for the business.
  • Significance level (α): Probability of false positive (commonly 5%).
  • Power: Probability of detecting a true effect of the chosen size (commonly 80% or 90%).
  • Daily visitors + traffic allocation: Used to estimate runtime after sample size is computed.
For most product experiments, α = 5% and power = 80% is a practical default. Use higher power if the decision is high-stakes.

The formula behind the calculator

The required sample size per variant is based on a normal approximation for a two-sample proportion test:

n = ((z1-α/2 √(2p̄(1-p̄)) + zpower √(p1(1-p1) + p2(1-p2)))²) / (p2 - p1)²

where p1 is baseline conversion, p2 is expected conversion after the MDE, and p̄ is the average of p1 and p2.
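The formula above can be sketched in Python using only the standard library. The function name and defaults are illustrative, not the calculator's actual code:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users required per variant for a two-sided two-proportion z-test.

    p1: baseline conversion rate, p2: target rate after the MDE.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-α/2}, e.g. 1.96 for α = 5%
    z_power = NormalDist().inv_cdf(power)           # z_{power}, e.g. 0.84 for 80%
    p_bar = (p1 + p2) / 2                           # pooled (average) rate
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)
```

For a baseline of 10% and a target of 12% at the default settings, this returns 3,841 users per variant.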

Example: quick planning scenario

Suppose your checkout conversion rate is 10%, and you care about detecting at least a +2 percentage point absolute lift (to 12%).

  • Baseline: 10%
  • MDE: +2 pp
  • α: 5%
  • Power: 80%

The calculator returns the required sample size per group and the total sample size. If you receive 5,000 daily visitors and route all traffic into the test (2,500 per variant), it also estimates how many days the test needs to run.
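The runtime estimate is simple division. A minimal sketch (the function name is illustrative; plugging the scenario above into the formula yields roughly 3,841 users per variant):

```python
from math import ceil

def days_to_run(n_per_variant: int, daily_visitors: int,
                allocation: float = 1.0, variants: int = 2) -> int:
    """Days until every variant reaches its required sample size,
    assuming traffic is split evenly across the variants."""
    daily_per_variant = daily_visitors * allocation / variants
    return ceil(n_per_variant / daily_per_variant)
```

With 3,841 users needed per variant, 5,000 daily visitors, and all traffic allocated to a two-variant test, each arm receives 2,500 visitors per day, so the test finishes in about 2 days.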

Common mistakes when choosing sample size

1) Choosing an unrealistic MDE

If your MDE is too small, the required sample size explodes and tests take forever. If it is too large, you miss improvements that actually matter. Tie the MDE to business value.

2) Peeking and stopping early

Continuously checking p-values and stopping at the first “significant” result increases false positives. Decide your stopping rule in advance.

3) Ignoring test duration effects

A test should run through natural cycles (weekday/weekend behavior, campaign shifts, seasonality). Even if you hit sample size quickly, ensure your runtime covers representative traffic patterns.

4) Testing too many variants with the same traffic

More variants split traffic thinner, which increases the time needed to reach the required sample size in each arm. Plan your traffic budget accordingly.

Practical recommendations

  • Use historical conversion rate data from the same funnel step.
  • Set MDE based on minimum profitable lift, not wishful thinking.
  • Document your α, power, and stopping criteria before launch.
  • Run quality checks for instrumentation and event tracking before collecting data.
  • Interpret statistical significance alongside practical impact.

FAQ

Should I use absolute or relative MDE?

If stakeholders think in percentage points (e.g., from 5% to 6%), use absolute. If goals are framed as proportional lift (e.g., +15%), use relative.
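Converting between the two framings is a one-line calculation. A hypothetical helper (not part of the calculator):

```python
def absolute_target_from_relative(baseline: float, relative_lift: float) -> float:
    """Convert a relative MDE (e.g. +15%) into the absolute target rate.

    baseline=0.05, relative_lift=0.15 -> target rate of 0.0575 (+0.75 pp).
    """
    return baseline * (1 + relative_lift)
```

The same relative lift implies a smaller absolute change at low baselines, which is why low-conversion funnels often need much larger samples.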

What is a good default power level?

80% is standard and often sufficient. For expensive product decisions, 90% can be justified, with larger sample requirements.

Does this work for revenue metrics?

This calculator is designed for binary conversion outcomes (converted vs. not converted). Revenue per user generally needs a different model or variance-based approach.
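For a continuous metric such as revenue per user, a standard normal-approximation formula for comparing two means can be sketched as follows. This is a generic textbook approach, not part of this calculator, and it requires an estimate of the metric's standard deviation (revenue data is often heavy-tailed, so treat the result as a rough lower bound):

```python
from math import ceil
from statistics import NormalDist

def sample_size_means(sigma: float, delta: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size to detect a difference in means of delta,
    given an outcome standard deviation sigma (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * sigma ** 2 * (z_alpha + z_power) ** 2 / delta ** 2)
```

For example, detecting a $5 difference in average order value with a standard deviation of $50 requires roughly 1,570 users per group at the default settings.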

Can I use this for one-tailed tests?

This implementation assumes a two-tailed test, which is safer in most product experimentation contexts.
