evan miller sample size calculator - Aaron Graves, PhDude Replica

A/B Test Sample Size Calculator

Use this calculator to estimate how many users you need per variation before calling an A/B test. It takes the same four inputs as the approach popularized by Evan Miller: baseline conversion rate, minimum detectable effect, significance level, and power.

Baseline conversion rate (%)

Your current conversion rate (control), e.g., 10 for 10%.

Minimum detectable effect (relative lift, %)

The smallest relative improvement worth detecting, e.g., 15 means +15% lift.

Significance level alpha (%)

5% is standard (95% confidence).

Statistical power (%)

80% is common. Higher power requires more traffic.

Total daily visitors available for test (optional)

Used to estimate test runtime. Assumes a 50/50 split between control and variant.

What this Evan Miller sample size calculator is doing

When teams run A/B tests, the biggest mistake is often stopping too early. A result can look “promising” after a day or two, but still be mostly noise. This calculator helps you avoid that trap by estimating the minimum users needed in each variant before comparing outcomes.

In practical terms, this tool answers one question: “How much traffic do I need so I can trust my test result?”

Why sample size matters in experimentation

Too small a sample: low power, so real effects are often missed, and any lift that does reach significance tends to be overstated and unstable.
Right-sized sample: better confidence in real performance differences.
Too large a sample: slower decisions and a higher opportunity cost.

A good plan balances speed with reliability. That’s exactly what significance and power are for.

Inputs explained

1) Baseline conversion rate

Your current control performance. If your landing page converts 8 out of 100 visitors, baseline is 8%.

2) Minimum Detectable Effect (MDE)

The smallest lift that matters to your business. If baseline is 10% and MDE is 20%, your variant target is 12% (10% × 1.20).

3) Significance level (alpha)

Usually 5%. This controls Type I error (false positive risk).

4) Statistical power

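Usually 80% or 90%. Power is the probability of detecting a true effect at least as large as the MDE, so it controls Type II error (missing a real effect): 80% power leaves a 20% chance of missing it.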

Formula used (two-sided test, two proportions)

The calculator uses the standard normal approximation for two-proportion tests:

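n per group = ( z_(1-α/2) × √(2p̄(1-p̄)) + z_power × √(p₁(1-p₁) + p₂(1-p₂)) )² / (p₂ - p₁)²

Where:

p₁ = baseline conversion rate
p₂ = expected variant rate after MDE lift, i.e. p₁ × (1 + MDE)
p̄ = (p₁ + p₂) / 2
z_(1-α/2) = standard normal quantile for the two-sided significance level (about 1.96 for α = 5%)
z_power = standard normal quantile for the chosen power (about 0.84 for 80%, 1.28 for 90%)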
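
The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the calculator's actual source; the function name sample_size_per_group and its argument names are assumptions made for this example, and rates are passed as fractions (0.10 for 10%).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Users needed per variant for a two-sided, two-proportion z-test.

    baseline and relative_mde are fractions: 0.10 means 10%, 0.15 means +15%.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # variant rate after the relative lift
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and differ")

    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    # Round up: a fractional user still requires a whole extra user.
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    print(sample_size_per_group(0.10, 0.15))
```

Because the denominator is the squared absolute difference, halving the MDE roughly quadruples the required sample, which is why the MDE choice usually dominates the result.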

How to use this in the real world

Pick a realistic MDE based on business value, not wishful thinking.
Run your test until each variant reaches the estimated sample size.
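Avoid peeking at interim results and stopping as soon as one looks significant; every extra look adds another chance of a false positive. Avoid repeatedly stopping and restarting the test for the same reason.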
Check secondary metrics (revenue per visitor, retention, quality).
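
To see why peeking is risky, the sketch below simulates A/A tests (both variants share the same true conversion rate, so every "winner" is a false positive) and compares one final check against stopping at the first significant interim look. It is an illustrative simulation under assumed settings (10% rate, 1,000 users per group, 10 evenly spaced looks, 400 runs); the function names are hypothetical and the exact rates depend on the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in outcomes, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(rate=0.10, n_per_group=1000, looks=10,
                         simulations=400, alpha=0.05, seed=1):
    """Simulate A/A tests; return (fixed-horizon rate, peeking rate)."""
    rng = random.Random(seed)
    step = n_per_group // looks  # users added to each group between looks
    fixed_hits = peeking_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p_value = 1.0
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p_value = z_test_p_value(conv_a, look * step, conv_b, look * step)
            if p_value < alpha:
                significant_at_any_look = True  # a peeker would stop here
        fixed_hits += p_value < alpha           # only the final look counts
        peeking_hits += significant_at_any_look
    return fixed_hits / simulations, peeking_hits / simulations


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"fixed horizon: {fixed:.1%}  with peeking: {peeking:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher. If interim looks are unavoidable, a sequential testing method designed for them is the usual alternative.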

Example

Suppose baseline conversion is 10%, MDE is 15% (so the variant target is 11.5%), alpha is 5%, and power is 80%. The formula above gives roughly 6,700 users per variant, or about 13,400 in total. If your site gets limited traffic, either accept a larger MDE or run longer.
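
The trade-off between MDE and runtime can be made concrete with a short sketch. It assumes a hypothetical 1,000 total visitors per day split 50/50 across two variants; the helper names are made up for this example, and n_per_group is a compact restatement of the formula given earlier.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Two-sided, two-proportion sample size per variant (normal approximation)."""
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def runtime_days(users_per_group, daily_visitors, variants=2):
    """Whole days needed when traffic is split evenly across all variants."""
    return ceil(users_per_group * variants / daily_visitors)


if __name__ == "__main__":
    baseline = 0.10
    daily_visitors = 1000  # hypothetical traffic, not taken from the article
    for mde in (0.10, 0.15, 0.20, 0.30):
        n = n_per_group(baseline, baseline * (1 + mde))
        days = runtime_days(n, daily_visitors)
        print(f"MDE {mde:.0%}: {n} users per variant, about {days} days")
```

At this assumed traffic level the 15% MDE case takes about two weeks. Rounding the runtime up to whole weeks is a common way to cover weekday and weekend behavior evenly.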

Common pitfalls

Declaring winners after only a few hundred sessions.
Changing targeting rules mid-test.
Using post-hoc segments without correction.
Ignoring practical significance (a lift can be statistically significant yet too small to matter to the business).
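
For the post-hoc segment pitfall, one simple and conservative option is a Bonferroni correction: divide alpha by the number of segments examined. The sketch below is only an illustration of that idea; the segment names and p-values are invented, and other corrections (Holm, or false discovery rate control) may suit a given analysis better.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Map each segment to whether its p-value survives Bonferroni correction."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)  # stricter cutoff shared by every segment
    return {segment: p < threshold for segment, p in p_values.items()}


if __name__ == "__main__":
    # Made-up p-values for four segments inspected after the test ended.
    segment_p_values = {
        "mobile": 0.030,
        "desktop": 0.400,
        "tablet": 0.008,
        "returning": 0.045,
    }
    for segment, keep in bonferroni_significant(segment_p_values).items():
        print(f"{segment}: {'significant' if keep else 'not significant'}")
```

With four segments the cutoff drops from 0.05 to 0.0125, so three of the four segments that would pass an uncorrected 5% check no longer count as significant.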

Final note

This calculator is a planning tool, not a guarantee. Real experiments can have seasonality, novelty effects, and tracking issues. Use this estimate to set expectations, then run disciplined tests and validate implementation quality before shipping changes.