Why an A/B test sample size calculator matters
Running an A/B test without enough users is one of the fastest ways to make confident decisions that are completely wrong. If your experiment is underpowered, you may miss meaningful improvements. If you stop too early, random noise can look like a win.
An A/B test sample size calculator helps you answer a simple but critical question: How many users do I need in each variant before trusting the result?
What this calculator does
This tool estimates required sample size for a conversion-rate experiment where:
- Variant A is the control with a known baseline conversion rate.
- Variant B is expected to improve conversion by a minimum detectable uplift (MDE).
- You choose confidence level, power, test type (one-sided or two-sided), and traffic split.
It then returns the recommended number of users per group and, if you provide daily traffic, an estimated test duration.
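The core of such a calculator is the standard two-proportion z-test sample size formula. A minimal sketch (the function name sample_size_per_group is illustrative, not part of any specific tool):

```python
from statistics import NormalDist
from math import ceil

def sample_size_per_group(baseline, relative_mde, alpha=0.05,
                          power=0.80, two_sided=True):
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)        # target rate implied by the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 if two_sided else 1))
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # sum of Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# 5% baseline, 10% relative uplift, 95% confidence, 80% power
print(sample_size_per_group(0.05, 0.10))  # roughly 31k users per group
```

Different calculators use slightly different approximations (pooled vs. unpooled variance, continuity corrections), so expect small discrepancies between tools.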
Input definitions (plain English)
Baseline conversion rate
Your current conversion rate before launching the test. This is your starting point for estimating variance and expected signal.
Minimum detectable uplift (MDE)
The smallest relative improvement worth detecting. If your baseline is 5% and the relative MDE is 10%, the target rate for B is 5.5%.
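Relative and absolute uplifts are easy to mix up, so it helps to make the conversion explicit (a tiny illustrative helper, not part of the calculator):

```python
def absolute_target(baseline, relative_mde):
    """Convert a relative MDE into the absolute target rate for variant B."""
    return baseline * (1 + relative_mde)

# 5% baseline with a 10% *relative* MDE targets 5.5%, not 15%
print(f"{absolute_target(0.05, 0.10):.3f}")  # 0.055
```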
Confidence level
How strict you are about false positives (Type I error). A 95% confidence level corresponds to a 5% significance level.
Power
The probability your test detects the effect when the effect is real. 80% is common, but 90% is safer when the decision is expensive.
Traffic split
If you send unequal traffic (e.g., 70/30), the smaller group becomes the bottleneck: the test must run until that group reaches the required size. Equal splits are generally the most sample-efficient.
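The effect of the split on run time is simple arithmetic: the test finishes when the smaller allocation reaches the required size. A sketch (the 31,231-user figure is just an illustrative requirement; days_to_complete is a hypothetical helper):

```python
from math import ceil

def days_to_complete(n_per_group, daily_visitors, split_a=0.5):
    """Days until BOTH groups reach n_per_group users.
    The smaller allocation is the bottleneck."""
    smaller_share = min(split_a, 1 - split_a)
    return ceil(n_per_group / (daily_visitors * smaller_share))

# 31,231 users per group, 5,000 visitors per day
print(days_to_complete(31_231, 5_000, split_a=0.5))  # 13 days at 50/50
print(days_to_complete(31_231, 5_000, split_a=0.7))  # 21 days: the 30% arm is the bottleneck
```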
How to use this in a real experiment workflow
- Pick one primary metric (e.g., purchase conversion).
- Set MDE based on business value, not wishful thinking.
- Pre-commit run length before seeing outcomes.
- Avoid peeking every hour and stopping on a temporary spike.
- Check data quality: bot traffic, instrumentation errors, and sample ratio mismatch.
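Sample ratio mismatch (SRM) from the last step is easy to check numerically: if a 50/50 test collects visibly unequal groups, the assignment mechanism may be broken. One common check uses a normal approximation to the binomial split (srm_pvalue is an illustrative helper; a chi-square test is an equivalent alternative):

```python
from statistics import NormalDist

def srm_pvalue(n_a, n_b, expected_share_a=0.5):
    """Two-sided p-value for sample ratio mismatch, using a normal
    approximation to the binomial split of users into group A."""
    n = n_a + n_b
    expected_a = n * expected_share_a
    std = (n * expected_share_a * (1 - expected_share_a)) ** 0.5
    z = (n_a - expected_a) / std
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A 50/50 test that actually collected 10,000 vs 10,400 users
p = srm_pvalue(10_000, 10_400)
print(f"{p:.4f}")  # well below 0.05: investigate before trusting results
```

A tiny p-value here means the imbalance is very unlikely under correct randomization, so debug instrumentation before interpreting the metric.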
Practical interpretation tips
If the required sample size looks huge, that is not a failure of the calculator. It usually means one of three things: your baseline is low, the effect you want to detect is tiny, or your confidence/power thresholds are strict. In these cases, increase traffic, accept a larger MDE, or redesign the experiment for stronger signal.
Also remember: statistical significance is not business significance. A tiny but significant lift may still be meaningless after engineering and rollout costs.
Common mistakes this tool helps prevent
- Launching tests with arbitrary durations like “one week” regardless of traffic.
- Declaring winners from early fluctuations.
- Using very small MDE values without enough traffic capacity.
- Ignoring traffic allocation impact on run time.
Frequently asked questions
Should I use one-sided or two-sided tests?
Use two-sided unless you are certain negative effects are irrelevant and you documented that decision before the test starts.
Can I use this for click-through rate and signup rate?
Yes. Any binary conversion outcome (clicked vs. not clicked, subscribed vs. not subscribed, purchased vs. not purchased) fits this setup.
What if my conversion rate is extremely low?
For very rare events, normal approximations may become less stable. Consider exact methods, Bayesian approaches, or simulation-based planning.
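For rare events, simulation-based planning means drawing many synthetic experiments and counting how often the test would reject. A minimal sketch, assuming NumPy is available (simulated_power is a hypothetical helper, and the z-test inside it is one choice among several):

```python
import numpy as np
from statistics import NormalDist

def simulated_power(p_control, p_variant, n_per_group,
                    alpha=0.05, trials=20_000, seed=7):
    """Estimate power by Monte Carlo: simulate many experiments and count
    how often a two-sided two-proportion z-test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    a = rng.binomial(n_per_group, p_control, trials)
    b = rng.binomial(n_per_group, p_variant, trials)
    pa, pb = a / n_per_group, b / n_per_group
    pooled = (a + b) / (2 * n_per_group)
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    se = np.maximum(se, 1e-12)          # guard against all-zero draws
    return float(np.mean(np.abs(pb - pa) / se > z_crit))

# Rare event: 0.2% baseline, 50% relative uplift, 100k users per arm
print(simulated_power(0.002, 0.003, 100_000))
```

If the simulated power falls short of your target, increase the sample size and rerun; the same loop also lets you test exact or Bayesian decision rules that have no closed-form sample size formula.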
Bottom line
A good A/B test is not just a variant and a headline; it is a statistical commitment. Use sample size planning before launch, run to completion, and combine statistical evidence with business context to make better product decisions.