What is A/B Testing?

A/B testing (also called split testing) is a controlled experiment that compares two versions of a webpage, email, ad, or other asset to determine which performs better. Users are randomly assigned to either version A (control) or version B (treatment), and their behavior is measured to assess the impact of the changes.

The statistical test used here is a two-proportion z-test, which determines whether the difference in conversion rates between the two variants is statistically significant or could plausibly have occurred by random chance. The test calculates a z-score, which measures how many standard errors the observed difference is from zero, and a p-value, which is the probability of observing a difference at least this large if there were truly no effect.

A result is typically considered statistically significant if the p-value is less than 0.05 (5% significance level). This means that, if there were truly no difference between the variants, a difference at least this large would appear less than 5% of the time through random variation alone. It does not mean there is a 95% chance that variant B is better, but it is evidence that variant B genuinely performs differently than variant A.
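The test described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the calculator's actual implementation; the function name and argument names are assumptions made for this example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test.

    Hypothetical helper for illustration: conv_* are conversion counts,
    n_* are visitor counts for each variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion: the shared conversion rate assumed under "no effect"
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail area of the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Call it with the visitor and conversion counts for each variant, for example two_proportion_z_test(425, 5000, 475, 5000), and compare the returned p-value against your significance level.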

How to Use This Calculator

1. Enter Variant A Data

Input the number of visitors and conversions for your control group (Variant A). This is typically your original or existing version.

2. Enter Variant B Data

Input the number of visitors and conversions for your treatment group (Variant B). This is your new or modified version being tested.

3. Interpret Results

Review the uplift percentage, statistical significance, and confidence interval. If the p-value is below 0.05, the difference is statistically significant at the 5% level (often described as 95% confidence).

Important Considerations:

Ensure visitors are randomly assigned to variants to avoid bias
Run tests long enough to reach adequate sample size (typically 100+ conversions per variant)
Account for day-of-week and time-of-day effects by running full weeks
Don't stop tests early just because results look good (increases false positives)
Consider practical significance alongside statistical significance
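For the random-assignment point above, one common approach is to bucket visitors by hashing a stable identifier, so each visitor always sees the same variant. The sketch below is an assumption-laden illustration, not how any particular testing tool works: the function name, the experiment label, and the 50/50 split are all hypothetical, and hashing gives a deterministic pseudo-random split rather than true randomness.

```python
import hashlib

def assign_variant(user_id, experiment="checkout-button-test"):
    """Deterministically assign a visitor to 'A' or 'B' (hypothetical helper).

    Including the experiment name in the hash keeps assignments in one
    experiment independent of assignments in another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return "A" if bucket < 50 else "B"
```

Because the assignment depends only on the visitor ID and the experiment name, a returning visitor lands in the same variant on every visit.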

Worked Example

Testing a new call-to-action button:

Scenario:

An e-commerce site tests a new checkout button color. After running the test for 2 weeks:

Variant A (Green button): 5,000 visitors, 425 conversions

Variant B (Orange button): 5,000 visitors, 475 conversions

Calculations:

p₁ = 425 / 5000 = 0.0850 (8.50%)

p₂ = 475 / 5000 = 0.0950 (9.50%)

Pooled proportion = (425 + 475) / 10000 = 0.0900

SE = √[0.09 × 0.91 × (1/5000 + 1/5000)] = 0.005724

z = (0.095 - 0.085) / 0.005724 = 1.747

p-value (2-tailed) = 0.0806

Conclusion:

Relative uplift: +11.76% (Variant B vs A)

Since p-value (0.0806) > 0.05, the result is not statistically significant at the 95% confidence level. While Variant B shows a conversion rate 1 percentage point higher (9.5% vs 8.5%), we cannot confidently say this improvement is real and not due to random chance.

Recommendation: Continue the test with more traffic or consider the practical significance. An 11.76% uplift could be valuable even without statistical certainty, depending on your risk tolerance.

Frequently Asked Questions

What sample size do I need?

As a rule of thumb, aim for at least 100 conversions per variant for reliable results. Use a sample size calculator before launching your test to determine the traffic needed based on your baseline conversion rate and minimum detectable effect.
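If you want to estimate the required traffic yourself, the standard normal-approximation formula for comparing two proportions can be sketched as follows. This is an illustrative calculation under stated assumptions (two-sided test, equal traffic per variant), not the method behind any specific sample size calculator; the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (hypothetical helper).

    baseline_rate: control conversion rate, e.g. 0.05 for 5%
    relative_lift: minimum detectable effect, e.g. 0.10 for +10% relative
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

For a 5% baseline and a +10% relative lift, this sketch gives roughly 31K visitors per variant. That is in line with the ~62K shown in the sample size table on this page if the table's figures are read as totals across both variants.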

How long should I run my test?

Run tests for at least 1-2 full weeks to account for day-of-week effects. Continue until you reach your pre-calculated sample size. Don't stop early because results look good, since stopping at the first significant reading inflates the false positive rate.
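To see why early stopping is a problem, you can simulate A/A tests (both variants share the same true conversion rate) and compare two policies: checking once at the end, versus checking at several interim looks and stopping at the first significant result. The simulation below is a rough sketch; the traffic numbers, the number of looks, and all function names are assumptions chosen for illustration.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-score (0.0 if there is no variation yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0, 1):
        return 0.0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se

def simulate_peeking(n_sims=500, n_per_arm=1000, looks=10, rate=0.10,
                     z_crit=1.96, seed=0):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        rejected_at_any_look = False
        significant = False
        for look in range(1, looks + 1):
            # Both arms draw from the same true rate, so any "win" is a false positive
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
            rejected_at_any_look = rejected_at_any_look or significant
        peek_hits += rejected_at_any_look
        fixed_hits += significant  # only the final look counts here
    return peek_hits / n_sims, fixed_hits / n_sims
```

Running simulate_peeking() returns the two false positive rates side by side. The peeking rate can never be lower than the single-look rate, because the final look is one of the looks, and in practice it tends to be noticeably higher.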

What does the p-value mean?

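The p-value is the probability of seeing results at least this extreme if there were truly no difference between variants. A p-value < 0.05 means that, with no real difference, results like these would occur less than 5% of the time. It is not the probability that the results are due to randomness.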

Can I test more than 2 variants?

Yes, but you'll need a different test, such as a chi-square test across all variants for conversion rates (or ANOVA for continuous metrics like revenue per visitor). Testing multiple variants simultaneously requires more traffic and increases the risk of false positives. Consider a Bonferroni correction for multiple comparisons.
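The Bonferroni correction itself is simple: divide your significance level by the number of comparisons and hold each comparison to that stricter threshold. A minimal sketch, with a hypothetical function name and made-up p-values used purely for illustration:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons stay significant after Bonferroni correction.

    p_values: one p-value per variant-vs-control comparison (hypothetical input).
    """
    threshold = alpha / len(p_values)  # stricter per-comparison threshold
    return [p < threshold for p in p_values]

# Illustrative p-values for three treatment variants compared against control
flags = bonferroni_significant([0.01, 0.03, 0.20])
```

With three comparisons the per-comparison threshold becomes 0.05 / 3, or about 0.0167, so in this made-up example only the first variant remains significant.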

What's the confidence interval?

The 95% CI shows the range where the true difference likely lies. If it doesn't include zero, the effect is significant. A narrow CI indicates more precise estimates than a wide one.
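As a rough illustration, a confidence interval for the difference in conversion rates can be computed with the normal (Wald) approximation. This is one common method and is an assumption here, since the page does not say which interval the calculator uses; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate B - rate A); hypothetical helper."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each variant keeps its own observed rate
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se
```

For the worked example's counts (425 of 5,000 vs 475 of 5,000), this gives an interval of roughly -0.1 to +2.1 percentage points. The interval includes zero, which is consistent with the non-significant result. Because this interval uses an unpooled standard error while the z-test pools the two rates, the two can occasionally disagree for results right at the threshold.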

When is a result 'significant enough'?

Statistical significance (p < 0.05) is one criterion. Also consider practical significance: Is the uplift large enough to matter? Does the ROI justify implementation costs?

What if my results are borderline?

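P-values near 0.05 (like 0.048 or 0.053) represent nearly the same strength of evidence and shouldn't be treated as definitively different outcomes. Consider continuing the test until it reaches its planned sample size, or making the decision based on business context. Avoid changing your significance threshold after you have seen the results.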

How do I avoid false positives?

Pre-define your sample size and don't peek at results repeatedly. Use appropriate significance levels (0.05 is standard). Test one primary metric per experiment. Consider Bayesian methods for continuous monitoring.

Minimum Sample Sizes for Significance (80% Power, α=0.05)

Baseline CR	+10% Relative Lift	+25% Relative Lift	+50% Relative Lift
2%	~157K	~25K	~6K
5%	~62K	~10K	~2.5K
10%	~31K	~5K	~1.2K
25%	~12.5K	~2K	~500