Perform a two-proportion z-test to determine whether the difference in conversion rates between two variants is statistically significant. Optimize your experiments with confidence.
Last updated: March 2026
| Baseline CR | +10% Relative Lift | +25% Relative Lift | +50% Relative Lift |
|---|---|---|---|
| 2% CR | ~157K | ~25K | ~6K |
| 5% CR | ~62K | ~10K | ~2.5K |
| 10% CR | ~31K | ~5K | ~1.2K |
| 25% CR | ~12.5K | ~2K | ~500 |
CR = Conversion Rate. Sample size shown is for EACH variant. For the same relative lift, higher baseline conversion rates require smaller samples. Use an online A/B test calculator for precise power analysis.
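If you want a rough estimate without a calculator, the standard normal-approximation formula for comparing two proportions can be sketched as below. The function name and the defaults (two-sided test, 5% significance level, 80% power) are assumptions made for illustration. The table above does not state which significance level or power it uses, so the numbers this sketch produces may differ from it.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_cr, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variant (two-sided z-test).

    alpha and power are assumed defaults, not values taken from the table.
    """
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)  # conversion rate you hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)  # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return ceil(n)


# Example: 2% baseline, hoping to detect a +10% relative lift (2.0% -> 2.2%)
n_small_lift = sample_size_per_variant(0.02, 0.10)
# A larger lift on the same baseline needs far fewer visitors
n_large_lift = sample_size_per_variant(0.02, 0.50)
print(n_small_lift, n_large_lift)
```

Raising the power or lowering the significance level increases the required sample, which is one reason published tables for the same baseline and lift can disagree.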
A/B testing (also called split testing) is a controlled experiment that compares two versions of a webpage, email, ad, or other asset to determine which performs better. Users are randomly assigned to either version A (control) or version B (treatment), and their behavior is measured to assess the impact of the changes.
The statistical test used here is a two-proportion z-test, which determines whether the difference in conversion rates between the two variants is statistically significant or could have occurred by random chance. The test calculates a z-score, which measures how many standard errors the observed difference is from zero, and a p-value, which represents the probability of observing a difference at least this large if there were truly no effect.
A result is typically considered statistically significant if the p-value is less than 0.05 (5% significance level). This means that, if there were truly no difference between the variants, a result at least this extreme would occur less than 5% of the time, which is evidence that variant B genuinely performs differently from variant A.
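The z-score and p-value described above can be computed in a few lines. This is a minimal sketch, assuming a pooled standard error and a two-sided test; this page does not spell out which variant of the formula the calculator uses, and the function name and visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, visitors_a, conv_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled rate: the best estimate of the shared rate if there is no true effect
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error  # standard errors away from zero
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value


# Hypothetical data: 200/4000 conversions for A, 260/4000 for B
z, p_value = two_proportion_z_test(200, 4000, 260, 4000)
print(f"z = {z:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```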
Input the number of visitors and conversions for your control group (Variant A). This is typically your original or existing version.
Input the number of visitors and conversions for your treatment group (Variant B). This is your new or modified version being tested.
Review the uplift percentage, statistical significance, and confidence interval. If p-value < 0.05, the difference is significant at the 95% confidence level.
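To see where the uplift and confidence interval in the results come from, here is a small sketch. It assumes the uplift is reported relative to the control rate and that the interval is a normal-approximation (Wald) interval on the absolute difference; the calculator may use a different interval method. The function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def uplift_and_confidence_interval(conv_a, visitors_a, conv_b, visitors_b,
                                   confidence=0.95):
    """Return (relative uplift, (low, high)) for the difference B - A."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    diff = rate_b - rate_a
    relative_uplift = diff / rate_a  # change relative to the control rate
    # Unpooled standard error: each variant keeps its own variance
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_a
                     + rate_b * (1 - rate_b) / visitors_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return relative_uplift, (diff - z_crit * std_error, diff + z_crit * std_error)


# Hypothetical data: 425/5000 (8.5%) for A, 475/5000 (9.5%) for B
uplift, (ci_low, ci_high) = uplift_and_confidence_interval(425, 5000, 475, 5000)
print(f"uplift = {uplift:.2%}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

With these made-up counts the interval spans zero, so the difference would not count as significant at the 5% level even though the point estimate favors B.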
Testing a new call-to-action button:
Since the p-value (0.0816) is greater than 0.05, the result is not statistically significant at the 95% confidence level. While Variant B shows a conversion rate 1 percentage point higher (9.5% vs 8.5%), we cannot confidently say this improvement is real and not due to random chance.
Recommendation: Continue the test with more traffic or consider the practical significance. An 11.76% relative uplift could be valuable even without statistical certainty, depending on your risk tolerance.
As a rule of thumb, aim for at least 100 conversions per variant for reliable results. Use a sample size calculator before launching your test to determine the traffic needed based on your baseline conversion rate and minimum detectable effect.
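Run tests for at least 1-2 full weeks to account for day-of-week effects. Continue until you reach your pre-calculated sample size. Don't stop early because results look good: every extra look at the data is another chance for random noise to cross the significance threshold, which inflates the false positive rate.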
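A quick simulation makes the cost of stopping early concrete. The sketch below runs A/A tests (both variants share the same true conversion rate, so every "significant" result is a false positive) and compares checking after every batch of visitors with checking once at the end. All parameters (true rate, number of looks, batch size, number of runs, seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):  # no variation yet, nothing to test
        return False
    std_error = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / std_error
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(true_rate=0.10, looks=10, visitors_per_look=200,
                      runs=300, seed=1):
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _look in range(looks):
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            n += visitors_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            significant_at_any_look = significant_at_any_look or significant_now
        flagged_with_peeking += significant_at_any_look
        flagged_at_end += significant_now  # result of the final look only
    return flagged_with_peeking / runs, flagged_at_end / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single check at the end:        {fixed_rate:.1%}")
```

The single final check should land near the nominal 5%, while stopping at the first significant look tends to produce a noticeably higher rate. Exact figures depend on the parameters and the random seed.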
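The p-value is the probability of seeing results at least this extreme if there were truly no difference between variants. A p-value below 0.05 means such results would occur less than 5% of the time when there is no real difference. It is not the probability that your result is due to chance.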
You can test more than two variants, but you'll need a different test (such as a chi-square test or ANOVA). Testing multiple variants simultaneously requires more traffic and increases the risk of false positives. Consider a Bonferroni correction for multiple comparisons.
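A Bonferroni correction is simple to apply by hand: divide your significance level by the number of comparisons and judge each p-value against that stricter threshold. Below is a minimal sketch, assuming each treatment variant is compared against the control separately; the function name and p-values are made up for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each comparison as significant under a Bonferroni-adjusted threshold."""
    adjusted_alpha = alpha / len(p_values)  # stricter threshold per comparison
    return [p < adjusted_alpha for p in p_values]


# Hypothetical p-values for variants B, C, and D, each compared with control A
p_values = [0.030, 0.012, 0.200]
flags = bonferroni_significant(p_values)
print(flags)  # the threshold is 0.05 / 3, so only the second comparison passes
```

Note that 0.030 would pass an uncorrected 0.05 threshold but fails here, which is the price of keeping the overall false positive risk near 5%.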
The 95% CI shows the range where the true difference likely lies. If it doesn't include zero, the effect is significant at the 5% level. A narrow CI indicates a more precise estimate than a wide one.
Statistical significance (p < 0.05) is one criterion. Also consider practical significance: Is the uplift large enough to matter? Does the ROI justify implementation costs?
P-values near 0.05 (like 0.048 or 0.053) represent nearly the same strength of evidence and shouldn't be treated as meaningfully different from one another. Consider continuing the test, increasing your significance threshold, or making decisions based on business context.
Pre-define your sample size and don't peek at results repeatedly. Use appropriate significance levels (0.05 is standard). Test one primary metric per experiment. Consider Bayesian methods for continuous monitoring.