
A/B Test Gave Wrong Answer

The Interview Question

"We ran an A/B test on a new checkout flow. The test showed a 5% conversion increase with 95% confidence. We rolled out to 100% of users, but actual conversion dropped 3%. What went wrong?"

Asked at: Google, Meta, Netflix, Booking.com, any data-driven company

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐⭐ (Senior, with stats knowledge)


Clarifying Questions to Ask

  1. "What was the sample size?" → Too small = noisy results
  2. "How long did the test run?" → Weekday vs weekend effects
  3. "Was it truly random assignment?" → Selection bias?
  4. "What metric exactly?" → Primary vs vanity metrics
  5. "Were there multiple variants?" → Multiple comparison problem
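To answer the first question concretely, the required sample size can be estimated with a standard power calculation for comparing two proportions. A minimal sketch (the 10% baseline rate and 0.5-point minimum detectable effect are illustrative assumptions, not from the case above):

```python
from scipy.stats import norm

def required_sample_size(baseline_rate, min_detectable_effect,
                         alpha=0.05, power=0.8):
    """
    Per-group sample size for a two-sided, two-proportion z-test.
    min_detectable_effect is absolute (e.g. 0.005 = 0.5 points).
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect

    z_alpha = norm.ppf(1 - alpha / 2)  # critical value, two-sided
    z_power = norm.ppf(power)          # critical value for power

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / min_detectable_effect ** 2
    return int(n) + 1

# Illustrative: 10% baseline, detect a 0.5-point absolute lift
n = required_sample_size(0.10, 0.005)
# ≈ 58,000 users per group - small tests can't see small effects
```

Note how quickly the requirement grows: the effect size appears squared in the denominator, so halving the detectable lift quadruples the needed sample.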

Common Causes of False Positives

Cause 1: Sample Ratio Mismatch (SRM)

Problem: Groups weren't actually 50/50 due to implementation bugs.

# Detection
from scipy import stats

def check_sample_ratio(control_size, treatment_size, expected_ratio=0.5):
    """
    Chi-squared test for sample ratio mismatch.
    If p < 0.01, there's likely a bug in randomization.
    """
    total = control_size + treatment_size
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)

    chi2, p_value = stats.chisquare(
        [control_size, treatment_size],
        [expected_control, expected_treatment]
    )

    return {
        'actual_ratio': control_size / total,
        'chi2': chi2,
        'p_value': p_value,
        'has_srm': p_value < 0.01
    }

# Example
result = check_sample_ratio(
    control_size=48000,    # Expected: 50000
    treatment_size=52000   # Expected: 50000
)
# Output: has_srm=True, p_value far below 0.01 (here ~1e-36)
# ⚠️ SRM detected - randomization is broken!

Common SRM causes:

  • Bots filtered differently per variant
  • Redirects causing user loss
  • Different page load times (users abandon before tracking)
  • Cookie-based assignment on cookie-less users

Cause 2: Peeking Problem (Early Stopping)

Problem: Checking results repeatedly and stopping when significant.

Day 1: p = 0.42 (not significant)
Day 2: p = 0.23 (not significant)
Day 3: p = 0.08 (almost!)
Day 4: p = 0.04 (significant! ship it!) ← WRONG!

Why it's wrong: With repeated checking, false positive rate balloons.

import numpy as np
from scipy import stats

def simulate_peeking_false_positive_rate(
    n_simulations=10000,
    check_days=(7, 14, 21, 28),
    daily_samples=1000
):
    """
    Simulate the false positive rate when peeking at p-values
    at several interim checkpoints, stopping at the first
    "significant" result. True effect is 0 (no difference).
    """
    false_positives = 0

    for _ in range(n_simulations):
        control = []
        treatment = []

        for day in range(max(check_days)):
            # No true effect - both from same distribution
            control.extend(np.random.normal(0, 1, daily_samples))
            treatment.extend(np.random.normal(0, 1, daily_samples))

            if day + 1 in check_days:
                _, p_value = stats.ttest_ind(control, treatment)
                if p_value < 0.05:
                    false_positives += 1
                    break  # Stopped early due to "significance"

    return false_positives / n_simulations

# Result: ~13% false positive rate with these four checks - and
# well over 20% with daily peeking - instead of the nominal 5%!

Solution: Use sequential testing or commit to fixed sample size.

# Sequential testing with Bayesian approach
import numpy as np
from scipy.stats import beta

def bayesian_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total,
                     threshold=0.95):
    """
    Bayesian A/B test with a Beta-Bernoulli model. Far more robust
    to continuous monitoring than repeated t-tests (though optional
    stopping can still bias it somewhat).
    """
    # Posterior distributions (uniform Beta(1, 1) prior)
    control_alpha = control_conversions + 1
    control_beta = control_total - control_conversions + 1

    treatment_alpha = treatment_conversions + 1
    treatment_beta = treatment_total - treatment_conversions + 1

    # Monte Carlo samples from each posterior
    n_samples = 100000
    control_samples = beta.rvs(control_alpha, control_beta, size=n_samples)
    treatment_samples = beta.rvs(treatment_alpha, treatment_beta, size=n_samples)

    prob_treatment_better = np.mean(treatment_samples > control_samples)
    expected_lift = np.mean(treatment_samples - control_samples)

    return {
        'prob_treatment_better': prob_treatment_better,
        'expected_lift': expected_lift,
        'significant': (prob_treatment_better > threshold
                        or prob_treatment_better < (1 - threshold))
    }

Cause 3: Novelty Effect

Problem: Users interact more with new UI just because it's new.

Test Week 1: +8% engagement
Test Week 2: +5% engagement
Test Week 3: +2% engagement
After launch: -1% engagement ← Novelty wore off

Solution: Run test longer, segment by new vs returning users.

import numpy as np
from sklearn.linear_model import LinearRegression

def analyze_novelty_effect(df):
    """
    Check if the treatment effect decays over time - a sign of novelty.
    Expects columns: date, variant ('control'/'treatment'), converted (0/1).
    """
    # Daily conversion rate per variant
    daily = df.groupby(['date', 'variant'])['converted'].mean().unstack()

    # Relative lift of treatment over control, per day
    daily['lift'] = (daily['treatment'] - daily['control']) / daily['control']

    # Fit a linear trend to the daily lift to check for decay
    X = np.arange(len(daily)).reshape(-1, 1)
    y = daily['lift'].values

    model = LinearRegression().fit(X, y)
    slope = model.coef_[0]

    if slope < -0.01:  # Negative slope = decaying effect
        print(f"⚠️ Novelty effect detected. Lift decaying at {slope:.2%} per day")
        return True
    return False

Cause 4: Survivor Bias

Problem: Test only measured users who completed the flow, not those who gave up.

Control: 100 users start → 80 complete checkout → 40 buy (50% conversion)
Treatment: 100 users start → 60 complete checkout → 36 buy (60% conversion!)

BUT: Treatment actually lost 20 more users before checkout!
Real conversion: Control 40%, Treatment 36%

Solution: Measure from intent, not from completion.

def calculate_intent_to_purchase(variant_data):
    """
    Calculate conversion from session start, not checkout page load.
    """
    # Wrong: conversion = purchases / checkout_page_views
    wrong_conversion = variant_data['purchases'] / variant_data['checkout_views']

    # Right: conversion = purchases / product_page_views
    correct_conversion = variant_data['purchases'] / variant_data['product_views']

    return {
        'checkout_conversion': wrong_conversion,  # Misleading
        'true_conversion': correct_conversion     # Accurate
    }

Cause 5: Segment Simpson's Paradox

Problem: Overall effect positive, but negative for every segment.

            Mobile Users     Desktop Users    Overall
Control     5%  (n=8,000)    15% (n=2,000)    7%
Treatment   4%  (n=2,000)    14% (n=8,000)    12% ← Higher!

But: Treatment is worse for BOTH segments!
The "improvement" is just a shift in user mix.
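The table's arithmetic is worth checking directly: each overall rate is just the mix-weighted average of its segment rates. A quick sketch with the numbers above:

```python
def overall_rate(segments):
    """Weighted conversion rate across (rate, n) segment pairs."""
    conversions = sum(rate * n for rate, n in segments)
    total = sum(n for _, n in segments)
    return conversions / total

control = overall_rate([(0.05, 8000), (0.15, 2000)])    # mobile, desktop
treatment = overall_rate([(0.04, 2000), (0.14, 8000)])

# control = 0.07, treatment = 0.12: treatment "wins" overall
# despite losing in every segment - purely a shift in user mix.
```

The treatment moved most of its traffic into the high-converting desktop segment, which is enough to flip the overall comparison.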

Solution: Always segment results.

import pandas as pd
from scipy import stats

def segmented_analysis(df, segments=('device', 'country', 'user_type')):
    """
    Analyze treatment effect within each segment.
    If effect is inconsistent across segments, investigate.
    """
    results = []

    for segment in segments:
        for value in df[segment].unique():
            segment_df = df[df[segment] == value]

            control = segment_df[segment_df['variant'] == 'control']['converted']
            treatment = segment_df[segment_df['variant'] == 'treatment']['converted']

            effect = treatment.mean() - control.mean()
            _, p_value = stats.ttest_ind(control, treatment)

            results.append({
                'segment': segment,
                'value': value,
                'control_rate': control.mean(),
                'treatment_rate': treatment.mean(),
                'effect': effect,
                'p_value': p_value,
                'n_control': len(control),
                'n_treatment': len(treatment)
            })

    results_df = pd.DataFrame(results)

    # Check for Simpson's Paradox: mixed positive/negative effects
    effects = results_df['effect']
    if effects.min() * effects.max() < 0:
        print("⚠️ WARNING: Inconsistent effects across segments!")
        print("Possible Simpson's Paradox - investigate segment mix")

    return results_df

Cause 6: Multiple Testing Problem

Problem: Testing 20 metrics, finding 1 "significant" result.

With 20 independent tests at α = 0.05:
P(at least one false positive) = 1 - (0.95)^20 ≈ 64%!

Solution: Bonferroni correction or pre-register primary metric.

def bonferroni_correction(p_values, alpha=0.05):
    """
    Adjust the significance threshold for multiple comparisons.
    p_values: dict mapping metric name -> p-value.
    """
    n_tests = len(p_values)
    adjusted_alpha = alpha / n_tests

    results = []
    for metric, p in p_values.items():
        results.append({
            'metric': metric,
            'p_value': p,
            'adjusted_alpha': adjusted_alpha,
            'significant': p < adjusted_alpha
        })

    return results

# Example: 20 metrics tested
p_values = {f'metric_{i}': 0.04 for i in range(20)}  # All p=0.04
p_values['metric_3'] = 0.002  # One truly significant

results = bonferroni_correction(p_values)
# Only metric_3 passes (0.002 < 0.05/20 = 0.0025)
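Bonferroni controls the family-wise error rate but becomes very conservative as the metric count grows. A common, less strict alternative (not part of the original write-up) is the Benjamini-Hochberg procedure, which controls the false discovery rate instead. A minimal sketch, with illustrative p-values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest rank with p_(k) <= (k/m) * alpha.
    Controls the false discovery rate, not the family-wise error rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value clears its stepped threshold
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank

    # Reject everything at or below that rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

# Illustrative p-values from six hypothetical metrics
ps = [0.001, 0.010, 0.039, 0.041, 0.27, 0.34]
flags = benjamini_hochberg(ps)
# BH rejects the first two; Bonferroni (0.05/6 ≈ 0.0083) only the first
```

The trade-off: BH admits more true effects at the cost of tolerating a controlled fraction of false discoveries, which is often acceptable for secondary metrics.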

Pre-Launch Checklist

a_b_test_checklist:
  before_launch:
    - [ ] Define primary metric (ONE metric)
    - [ ] Calculate required sample size
    - [ ] Set fixed duration (no early peeking)
    - [ ] Verify randomization logic
    - [ ] Test tracking implementation

  during_test:
    - [ ] Check for SRM daily
    - [ ] Monitor guardrail metrics (latency, errors)
    - [ ] Don't look at primary metric until end

  after_test:
    - [ ] Verify no SRM
    - [ ] Segment analysis
    - [ ] Check for novelty effect
    - [ ] Run for 2 business cycles minimum
    - [ ] Document learnings regardless of outcome

Key Takeaways

  1. Fix sample size upfront - No peeking at p-values
  2. Check for SRM first - Invalid experiment = invalid results
  3. One primary metric - Pre-register before launch
  4. Segment everything - Simpson's Paradox is real
  5. Run long enough - At least 2 weeks to capture weekly patterns
  6. Consider novelty - New != better long-term

Golden rule: If the result seems too good to be true, it probably is. Investigate before shipping.