
A/B Test Gave Wrong Answer

The Interview Question

"We ran an A/B test on a new checkout flow. The test showed a 5% conversion increase with 95% confidence. We rolled out to 100% of users, but actual conversion dropped 3%. What went wrong?"

Asked at: Google, Meta, Netflix, Booking.com, any data-driven company

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐⭐ (Senior, with stats knowledge)


Clarifying Questions to Ask

  1. "What was the sample size?" → Too small = noisy results
  2. "How long did the test run?" → Weekday vs weekend effects
  3. "Was it truly random assignment?" → Selection bias?
  4. "What metric exactly?" → Primary vs vanity metrics
  5. "Were there multiple variants?" → Multiple comparison problem
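To answer the first question concretely, the required sample size can be estimated with a standard power calculation for comparing two proportions. A minimal sketch (the 10% baseline rate and 0.5-point minimum detectable effect are illustrative assumptions, not from the case above):

```python
from scipy.stats import norm

def required_sample_size(baseline_rate, min_detectable_effect,
                         alpha=0.05, power=0.8):
    """
    Per-group sample size for a two-sided, two-proportion z-test.
    min_detectable_effect is absolute (e.g. 0.005 = 0.5 points).
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect

    z_alpha = norm.ppf(1 - alpha / 2)  # critical value, two-sided
    z_power = norm.ppf(power)          # critical value for power

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / min_detectable_effect ** 2
    return int(n) + 1

# Illustrative: 10% baseline, detect a 0.5-point absolute lift
n = required_sample_size(0.10, 0.005)
# ≈ 58,000 users per group - small tests can't see small effects
```

Note how quickly the requirement grows: the effect size appears squared in the denominator, so halving the detectable lift quadruples the needed sample.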

Common Causes of False Positives

Cause 1: Sample Ratio Mismatch (SRM)

Problem: Groups weren't actually 50/50 due to implementation bugs.

# Detection
from scipy import stats

def check_sample_ratio(control_size, treatment_size, expected_ratio=0.5):
    """
    Chi-squared test for sample ratio mismatch.
    If p < 0.01, there's likely a bug in randomization.
    """
    total = control_size + treatment_size
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)

    chi2, p_value = stats.chisquare(
        [control_size, treatment_size],
        [expected_control, expected_treatment]
    )

    return {
        'actual_ratio': control_size / total,
        'chi2': chi2,
        'p_value': p_value,
        'has_srm': p_value < 0.01
    }

# Example
result = check_sample_ratio(
    control_size=48000,    # Expected: 50000
    treatment_size=52000   # Expected: 50000
)
# Output: has_srm=True, p_value far below 0.01 (here ~1e-36)
# ⚠️ SRM detected - randomization is broken!

Common SRM causes:

  • Bots filtered differently per variant
  • Redirects causing user loss
  • Different page load times (users abandon before tracking)
  • Cookie-based assignment on cookie-less users

Cause 2: Peeking Problem (Early Stopping)

Problem: Checking results repeatedly and stopping when significant.

Day 1: p = 0.42 (not significant)
Day 2: p = 0.23 (not significant)
Day 3: p = 0.08 (almost!)
Day 4: p = 0.04 (significant! ship it!) ← WRONG!

Why it's wrong: With repeated checking, false positive rate balloons.

import numpy as np
from scipy import stats

def simulate_peeking_false_positive_rate(
    n_simulations=10000,
    check_days=(7, 14, 21, 28),
    daily_samples=1000
):
    """
    Simulate the false positive rate when peeking at p-values
    at several interim checkpoints, stopping at the first
    "significant" result. True effect is 0 (no difference).
    """
    false_positives = 0

    for _ in range(n_simulations):
        control = []
        treatment = []

        for day in range(max(check_days)):
            # No true effect - both from same distribution
            control.extend(np.random.normal(0, 1, daily_samples))
            treatment.extend(np.random.normal(0, 1, daily_samples))

            if day + 1 in check_days:
                _, p_value = stats.ttest_ind(control, treatment)
                if p_value < 0.05:
                    false_positives += 1
                    break  # Stopped early due to "significance"

    return false_positives / n_simulations

# Result: ~13% false positive rate with these four checks - and
# well over 20% with daily peeking - instead of the nominal 5%!

Solution: Use sequential testing or commit to fixed sample size.

# Sequential testing with Bayesian approach
import numpy as np
from scipy.stats import beta

def bayesian_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total,
                     threshold=0.95):
    """
    Bayesian A/B test with a Beta-Bernoulli model. Far more robust
    to continuous monitoring than repeated t-tests (though optional
    stopping can still bias it somewhat).
    """
    # Posterior distributions (uniform Beta(1, 1) prior)
    control_alpha = control_conversions + 1
    control_beta = control_total - control_conversions + 1

    treatment_alpha = treatment_conversions + 1
    treatment_beta = treatment_total - treatment_conversions + 1

    # Monte Carlo samples from each posterior
    n_samples = 100000
    control_samples = beta.rvs(control_alpha, control_beta, size=n_samples)
    treatment_samples = beta.rvs(treatment_alpha, treatment_beta, size=n_samples)

    prob_treatment_better = np.mean(treatment_samples > control_samples)
    expected_lift = np.mean(treatment_samples - control_samples)

    return {
        'prob_treatment_better': prob_treatment_better,
        'expected_lift': expected_lift,
        'significant': (prob_treatment_better > threshold
                        or prob_treatment_better < (1 - threshold))
    }

Cause 3: Novelty Effect

Problem: Users interact more with new UI just because it's new.

Test Week 1: +8% engagement
Test Week 2: +5% engagement
Test Week 3: +2% engagement
After launch: -1% engagement ← Novelty wore off

Solution: Run test longer, segment by new vs returning users.

import numpy as np
from sklearn.linear_model import LinearRegression

def analyze_novelty_effect(df):
    """
    Check if the treatment effect decays over time - a sign of novelty.
    Expects columns: date, variant ('control'/'treatment'), converted (0/1).
    """
    # Daily conversion rate per variant
    daily = df.groupby(['date', 'variant'])['converted'].mean().unstack()

    # Relative lift of treatment over control, per day
    daily['lift'] = (daily['treatment'] - daily['control']) / daily['control']

    # Fit a linear trend to the daily lift to check for decay
    X = np.arange(len(daily)).reshape(-1, 1)
    y = daily['lift'].values

    model = LinearRegression().fit(X, y)
    slope = model.coef_[0]

    if slope < -0.01:  # Negative slope = decaying effect
        print(f"⚠️ Novelty effect detected. Lift decaying at {slope:.2%} per day")
        return True
    return False

Cause 4: Survivor Bias

Problem: Test only measured users who completed the flow, not those who gave up.

Control: 100 users start → 80 complete checkout → 40 buy (50% conversion)
Treatment: 100 users start → 60 complete checkout → 36 buy (60% conversion!)

BUT: Treatment actually lost 20 more users before checkout!
Real conversion: Control 40%, Treatment 36%

Solution: Measure from intent, not from completion.

def calculate_intent_to_purchase(variant_data):
    """
    Calculate conversion from session start, not checkout page load.
    """
    # Wrong: conversion = purchases / checkout_page_views
    wrong_conversion = variant_data['purchases'] / variant_data['checkout_views']

    # Right: conversion = purchases / product_page_views
    correct_conversion = variant_data['purchases'] / variant_data['product_views']

    return {
        'checkout_conversion': wrong_conversion,  # Misleading
        'true_conversion': correct_conversion     # Accurate
    }

Cause 5: Segment Simpson's Paradox

Problem: Overall effect positive, but negative for every segment.

            Mobile Users     Desktop Users    Overall
Control     5%  (n=8,000)    15% (n=2,000)    7%
Treatment   4%  (n=2,000)    14% (n=8,000)    12% ← Higher!

But: Treatment is worse for BOTH segments!
The "improvement" is just a shift in user mix.
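The table's arithmetic is worth checking directly: each overall rate is just the mix-weighted average of its segment rates. A quick sketch with the numbers above:

```python
def overall_rate(segments):
    """Weighted conversion rate across (rate, n) segment pairs."""
    conversions = sum(rate * n for rate, n in segments)
    total = sum(n for _, n in segments)
    return conversions / total

control = overall_rate([(0.05, 8000), (0.15, 2000)])    # mobile, desktop
treatment = overall_rate([(0.04, 2000), (0.14, 8000)])

# control = 0.07, treatment = 0.12: treatment "wins" overall
# despite losing in every segment - purely a shift in user mix.
```

The treatment moved most of its traffic into the high-converting desktop segment, which is enough to flip the overall comparison.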

Solution: Always segment results.

import pandas as pd
from scipy import stats

def segmented_analysis(df, segments=('device', 'country', 'user_type')):
    """
    Analyze treatment effect within each segment.
    If effect is inconsistent across segments, investigate.
    """
    results = []

    for segment in segments:
        for value in df[segment].unique():
            segment_df = df[df[segment] == value]

            control = segment_df[segment_df['variant'] == 'control']['converted']
            treatment = segment_df[segment_df['variant'] == 'treatment']['converted']

            effect = treatment.mean() - control.mean()
            _, p_value = stats.ttest_ind(control, treatment)

            results.append({
                'segment': segment,
                'value': value,
                'control_rate': control.mean(),
                'treatment_rate': treatment.mean(),
                'effect': effect,
                'p_value': p_value,
                'n_control': len(control),
                'n_treatment': len(treatment)
            })

    results_df = pd.DataFrame(results)

    # Check for Simpson's Paradox: mixed positive/negative effects
    effects = results_df['effect']
    if effects.min() * effects.max() < 0:
        print("⚠️ WARNING: Inconsistent effects across segments!")
        print("Possible Simpson's Paradox - investigate segment mix")

    return results_df

Cause 6: Multiple Testing Problem

Problem: Testing 20 metrics, finding 1 "significant" result.

With 20 independent tests at α = 0.05:
P(at least one false positive) = 1 - (0.95)^20 ≈ 64%!

Solution: Bonferroni correction or pre-register primary metric.

def bonferroni_correction(p_values, alpha=0.05):
    """
    Adjust the significance threshold for multiple comparisons.
    p_values: dict mapping metric name -> p-value.
    """
    n_tests = len(p_values)
    adjusted_alpha = alpha / n_tests

    results = []
    for metric, p in p_values.items():
        results.append({
            'metric': metric,
            'p_value': p,
            'adjusted_alpha': adjusted_alpha,
            'significant': p < adjusted_alpha
        })

    return results

# Example: 20 metrics tested
p_values = {f'metric_{i}': 0.04 for i in range(20)}  # All p=0.04
p_values['metric_3'] = 0.002  # One truly significant

results = bonferroni_correction(p_values)
# Only metric_3 passes (0.002 < 0.05/20 = 0.0025)
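Bonferroni controls the family-wise error rate but becomes very conservative as the metric count grows. A common, less strict alternative (not part of the original write-up) is the Benjamini-Hochberg procedure, which controls the false discovery rate instead. A minimal sketch, with illustrative p-values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """
    Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest rank with p_(k) <= (k/m) * alpha.
    Controls the false discovery rate, not the family-wise error rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value clears its stepped threshold
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank

    # Reject everything at or below that rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

# Illustrative p-values from six hypothetical metrics
ps = [0.001, 0.010, 0.039, 0.041, 0.27, 0.34]
flags = benjamini_hochberg(ps)
# BH rejects the first two; Bonferroni (0.05/6 ≈ 0.0083) only the first
```

The trade-off: BH admits more true effects at the cost of tolerating a controlled fraction of false discoveries, which is often acceptable for secondary metrics.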

Pre-Launch Checklist

a_b_test_checklist:
  before_launch:
    - [ ] Define primary metric (ONE metric)
    - [ ] Calculate required sample size
    - [ ] Set fixed duration (no early peeking)
    - [ ] Verify randomization logic
    - [ ] Test tracking implementation

  during_test:
    - [ ] Check for SRM daily
    - [ ] Monitor guardrail metrics (latency, errors)
    - [ ] Don't look at primary metric until end

  after_test:
    - [ ] Verify no SRM
    - [ ] Segment analysis
    - [ ] Check for novelty effect
    - [ ] Run for 2 business cycles minimum
    - [ ] Document learnings regardless of outcome

Key Takeaways

  1. Fix sample size upfront - No peeking at p-values
  2. Check for SRM first - Invalid experiment = invalid results
  3. One primary metric - Pre-register before launch
  4. Segment everything - Simpson's Paradox is real
  5. Run long enough - At least 2 weeks to capture weekly patterns
  6. Consider novelty - New != better long-term

Golden rule: If the result seems too good to be true, it probably is. Investigate before shipping.