Search Ranking Disaster

The Interview Question

"We A/B tested a new ML search ranking model. Test showed +8% click-through rate. We rolled out to 100% of users. Two weeks later, revenue is down 15% and user engagement dropped. What happened?"

Asked at: Google, Amazon, Netflix, any company with search/recommendations

Time to solve: 35-40 minutes

Difficulty: ⭐⭐⭐⭐ (Senior ML + Product)


Clarifying Questions to Ask

  1. "What was the A/B test duration?" → Too short = seasonal effects
  2. "What metrics were tracked?" → CTR vs revenue vs user value
  3. "How was the model trained?" → Training data bias?
  4. "What's the user feedback loop?" → Click = engagement ≠ satisfaction
  5. "Did we segment results?" → Some segments might be hurt

Common ML Production Failures

Failure 1: Optimizing the Wrong Metric (Goodhart's Law)

# 🔴 Model optimized for clicks
# But clicks != purchases != long-term value

# Model learned: Show clickbait titles first
# Result: Users click, don't buy, lose trust

# Example:
# Old ranking: [Quality Product, Good Match, Relevant Item]
# New ranking: [AMAZING DEAL 90% OFF!!!, Click Here!!!, ...]

# CTR up ✅
# Conversion down ❌
# LTV down ❌❌

The Fix: Multi-Objective Optimization

class SearchRanker:
    def __init__(self):
        self.click_model = load_model('click_model')
        self.purchase_model = load_model('purchase_model')
        self.satisfaction_model = load_model('satisfaction_model')

    def rank(self, query, items, user):
        scores = []
        for item in items:
            features = extract_features(query, item, user)

            # Multi-objective score
            click_prob = self.click_model.predict(features)
            purchase_prob = self.purchase_model.predict(features)
            satisfaction = self.satisfaction_model.predict(features)

            # Weighted combination - revenue AND satisfaction
            score = (
                0.2 * click_prob +
                0.4 * purchase_prob +
                0.3 * satisfaction +
                0.1 * item.quality_score  # Editorial quality
            )
            scores.append((item, score))

        return sorted(scores, key=lambda x: -x[1])

Failure 2: Feedback Loop Amplification

Week 1: Model ranks Product A higher
Week 2: Product A gets more clicks (because it's shown more)
Week 3: Model sees more clicks → ranks A even higher
Week 4: Product A dominates all searches (regardless of relevance)

Meanwhile: Good products never get shown → can't prove they're good

The Fix: Exploration-Exploitation Balance

import numpy as np

class ExplorationRanker:
    def __init__(self, model, exploration_rate=0.1):
        self.model = model  # Base ranking model
        self.exploration_rate = exploration_rate

    def rank(self, query, items, user):
        # Get model scores
        base_ranking = self.model.rank(query, items, user)

        # Randomly explore for some percentage of requests
        if np.random.random() < self.exploration_rate:
            # Thompson sampling or random shuffle for exploration
            return self.explore_rank(items)

        return base_ranking

    def explore_rank(self, items):
        """
        Show some items that wouldn't normally be top-ranked.
        Collect data on items with uncertain quality.
        """
        # Mix: 70% model ranking, 30% random (for a 10-slot page)
        model_items = items[:7]
        explore_items = np.random.choice(items[7:], size=3, replace=False)

        combined = list(model_items) + list(explore_items)
        np.random.shuffle(combined)  # Mix them up

        return combined
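The comment above mentions Thompson sampling as an exploration strategy. Here is a minimal Beta-Bernoulli sketch (the class name and API are illustrative, not part of the ranker above): each item keeps a Beta posterior over its click-through rate, and ranking by a posterior draw naturally gives uncertain items an occasional shot at the top.

```python
import numpy as np

class ThompsonSamplingRanker:
    """Rank items by sampling from a Beta posterior over each item's CTR."""

    def __init__(self, item_ids, seed=None):
        self.rng = np.random.default_rng(seed)
        # Beta(1, 1) prior = uniform over click-through rates
        self.alpha = {i: 1.0 for i in item_ids}  # clicks + 1
        self.beta = {i: 1.0 for i in item_ids}   # non-clicks + 1

    def rank(self, item_ids):
        # One posterior draw per item; items with wide (uncertain)
        # posteriors sometimes draw high and get shown
        draws = {i: self.rng.beta(self.alpha[i], self.beta[i]) for i in item_ids}
        return sorted(item_ids, key=lambda i: -draws[i])

    def update(self, item_id, clicked):
        # Posterior update from observed feedback
        if clicked:
            self.alpha[item_id] += 1
        else:
            self.beta[item_id] += 1
```

Items with few impressions keep wide posteriors, so they continue to be explored; items with lots of negative feedback are quickly demoted.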

Failure 3: Position Bias Not Corrected

# 🔴 Problem: Items shown at position 1 always get more clicks
# Model learns: "Product A gets clicks"
# Reality: "Position 1 gets clicks, A happened to be there"

# Training data:
# | Item | Position | Clicked |
# |------|----------|---------|
# | A | 1 | Yes | ← A was shown first
# | B | 5 | No | ← B never had a chance

# Model conclusion: A > B
# Reality: B might be better if given position 1

The Fix: Inverse Propensity Weighting

class UnbiasedRankingTrainer:
    def __init__(self, model):
        self.model = model
        # Position click-through rates from randomized experiments
        self.position_bias = {
            1: 0.35,  # 35% CTR at position 1
            2: 0.20,
            3: 0.12,
            4: 0.08,
            5: 0.05,
            # ...
        }

    def compute_ipw_weight(self, position):
        """
        Inverse Propensity Weight to correct for position bias.
        """
        return 1.0 / self.position_bias.get(position, 0.01)

    def train(self, training_data):
        weighted_data = []

        for sample in training_data:  # each sample is a dict of features + metadata
            weight = self.compute_ipw_weight(sample['position'])

            # A click at position 5 is stronger evidence of relevance
            # than a click at position 1 - weight it higher
            weighted_data.append({
                **sample,
                'weight': weight if sample['clicked'] else 1.0
            })

        return self.model.fit(weighted_data, sample_weight='weight')
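To make the weighting concrete, here is the arithmetic using the propensities above (standalone numbers, no trainer needed):

```python
# Position CTRs from the randomized experiment above
position_bias = {1: 0.35, 2: 0.20, 3: 0.12, 4: 0.08, 5: 0.05}

# Inverse propensity weights
w1 = 1.0 / position_bias[1]  # ≈ 2.86
w5 = 1.0 / position_bias[5]  # ≈ 20.0

# A click at position 5 counts ~7x more than a click at position 1
print(round(w5 / w1, 1))  # 7.0
```

Intuitively: only ~5% of users even consider position 5, so one click there represents roughly seven position-1 clicks' worth of evidence.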

Failure 4: Distribution Shift

# 🔴 Model trained on old data, deployed to new reality

# Training data (last year):
# - Black Friday: 5% of queries
# - Normal shopping: 95%

# Deployment reality (Black Friday week):
# - Black Friday: 80% of queries
# - Normal shopping: 20%

# Model never saw "Black Friday mindset" users at scale
# Recommendations are wrong for deal-seekers

The Fix: Monitor Data Drift

from scipy import stats

class DataDriftMonitor:
    def __init__(self, reference_distribution):
        self.reference = reference_distribution

    def detect_drift(self, current_data, threshold=0.05):
        """
        KS test to detect if current data differs from training data.
        """
        drift_detected = {}

        for feature in self.reference.columns:
            statistic, p_value = stats.ks_2samp(
                self.reference[feature],
                current_data[feature]
            )

            drift_detected[feature] = {
                'statistic': statistic,
                'p_value': p_value,
                'drift': p_value < threshold
            }

            if p_value < threshold:
                alert(f"Data drift detected in {feature}!")

        return drift_detected

    def should_retrain(self):
        """
        Trigger retraining if significant drift detected.
        """
        drift_report = self.detect_drift(get_recent_data())
        drifted_features = [f for f, d in drift_report.items() if d['drift']]

        return len(drifted_features) > len(drift_report) * 0.2  # >20% of features drifted
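As a quick sanity check of the KS approach, here is a standalone example on synthetic data (the distributions and shift size are illustrative): a modest mean shift is flagged reliably at production-scale sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Reference (training-time) feature vs. production samples
reference = rng.normal(loc=0.0, scale=1.0, size=5000)
shifted = rng.normal(loc=0.3, scale=1.0, size=5000)  # mean drifted by 0.3
stable = rng.normal(loc=0.0, scale=1.0, size=5000)   # same distribution

_, p_shifted = stats.ks_2samp(reference, shifted)
_, p_stable = stats.ks_2samp(reference, stable)

print(p_shifted < 0.05)  # True: drift flagged
```

Note the flip side: with 5,000 samples per day, even tiny, harmless shifts become "statistically significant", so in practice teams often pair the p-value with an effect-size threshold on the KS statistic itself.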

Failure 5: A/B Test Was Too Short

# 🔴 One-week test missed important patterns:

# Week 1 (test): New users excited about new UI → +8% CTR
# Week 2: Novelty wears off → +2% CTR
# Week 3: Users can't find what they need → -5% CTR
# Week 4+: Users leave platform → -15% revenue

# Also missed:
# - Weekend vs weekday patterns
# - Monthly paycheck cycle
# - Seasonal products

The Fix: Proper Experiment Design

def calculate_minimum_test_duration(
    baseline_rate: float,
    minimum_detectable_effect: float,
    confidence_level: float = 0.95,
    power: float = 0.80,
    daily_traffic: int = 100000
) -> dict:
    """
    Calculate proper A/B test duration.
    """
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Cohen's h effect size for comparing two proportions
    effect_size = proportion_effectsize(
        baseline_rate + minimum_detectable_effect,
        baseline_rate
    )

    # Calculate required sample size
    analysis = NormalIndPower()
    sample_size_per_group = analysis.solve_power(
        effect_size=effect_size,
        alpha=1 - confidence_level,
        power=power,
        ratio=1.0,
        alternative='two-sided'
    )

    total_samples_needed = sample_size_per_group * 2
    days_needed = total_samples_needed / daily_traffic

    # Minimum duration rules
    minimum_days = max(
        days_needed,
        14,  # At least 2 weeks for weekly patterns
        # 28 for monthly patterns if applicable
    )

    return {
        'sample_size_per_group': sample_size_per_group,
        'days_needed_statistically': days_needed,
        'recommended_days': minimum_days,
        'reason': 'Include at least 2 weekends and account for novelty effect'
    }

Failure 6: Segment-Level Harm Hidden in Aggregate

# 🔴 Aggregate metrics looked good, but...

# Segment breakdown:
# Power users (10% of users, 40% of revenue): -20% engagement
# Casual users (90% of users, 60% of revenue): +10% engagement

# Net effect: +8% engagement looks good!
# But power users are leaving → long-term revenue disaster

The Fix: Always Segment Analysis

def segmented_experiment_analysis(experiment_data):
    """
    Analyze experiment across key user segments.
    """
    segments = {
        'user_tenure': ['new', '1-6_months', '6-12_months', '1yr+'],
        'activity_level': ['low', 'medium', 'high', 'power'],
        'platform': ['ios', 'android', 'web'],
        'geography': ['us', 'eu', 'apac', 'other'],
        'subscription': ['free', 'premium'],
    }

    results = {}

    for segment_name, segment_values in segments.items():
        for value in segment_values:
            segment_data = experiment_data[
                experiment_data[segment_name] == value
            ]

            control = segment_data[segment_data['variant'] == 'control']
            treatment = segment_data[segment_data['variant'] == 'treatment']

            metrics = {
                'ctr': (treatment['clicked'].mean() - control['clicked'].mean()) / control['clicked'].mean(),
                'conversion': (treatment['purchased'].mean() - control['purchased'].mean()) / control['purchased'].mean(),
                'revenue': (treatment['revenue'].sum() - control['revenue'].sum()) / control['revenue'].sum(),
            }

            results[f"{segment_name}:{value}"] = metrics

            # Alert if any segment shows significant harm
            if metrics['revenue'] < -0.05:  # 5% drop
                alert(f"⚠️ Segment {segment_name}={value} shows revenue drop: {metrics['revenue']:.1%}")

    return results

Safe ML Model Rollout

ml_rollout_checklist:

  pre_launch:
    - [ ] Train on diverse, representative data
    - [ ] Evaluate on held-out time periods (not random split)
    - [ ] Check for position bias correction
    - [ ] Define guardrail metrics (must not degrade)

  a_b_test:
    - [ ] Run for 2+ weeks minimum
    - [ ] Analyze all key segments
    - [ ] Check novelty effect (is lift decreasing over time?)
    - [ ] Monitor long-term metrics (retention, LTV)

  gradual_rollout:
    - [ ] 1% → 5% → 25% → 50% → 100%
    - [ ] Wait 1 week at each stage
    - [ ] Monitor business metrics at each stage
    - [ ] Automatic rollback if guardrails breached

  post_launch:
    - [ ] Monitor for feedback loops
    - [ ] Watch for distribution drift
    - [ ] Plan for model refresh/retraining
    - [ ] Document learnings
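The gradual-rollout stages above can be sketched as a loop with guardrail checks and automatic rollback. This is a minimal illustration: the metric names, thresholds, and callback signatures are assumptions, and in a real system each stage would run for about a week before `get_stage_metrics` is called.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]
# Maximum tolerated drop vs. control for each guardrail metric
GUARDRAILS = {'revenue_delta': -0.02, 'retention_7d_delta': -0.01}

def run_gradual_rollout(get_stage_metrics, set_traffic_fraction):
    """
    Advance traffic stage by stage; roll back to 0% if any guardrail breaches.
    `get_stage_metrics(fraction)` is assumed to return metric deltas vs.
    control after the stage has run long enough to be meaningful.
    """
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)
        metrics = get_stage_metrics(fraction)

        breached = [m for m, floor in GUARDRAILS.items()
                    if metrics.get(m, 0.0) < floor]
        if breached:
            set_traffic_fraction(0.0)  # automatic rollback
            return {'status': 'rolled_back',
                    'at_fraction': fraction,
                    'breached': breached}

    return {'status': 'fully_rolled_out'}
```

The key property: a regression that only shows up at scale (say, at the 25% stage) never reaches 100% of users, and the rollback needs no human in the loop.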

Monitoring Dashboard

# Key metrics to track for search ranking
RANKING_METRICS = {
    'immediate': [
        'click_through_rate',
        'queries_per_session',
        'search_result_position_clicked',
    ],
    'short_term': [
        'conversion_rate',
        'average_order_value',
        'items_per_order',
    ],
    'long_term': [
        'user_retention_7d',
        'user_retention_30d',
        'customer_lifetime_value',
    ],
    'health': [
        'model_prediction_latency_p99',
        'coverage_rate',    # % queries with results
        'diversity_score',  # Variety of items shown
    ],
}

Key Takeaways

  1. CTR ≠ Value - Optimize for revenue and satisfaction, not clicks
  2. Feedback loops are dangerous - Include exploration
  3. Position bias is real - Correct for it in training
  4. 2+ weeks minimum - Capture weekly patterns, novelty fade
  5. Segment everything - Aggregate success can hide segment harm
  6. Gradual rollout - Stop at each stage to verify

Golden rule: The A/B test metric is rarely the metric that matters. Track what you optimize, but also track what matters to the business.