Search Ranking Disaster

The Interview Question

"We A/B tested a new ML search ranking model. Test showed +8% click-through rate. We rolled out to 100% of users. Two weeks later, revenue is down 15% and user engagement dropped. What happened?"

Asked at: Google, Amazon, Netflix, any company with search/recommendations

Time to solve: 35-40 minutes

Difficulty: ⭐⭐⭐⭐ (Senior ML + Product)


Clarifying Questions to Ask

  1. "What was the A/B test duration?" → Too short = seasonal effects
  2. "What metrics were tracked?" → CTR vs revenue vs user value
  3. "How was the model trained?" → Training data bias?
  4. "What's the user feedback loop?" → Click = engagement ≠ satisfaction
  5. "Did we segment results?" → Some segments might be hurt

Common ML Production Failures

Failure 1: Optimizing the Wrong Metric (Goodhart's Law)

# 🔴 Model optimized for clicks
# But clicks != purchases != long-term value

# Model learned: Show clickbait titles first
# Result: Users click, don't buy, lose trust

# Example:
# Old ranking: [Quality Product, Good Match, Relevant Item]
# New ranking: [AMAZING DEAL 90% OFF!!!, Click Here!!!, ...]

# CTR up ✅
# Conversion down ❌
# LTV down ❌❌

The Fix: Multi-Objective Optimization

class SearchRanker:
    def __init__(self):
        self.click_model = load_model('click_model')
        self.purchase_model = load_model('purchase_model')
        self.satisfaction_model = load_model('satisfaction_model')

    def rank(self, query, items, user):
        scores = []
        for item in items:
            features = extract_features(query, item, user)

            # Multi-objective score
            click_prob = self.click_model.predict(features)
            purchase_prob = self.purchase_model.predict(features)
            satisfaction = self.satisfaction_model.predict(features)

            # Weighted combination - revenue AND satisfaction
            score = (
                0.2 * click_prob +
                0.4 * purchase_prob +
                0.3 * satisfaction +
                0.1 * item.quality_score  # Editorial quality
            )
            scores.append((item, score))

        return sorted(scores, key=lambda x: -x[1])

Failure 2: Feedback Loop Amplification

Week 1: Model ranks Product A higher
Week 2: Product A gets more clicks (because it's shown more)
Week 3: Model sees more clicks → ranks A even higher
Week 4: Product A dominates all searches (regardless of relevance)

Meanwhile: Good products never get shown → can't prove they're good

The Fix: Exploration-Exploitation Balance

import numpy as np

class ExplorationRanker:
    def __init__(self, model, exploration_rate=0.1):
        self.model = model  # Base ranking model
        self.exploration_rate = exploration_rate

    def rank(self, query, items, user):
        # Get model scores
        base_ranking = self.model.rank(query, items, user)

        # Randomly explore for some percentage of requests
        if np.random.random() < self.exploration_rate:
            # Thompson sampling or random shuffle for exploration
            return self.explore_rank(items)

        return base_ranking

    def explore_rank(self, items):
        """
        Show some items that wouldn't normally be top-ranked.
        Collect data on items with uncertain quality.
        """
        # Mix: 70% model ranking, 30% random (for a 10-slot page)
        model_items = items[:7]
        explore_items = np.random.choice(items[7:], size=3, replace=False)

        combined = list(model_items) + list(explore_items)
        np.random.shuffle(combined)  # Mix them up

        return combined
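The comment above mentions Thompson sampling as an exploration strategy. Here is a minimal Beta-Bernoulli sketch (the class name and API are illustrative, not part of the ranker above): each item keeps a Beta posterior over its click-through rate, and ranking by a posterior draw naturally gives uncertain items an occasional shot at the top.

```python
import numpy as np

class ThompsonSamplingRanker:
    """Rank items by sampling from a Beta posterior over each item's CTR."""

    def __init__(self, item_ids, seed=None):
        self.rng = np.random.default_rng(seed)
        # Beta(1, 1) prior = uniform over click-through rates
        self.alpha = {i: 1.0 for i in item_ids}  # clicks + 1
        self.beta = {i: 1.0 for i in item_ids}   # non-clicks + 1

    def rank(self, item_ids):
        # One posterior draw per item; items with wide (uncertain)
        # posteriors sometimes draw high and get shown
        draws = {i: self.rng.beta(self.alpha[i], self.beta[i]) for i in item_ids}
        return sorted(item_ids, key=lambda i: -draws[i])

    def update(self, item_id, clicked):
        # Posterior update from observed feedback
        if clicked:
            self.alpha[item_id] += 1
        else:
            self.beta[item_id] += 1
```

Items with few impressions keep wide posteriors, so they continue to be explored; items with lots of negative feedback are quickly demoted.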

Failure 3: Position Bias Not Corrected

# 🔴 Problem: Items shown at position 1 always get more clicks
# Model learns: "Product A gets clicks"
# Reality: "Position 1 gets clicks, A happened to be there"

# Training data:
# | Item | Position | Clicked |
# |------|----------|---------|
# | A | 1 | Yes | ← A was shown first
# | B | 5 | No | ← B never had a chance

# Model conclusion: A > B
# Reality: B might be better if given position 1

The Fix: Inverse Propensity Weighting

class UnbiasedRankingTrainer:
    def __init__(self, model):
        self.model = model
        # Position click-through rates from randomized experiments
        self.position_bias = {
            1: 0.35,  # 35% CTR at position 1
            2: 0.20,
            3: 0.12,
            4: 0.08,
            5: 0.05,
            # ...
        }

    def compute_ipw_weight(self, position):
        """
        Inverse Propensity Weight to correct for position bias.
        """
        return 1.0 / self.position_bias.get(position, 0.01)

    def train(self, training_data):
        weighted_data = []

        for sample in training_data:  # each sample is a dict of features + metadata
            weight = self.compute_ipw_weight(sample['position'])

            # A click at position 5 is stronger evidence of relevance
            # than a click at position 1 - weight it higher
            weighted_data.append({
                **sample,
                'weight': weight if sample['clicked'] else 1.0
            })

        return self.model.fit(weighted_data, sample_weight='weight')
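To make the weighting concrete, here is the arithmetic using the propensities above (standalone numbers, no trainer needed):

```python
# Position CTRs from the randomized experiment above
position_bias = {1: 0.35, 2: 0.20, 3: 0.12, 4: 0.08, 5: 0.05}

# Inverse propensity weights
w1 = 1.0 / position_bias[1]  # ≈ 2.86
w5 = 1.0 / position_bias[5]  # ≈ 20.0

# A click at position 5 counts ~7x more than a click at position 1
print(round(w5 / w1, 1))  # 7.0
```

Intuitively: only ~5% of users even consider position 5, so one click there represents roughly seven position-1 clicks' worth of evidence.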

Failure 4: Distribution Shift

# 🔴 Model trained on old data, deployed to new reality

# Training data (last year):
# - Black Friday: 5% of queries
# - Normal shopping: 95%

# Deployment reality (Black Friday week):
# - Black Friday: 80% of queries
# - Normal shopping: 20%

# Model never saw "Black Friday mindset" users at scale
# Recommendations are wrong for deal-seekers

The Fix: Monitor Data Drift

from scipy import stats

class DataDriftMonitor:
    def __init__(self, reference_distribution):
        self.reference = reference_distribution

    def detect_drift(self, current_data, threshold=0.05):
        """
        KS test to detect if current data differs from training data.
        """
        drift_detected = {}

        for feature in self.reference.columns:
            statistic, p_value = stats.ks_2samp(
                self.reference[feature],
                current_data[feature]
            )

            drift_detected[feature] = {
                'statistic': statistic,
                'p_value': p_value,
                'drift': p_value < threshold
            }

            if p_value < threshold:
                alert(f"Data drift detected in {feature}!")

        return drift_detected

    def should_retrain(self):
        """
        Trigger retraining if significant drift detected.
        """
        drift_report = self.detect_drift(get_recent_data())
        drifted_features = [f for f, d in drift_report.items() if d['drift']]

        return len(drifted_features) > len(drift_report) * 0.2  # >20% of features drifted
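As a quick sanity check of the KS approach, here is a standalone example on synthetic data (the distributions and shift size are illustrative): a modest mean shift is flagged reliably at production-scale sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Reference (training-time) feature vs. production samples
reference = rng.normal(loc=0.0, scale=1.0, size=5000)
shifted = rng.normal(loc=0.3, scale=1.0, size=5000)  # mean drifted by 0.3
stable = rng.normal(loc=0.0, scale=1.0, size=5000)   # same distribution

_, p_shifted = stats.ks_2samp(reference, shifted)
_, p_stable = stats.ks_2samp(reference, stable)

print(p_shifted < 0.05)  # True: drift flagged
```

Note the flip side: with 5,000 samples per day, even tiny, harmless shifts become "statistically significant", so in practice teams often pair the p-value with an effect-size threshold on the KS statistic itself.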

Failure 5: A/B Test Was Too Short

# 🔴 One-week test missed important patterns:

# Week 1 (test): New users excited about new UI → +8% CTR
# Week 2: Novelty wears off → +2% CTR
# Week 3: Users can't find what they need → -5% CTR
# Week 4+: Users leave platform → -15% revenue

# Also missed:
# - Weekend vs weekday patterns
# - Monthly paycheck cycle
# - Seasonal products

The Fix: Proper Experiment Design

def calculate_minimum_test_duration(
    baseline_rate: float,
    minimum_detectable_effect: float,
    confidence_level: float = 0.95,
    power: float = 0.80,
    daily_traffic: int = 100000
) -> dict:
    """
    Calculate proper A/B test duration.
    """
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Cohen's h effect size for comparing two proportions
    effect_size = proportion_effectsize(
        baseline_rate + minimum_detectable_effect,
        baseline_rate
    )

    # Calculate required sample size
    analysis = NormalIndPower()
    sample_size_per_group = analysis.solve_power(
        effect_size=effect_size,
        alpha=1 - confidence_level,
        power=power,
        ratio=1.0,
        alternative='two-sided'
    )

    total_samples_needed = sample_size_per_group * 2
    days_needed = total_samples_needed / daily_traffic

    # Minimum duration rules
    minimum_days = max(
        days_needed,
        14,  # At least 2 weeks for weekly patterns
        # 28 for monthly patterns if applicable
    )

    return {
        'sample_size_per_group': sample_size_per_group,
        'days_needed_statistically': days_needed,
        'recommended_days': minimum_days,
        'reason': 'Include at least 2 weekends and account for novelty effect'
    }

Failure 6: Segment-Level Harm Hidden in Aggregate

# 🔴 Aggregate metrics looked good, but...

# Segment breakdown:
# Power users (10% of users, 40% of revenue): -20% engagement
# Casual users (90% of users, 60% of revenue): +10% engagement

# Net effect: +8% engagement looks good!
# But power users are leaving → long-term revenue disaster

The Fix: Always Segment Analysis

def segmented_experiment_analysis(experiment_data):
    """
    Analyze experiment across key user segments.
    """
    segments = {
        'user_tenure': ['new', '1-6_months', '6-12_months', '1yr+'],
        'activity_level': ['low', 'medium', 'high', 'power'],
        'platform': ['ios', 'android', 'web'],
        'geography': ['us', 'eu', 'apac', 'other'],
        'subscription': ['free', 'premium'],
    }

    results = {}

    for segment_name, segment_values in segments.items():
        for value in segment_values:
            segment_data = experiment_data[
                experiment_data[segment_name] == value
            ]

            control = segment_data[segment_data['variant'] == 'control']
            treatment = segment_data[segment_data['variant'] == 'treatment']

            metrics = {
                'ctr': (treatment['clicked'].mean() - control['clicked'].mean()) / control['clicked'].mean(),
                'conversion': (treatment['purchased'].mean() - control['purchased'].mean()) / control['purchased'].mean(),
                'revenue': (treatment['revenue'].sum() - control['revenue'].sum()) / control['revenue'].sum(),
            }

            results[f"{segment_name}:{value}"] = metrics

            # Alert if any segment shows significant harm
            if metrics['revenue'] < -0.05:  # 5% drop
                alert(f"⚠️ Segment {segment_name}={value} shows revenue drop: {metrics['revenue']:.1%}")

    return results

Safe ML Model Rollout

ml_rollout_checklist:

  pre_launch:
    - [ ] Train on diverse, representative data
    - [ ] Evaluate on held-out time periods (not random split)
    - [ ] Check for position bias correction
    - [ ] Define guardrail metrics (must not degrade)

  a_b_test:
    - [ ] Run for 2+ weeks minimum
    - [ ] Analyze all key segments
    - [ ] Check novelty effect (is lift decreasing over time?)
    - [ ] Monitor long-term metrics (retention, LTV)

  gradual_rollout:
    - [ ] 1% → 5% → 25% → 50% → 100%
    - [ ] Wait 1 week at each stage
    - [ ] Monitor business metrics at each stage
    - [ ] Automatic rollback if guardrails breached

  post_launch:
    - [ ] Monitor for feedback loops
    - [ ] Watch for distribution drift
    - [ ] Plan for model refresh/retraining
    - [ ] Document learnings
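The gradual-rollout stages above can be sketched as a loop with guardrail checks and automatic rollback. This is a minimal illustration: the metric names, thresholds, and callback signatures are assumptions, and in a real system each stage would run for about a week before `get_stage_metrics` is called.

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]
# Maximum tolerated drop vs. control for each guardrail metric
GUARDRAILS = {'revenue_delta': -0.02, 'retention_7d_delta': -0.01}

def run_gradual_rollout(get_stage_metrics, set_traffic_fraction):
    """
    Advance traffic stage by stage; roll back to 0% if any guardrail breaches.
    `get_stage_metrics(fraction)` is assumed to return metric deltas vs.
    control after the stage has run long enough to be meaningful.
    """
    for fraction in ROLLOUT_STAGES:
        set_traffic_fraction(fraction)
        metrics = get_stage_metrics(fraction)

        breached = [m for m, floor in GUARDRAILS.items()
                    if metrics.get(m, 0.0) < floor]
        if breached:
            set_traffic_fraction(0.0)  # automatic rollback
            return {'status': 'rolled_back',
                    'at_fraction': fraction,
                    'breached': breached}

    return {'status': 'fully_rolled_out'}
```

The key property: a regression that only shows up at scale (say, at the 25% stage) never reaches 100% of users, and the rollback needs no human in the loop.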

Monitoring Dashboard

# Key metrics to track for search ranking
RANKING_METRICS = {
    'immediate': [
        'click_through_rate',
        'queries_per_session',
        'search_result_position_clicked',
    ],
    'short_term': [
        'conversion_rate',
        'average_order_value',
        'items_per_order',
    ],
    'long_term': [
        'user_retention_7d',
        'user_retention_30d',
        'customer_lifetime_value',
    ],
    'health': [
        'model_prediction_latency_p99',
        'coverage_rate',    # % queries with results
        'diversity_score',  # Variety of items shown
    ],
}

Key Takeaways

  1. CTR ≠ Value - Optimize for revenue and satisfaction, not clicks
  2. Feedback loops are dangerous - Include exploration
  3. Position bias is real - Correct for it in training
  4. 2+ weeks minimum - Capture weekly patterns, novelty fade
  5. Segment everything - Aggregate success can hide segment harm
  6. Gradual rollout - Stop at each stage to verify

Golden rule: The A/B test metric is rarely the metric that matters. Track what you optimize, but also track what matters to the business.