All Users Got Logged Out
The Interview Question
"At 3 AM, all 10 million users got logged out simultaneously. Customer support is flooded. What happened and how do we prevent this?"
Asked at: Auth0, Okta, large B2C companies
Time to solve: 30-35 minutes
Difficulty: ⭐⭐⭐⭐ (Senior)
Clarifying Questions to Ask
- "What authentication system?" → JWT, session-based, OAuth?
- "Where are sessions stored?" → Redis, database, in-memory?
- "Any deploys or config changes around that time?" → Root cause
- "Are users able to log back in?" → Ongoing outage vs. one-time session invalidation
- "Are some users unaffected?" → Pattern identification
Common Root Causes
Cause 1: JWT Secret Key Rotation
# 🔴 What went wrong:
# Old secret: "super-secret-key-v1"
# New secret: "super-secret-key-v2"  # Deployed without coordination
# Every token signed with the old secret is instantly invalid!
import jwt

# SECRET_KEY and Unauthorized are assumed to come from the application
def verify_token(token):
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    except jwt.InvalidSignatureError:
        # Every user's token fails verification!
        raise Unauthorized("Invalid token")
The Fix: Key Rotation with Grace Period
import jwt
from datetime import datetime, timedelta

# Unauthorized is assumed to be the application's 401 exception

class JWTValidator:
    def __init__(self):
        # key id -> (secret, retired_at); retired_at is None for the signing key
        self.secrets = {
            'v2': ('new-secret-key', None),
            'v1': ('old-secret-key', datetime.utcnow()),  # Retired at rotation time
        }
        self.current_key = 'v2'
        self.grace_period = timedelta(days=7)

    def verify(self, token):
        # Read the key version from the unverified token header
        header = jwt.get_unverified_header(token)
        key_id = header.get('kid', 'v1')  # Tokens without a kid predate rotation
        entry = self.secrets.get(key_id)
        if entry is None:
            raise Unauthorized("Unknown key version")
        secret, _retired_at = entry
        # Verify with the matching key
        return jwt.decode(token, secret, algorithms=['HS256'])

    def create_token(self, payload):
        # Always sign with the newest key; kid belongs in the JWT header
        secret = self.secrets[self.current_key][0]
        return jwt.encode(payload, secret, algorithm='HS256',
                          headers={'kid': self.current_key})

    def cleanup_old_keys(self):
        """Remove keys retired longer ago than the grace period."""
        cutoff = datetime.utcnow() - self.grace_period
        self.secrets = {
            k: v for k, v in self.secrets.items()
            if v[1] is None or v[1] > cutoff
        }
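A grace period only helps if rotation happens in a fixed order: every verifier learns the new key, then signing switches to it, then the old key is retired once its longest-lived token has expired. The sketch below illustrates that ordering with the standard library only, using a plain HMAC over an opaque payload rather than real JWTs. `KeyRing` and its method names are hypothetical and not part of PyJWT.

```python
import hashlib
import hmac


class KeyRing:
    """Toy signer/verifier that keeps several keys valid at once."""

    def __init__(self, key_id: str, secret: bytes):
        self.keys = {key_id: secret}   # every key accepted for verification
        self.signing_key_id = key_id   # the only key used for new tokens

    def add_key(self, key_id: str, secret: bytes) -> None:
        # Phase 1: verifiers learn the new key; nothing is signed with it yet.
        self.keys[key_id] = secret

    def promote(self, key_id: str) -> None:
        # Phase 2: new tokens use the new key; old tokens still verify.
        if key_id not in self.keys:
            raise KeyError(key_id)
        self.signing_key_id = key_id

    def retire(self, key_id: str) -> None:
        # Phase 3: run only after the old key's longest-lived token has expired.
        if key_id == self.signing_key_id:
            raise ValueError("cannot retire the active signing key")
        self.keys.pop(key_id, None)

    def _mac(self, secret: bytes, payload: str) -> str:
        return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()

    def sign(self, payload: str) -> str:
        mac = self._mac(self.keys[self.signing_key_id], payload)
        return f"{self.signing_key_id}.{payload}.{mac}"

    def verify(self, token: str) -> str:
        key_id, rest = token.split(".", 1)
        payload, mac = rest.rsplit(".", 1)
        secret = self.keys.get(key_id)
        if secret is None:
            raise ValueError("unknown key id")
        if not hmac.compare_digest(mac, self._mac(secret, payload)):
            raise ValueError("bad signature")
        return payload
```

In this sketch, a token signed under `v1` keeps verifying after `add_key("v2", ...)` and `promote("v2")`, and only stops verifying once `retire("v1")` runs.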
Cause 2: Redis Session Store Flushed
# 🔴 Someone ran this (accidentally or maliciously):
redis-cli FLUSHALL
# All 10M sessions deleted instantly!
# Or Redis restarted without persistence:
# - No AOF
# - No RDB snapshot
# Sessions existed only in memory → gone
The Fix: Protected Redis with Persistence
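# redis.conf
# Note: comments must sit on their own line; Redis rejects trailing comments after a directive.
# Enable AOF persistence
appendonly yes
# fsync once per second (balance of safety and performance)
appendfsync everysec
# Disable dangerous commands (on Redis 6+, ACLs are the preferred mechanism)
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command DEBUG ""
# Require authentication
requirepass "strong-password"

# sentinel.conf (Sentinel for HA; these directives belong in Sentinel's own config file)
# A hostname here needs "sentinel resolve-hostnames yes" (Redis 6.2+); otherwise use an IP
sentinel monitor mymaster redis-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000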
# Application: Handle Redis failures gracefully
import logging
from functools import wraps
import redis

logger = logging.getLogger(__name__)

def session_resilient(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except redis.RedisError as e:
            logger.error(f"Redis error: {e}")
            # Fallback: create a temporary session
            # or return cached data if available.
            # handle_session_failure() is application-specific.
            return handle_session_failure()
    return wrapper

@session_resilient
def get_session(session_id):
    # redis_client is assumed to be a configured redis.Redis instance
    return redis_client.get(f"session:{session_id}")
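It can also help to check what a running instance actually reports rather than trusting the config file. The sketch below inspects a plain dict of settings, such as the mapping redis-py returns from `config_get()`. The function name `audit_session_store_config` and the checks it runs are illustrative assumptions, not a complete hardening list.

```python
def audit_session_store_config(config: dict) -> list:
    """Return human-readable warnings for risky session-store settings."""
    warnings = []
    aof_enabled = config.get("appendonly", "no") == "yes"
    # An empty "save" value means RDB snapshots are disabled.
    rdb_enabled = bool(config.get("save", "").strip())

    if not aof_enabled and not rdb_enabled:
        warnings.append("no persistence: a restart loses every session")
    elif not aof_enabled:
        warnings.append("RDB only: sessions written since the last snapshot can be lost")

    if not config.get("requirepass"):
        warnings.append("no password: any client that can connect can flush data")
    return warnings


if __name__ == "__main__":
    # Example input shaped like a CONFIG GET result (values are made up).
    for line in audit_session_store_config(
        {"appendonly": "no", "save": "", "requirepass": ""}
    ):
        print("WARNING:", line)
```

A check like this could run at deploy time or as a periodic health probe, so a silently unprotected store is caught before an incident.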
Cause 3: Cookie Domain/Path Misconfiguration
# 🔴 Deploy changed cookie settings:
# Before: domain=".example.com", path="/"
# After:  domain="www.example.com", path="/app"
# Cookies issued after the deploy are scoped too narrowly: requests to other
# subdomains or paths arrive without a session cookie, so users appear logged out.
# The fix: treat cookie name, domain, and path as a compatibility contract
response.set_cookie(
    'session_id',
    value=session_id,
    domain='.example.com',  # Parent domain so subdomains share the cookie
    path='/',               # Root path
    httponly=True,
    secure=True,
    samesite='Lax',
    max_age=86400 * 30      # 30 days
)
# Rollback: Immediately redeploy with old cookie settings
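Whether a browser attaches a cookie to a request depends on domain and path matching, so a scope change can be checked before it ships. The sketch below loosely follows the matching rules in RFC 6265. It ignores IP hosts, host-only cookies, and the Secure and SameSite attributes, and the function names are illustrative.

```python
def domain_matches(cookie_domain: str, request_host: str) -> bool:
    """Domain-match in the style of RFC 6265 section 5.1.3 (leading dot ignored)."""
    domain = cookie_domain.lstrip(".").lower()
    host = request_host.lower()
    return host == domain or host.endswith("." + domain)


def path_matches(cookie_path: str, request_path: str) -> bool:
    """Path-match in the style of RFC 6265 section 5.1.4."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # "/app" matches "/app/home" but not "/apple".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def cookie_is_sent(cookie_domain: str, cookie_path: str,
                   request_host: str, request_path: str) -> bool:
    """True when a cookie with this scope would accompany the request."""
    return (domain_matches(cookie_domain, request_host)
            and path_matches(cookie_path, request_path))


if __name__ == "__main__":
    requests = [("www.example.com", "/app/home"),
                ("www.example.com", "/login"),
                ("api.example.com", "/v1/me")]
    for host, path in requests:
        wide = cookie_is_sent(".example.com", "/", host, path)
        narrow = cookie_is_sent("www.example.com", "/app", host, path)
        print(f"{host}{path}: wide scope={wide}, narrow scope={narrow}")
```

Running the same request list against the old and new scope makes it easy to see which endpoints would stop receiving the session cookie.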
Cause 4: Time Synchronization Issue
# 🔴 Server clocks drifted, all tokens appear "expired"
# JWT validation:
import time
from datetime import timedelta
import jwt

def verify_token(token):
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    # PyJWT already checks exp by default; the check is spelled out here for clarity
    if payload['exp'] < time.time():  # Server time is 1 hour ahead!
        raise Unauthorized("Token expired")
    return payload

# The fix: Use NTP and add clock skew tolerance
def verify_token(token):
    return jwt.decode(
        token,
        SECRET_KEY,
        algorithms=['HS256'],
        leeway=timedelta(minutes=5)  # Allow 5 min clock skew
    )

# Ensure NTP is running on all servers (shell)
sudo systemctl enable chronyd
sudo systemctl start chronyd
chronyc tracking  # Verify sync
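Leeway and time synchronization cover different failure sizes, which a little arithmetic makes visible. The sketch below assumes a 15-minute token and two made-up skew values. `is_expired` is a hypothetical helper whose comparison is modeled on a leeway-adjusted `exp` check, not a library function.

```python
def is_expired(exp: float, server_now: float, leeway_seconds: float = 0.0) -> bool:
    """A token counts as expired once server time minus leeway reaches exp."""
    return exp <= server_now - leeway_seconds


FIVE_MINUTES = 300

issued_at = 1_700_000_000        # arbitrary example timestamp
exp = issued_at + 15 * 60        # 15-minute token

# True time is 14 minutes after issue, but this server's clock runs 2 minutes fast.
small_skew_now = issued_at + 14 * 60 + 120
# True time is 1 minute after issue, but this server's clock runs 1 hour fast.
large_skew_now = issued_at + 60 + 3600

small_without_leeway = is_expired(exp, small_skew_now)               # rejected too early
small_with_leeway = is_expired(exp, small_skew_now, FIVE_MINUTES)    # accepted
large_with_leeway = is_expired(exp, large_skew_now, FIVE_MINUTES)    # still rejected
```

Under these assumptions, five minutes of leeway absorbs the two-minute drift but not the one-hour jump, which is why leeway complements clock synchronization rather than replacing it.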
Cause 5: OAuth Provider Outage
# 🔴 Using external OAuth (Google, Facebook, Auth0)
# Provider went down → All token validations fail
# The fix: Cache valid tokens locally
import logging
import time

logger = logging.getLogger(__name__)

# ProviderUnavailableError and Unauthorized are assumed to be defined by the application
class CachedTokenValidator:
    def __init__(self, oauth_provider, cache_ttl=300):
        self.provider = oauth_provider
        self.cache = {}
        self.cache_ttl = cache_ttl

    def validate(self, token):
        # Check local cache first
        cached = self.cache.get(token)
        if cached and cached['expires_at'] > time.time():
            return cached['user_info']
        try:
            # Validate with provider
            user_info = self.provider.validate(token)
            # Cache the result
            self.cache[token] = {
                'user_info': user_info,
                'expires_at': time.time() + self.cache_ttl
            }
            return user_info
        except ProviderUnavailableError:
            # Provider down: fall back to the cached entry even if its TTL has passed.
            # Trade-off: a revoked token stays usable for the length of the outage.
            if cached:
                logger.warning("Using cached token validation - provider down")
                return cached['user_info']
            # No cache, provider down: this version fails closed.
            # Failing open is a risk decision, not a default.
            raise Unauthorized("Unable to validate token")
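One gap in a cache-based fallback is that a stale entry could be honored for as long as the provider stays down. The sketch below shows a bounded variant, assuming a provider object with a `validate(token)` method and a `ProviderUnavailableError` exception as in the text. The `max_stale` limit, the hashed cache key, and the injectable `clock` are additions for illustration.

```python
import hashlib
import time


class ProviderUnavailableError(Exception):
    """Stand-in for the provider client's outage exception."""


class BoundedStaleValidator:
    def __init__(self, provider, cache_ttl=300, max_stale=900, clock=time.time):
        self.provider = provider
        self.cache_ttl = cache_ttl   # seconds an entry is served without asking the provider
        self.max_stale = max_stale   # extra seconds an entry may be used during an outage
        self.clock = clock
        self.cache = {}

    @staticmethod
    def _key(token: str) -> str:
        # Store a digest so raw bearer tokens are not kept as dict keys.
        return hashlib.sha256(token.encode()).hexdigest()

    def validate(self, token: str):
        key, now = self._key(token), self.clock()
        entry = self.cache.get(key)
        if entry and now < entry["fresh_until"]:
            return entry["user_info"]
        try:
            user_info = self.provider.validate(token)
        except ProviderUnavailableError:
            if entry and now < entry["fresh_until"] + self.max_stale:
                return entry["user_info"]  # degraded mode, bounded in time
            raise PermissionError("unable to validate token")  # fail closed
        self.cache[key] = {"user_info": user_info,
                           "fresh_until": now + self.cache_ttl}
        return user_info
```

With the defaults here, a token validated just before an outage keeps working for at most 20 minutes (5 fresh plus 15 stale), after which requests fail closed until the provider returns.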
Cause 6: Database Migration Gone Wrong
-- 🔴 Migration dropped the sessions table
DROP TABLE IF EXISTS sessions; -- Oops, this was production!
CREATE TABLE sessions (...);
-- Or truncated instead of selective delete
TRUNCATE TABLE sessions; -- All 10M sessions gone!
The Fix: Safe Migrations
# Always use soft deletes for session data
from datetime import datetime
from sqlalchemy import Column, DateTime, JSON, String

# Base is assumed to be the app's declarative base with a Flask-SQLAlchemy style `query`
class Session(Base):
    __tablename__ = 'sessions'
    id = Column(String, primary_key=True)
    user_id = Column(String, nullable=False)
    data = Column(JSON)
    created_at = Column(DateTime, default=datetime.utcnow)
    expires_at = Column(DateTime)
    deleted_at = Column(DateTime, nullable=True)  # Soft delete

    @classmethod
    def get_active(cls, session_id):
        # Only sessions that are neither soft-deleted nor expired
        return cls.query.filter_by(
            id=session_id,
            deleted_at=None
        ).filter(
            cls.expires_at > datetime.utcnow()
        ).first()
# Migration checklist
migration_safety:
- [ ] Test migration on staging with production data copy
- [ ] Run the migration in a transaction where the database supports transactional DDL (roll back on error)
- [ ] Never DROP or TRUNCATE in production migrations
- [ ] Have an immediate rollback plan ready
- [ ] Schedule during low-traffic window
- [ ] Monitor session counts during/after migration
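Part of this checklist can be enforced mechanically, for example by scanning migration SQL in CI before it reaches production. The sketch below is a deliberately simple regex check. `PROTECTED_TABLES` and `find_destructive_statements` are hypothetical names, and the pattern does not handle schema-qualified table names or statements inside SQL comments.

```python
import re

# Assumption: tables that must never be dropped or truncated by a migration.
PROTECTED_TABLES = {"sessions"}

DESTRUCTIVE = re.compile(
    r'\b(DROP\s+TABLE(?:\s+IF\s+EXISTS)?|TRUNCATE(?:\s+TABLE)?)\s+[`"]?(\w+)',
    re.IGNORECASE,
)


def find_destructive_statements(sql: str) -> list:
    """Return (statement, table) pairs that target a protected table."""
    hits = []
    for match in DESTRUCTIVE.finditer(sql):
        statement = " ".join(match.group(1).upper().split())  # normalize whitespace
        table = match.group(2).lower()
        if table in PROTECTED_TABLES:
            hits.append((statement, table))
    return hits


if __name__ == "__main__":
    migration = """
    DROP TABLE IF EXISTS sessions;
    CREATE TABLE sessions (id TEXT PRIMARY KEY);
    """
    for statement, table in find_destructive_statements(migration):
        print(f"BLOCKED: {statement} on protected table '{table}'")
```

A CI job could fail the build whenever this returns a non-empty list, leaving the rare legitimate case to an explicit, reviewed override.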
Emergency Recovery Procedure
# mass_logout_recovery.py
# Emergency recovery when all users are logged out.

# 1. Identify and fix the root cause first!
#    (JWT key, Redis, cookies, etc.)

# 2. If sessions are lost, allow re-authentication with reduced friction
@app.route('/login', methods=['POST'])
def login_with_reduced_friction():
    # Skip MFA for the next 24 hours only if the user has a trusted device,
    # or presents a "remember this device" token
    user = authenticate_basic(request.form)
    # device_fingerprint is assumed to be attached by app middleware;
    # Flask's request object has no such attribute by default
    if user.has_trusted_device(request.device_fingerprint):
        # Skip MFA - device was previously trusted
        return create_session(user)
    else:
        return require_mfa(user)

def recover_from_mass_logout():
    """Run once the root cause is fixed and the login path above is live."""
    # 3. Proactively notify users
    send_notification_to_all_users(
        "We experienced a brief authentication issue. "
        "You may need to log in again. We apologize for the inconvenience."
    )
    # 4. Monitor login success rate
    track_metric('login_success_rate')
    track_metric('support_tickets_auth')
Preventive Architecture
Monitoring & Alerts
# Alert rules for auth health
groups:
  - name: auth_alerts
    rules:
      - alert: MassLogoutDetected
        # active_sessions is a gauge, so use deriv(); rate() is for counters and is never negative
        expr: |
          deriv(active_sessions[5m]) < -1000  # Losing 1000+ sessions/sec
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Potential mass logout event"
      - alert: LoginSpikeAfterDrop
        expr: |
          rate(login_attempts[5m]) > 10 * avg_over_time(rate(login_attempts[1h])[1d:1h])
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual login spike - possible auth issue"
      - alert: SessionStoreUnhealthy
        expr: redis_up == 0 or redis_connected_slaves < 1
        for: 30s
        labels:
          severity: critical
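The mass-logout alert depends on the slope of a gauge that can go down, which a counter-style rate cannot express. The sketch below shows the underlying calculation as a least-squares slope over recent samples, similar in spirit to PromQL's `deriv()`. The function name and the sample data are made up for illustration.

```python
def per_second_slope(samples):
    """Least-squares slope of (timestamp, value) pairs, in value units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    return covariance / variance


# Hypothetical session-count samples taken one minute apart.
draining = [(0, 10_000_000), (60, 9_000_000), (120, 8_000_000)]
steady = [(0, 10_000_000), (60, 10_000_000), (120, 10_000_000)]

draining_slope = per_second_slope(draining)  # strongly negative
steady_slope = per_second_slope(steady)      # zero
```

With these samples, the draining series loses sessions far faster than the 1000 per second threshold, while the steady series has a slope of zero and would not fire.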
Key Takeaways
- Never rotate secrets without a grace period - Accept both old and new keys during the overlap
- Protect session stores - Disable FLUSHALL, enable persistence
- Test auth changes extensively - Cookie changes are subtle
- Have fallbacks - Cache tokens locally for provider outages
- Monitor session counts - Alert on unusual drops
- Document recovery procedures - Know exactly what to do at 3 AM
Golden rule: Auth failures affect 100% of users instantly. Test more, deploy less, monitor constantly.