
All Users Got Logged Out

The Interview Question

"At 3 AM, all 10 million users got logged out simultaneously. Customer support is flooded. What happened and how do we prevent this?"

Asked at: Auth0, Okta, large B2C companies

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐⭐ (Senior)


Clarifying Questions to Ask

  1. "What authentication system?" → JWT, session-based, OAuth?
  2. "Where are sessions stored?" → Redis, database, in-memory?
  3. "Any deploys or config changes around that time?" → Root cause
  4. "Are users able to log back in?" → Ongoing outage vs. one-time disruption
  5. "Are some users unaffected?" → Pattern identification

Common Root Causes

Cause 1: JWT Secret Key Rotation

# 🔴 What went wrong:
# Old secret: "super-secret-key-v1"
# New secret: "super-secret-key-v2" # Deployed without coordination

# All existing tokens instantly invalid!
def verify_token(token):
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    except jwt.InvalidSignatureError:
        # Every user's token fails verification!
        raise Unauthorized("Invalid token")

The Fix: Key Rotation with Grace Period

import jwt
from datetime import datetime, timedelta

# Unauthorized is the application's own exception type.

class JWTValidator:
    def __init__(self):
        # kid -> (secret, retired_at); retired_at is None for the signing key
        self.current_kid = 'v2'
        self.secrets = {
            'v2': ('new-secret-key', None),
            'v1': ('old-secret-key', datetime.utcnow()),  # Retired at rotation
        }
        # Must be at least as long as the longest token lifetime
        self.grace_period = timedelta(days=7)

    def verify(self, token):
        # Get key version from the (unverified) token header
        header = jwt.get_unverified_header(token)
        key_id = header.get('kid', 'v1')  # Default to old key

        entry = self.secrets.get(key_id)
        if entry is None:
            raise Unauthorized("Unknown key version")
        secret, _retired_at = entry

        # Verify with the appropriate key
        return jwt.decode(token, secret, algorithms=['HS256'])

    def create_token(self, payload):
        # Always sign with newest key and record its version in the header
        secret = self.secrets[self.current_kid][0]
        return jwt.encode(payload, secret, algorithm='HS256',
                          headers={'kid': self.current_kid})

    def cleanup_old_keys(self):
        """Remove keys retired longer ago than the grace period."""
        cutoff = datetime.utcnow() - self.grace_period
        self.secrets = {
            k: v for k, v in self.secrets.items()
            if v[1] is None or v[1] > cutoff
        }
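
The order of the rollout matters as much as the validator itself: every server has to be able to verify the new key before any server starts signing with it. The sketch below walks through that three-phase rollout using only the standard library. It uses a simplified HMAC token rather than a real JWT, and `KeyRing`, `sign`, and `verify` are hypothetical names for illustration, not part of any library.

```python
import base64
import hashlib
import hmac
import json


class KeyRing:
    """Hypothetical key ring: one signing key plus any number of verify-only keys."""

    def __init__(self, keys, signing_kid):
        self.keys = dict(keys)          # kid -> secret bytes
        self.signing_kid = signing_kid  # kid used for newly issued tokens

    def _mac(self, kid, body):
        msg = f"{kid}.{body}".encode()
        return hmac.new(self.keys[kid], msg, hashlib.sha256).hexdigest()

    def sign(self, payload):
        raw = json.dumps(payload, sort_keys=True).encode()
        body = base64.urlsafe_b64encode(raw).decode()
        return f"{self.signing_kid}.{body}.{self._mac(self.signing_kid, body)}"

    def verify(self, token):
        kid, body, mac = token.split(".")
        if kid not in self.keys:
            raise ValueError(f"unknown key id: {kid}")
        if not hmac.compare_digest(mac, self._mac(kid, body)):
            raise ValueError("bad signature")
        return json.loads(base64.urlsafe_b64decode(body))


# A token issued before the rotation started.
old_token = KeyRing({"v1": b"old"}, signing_kid="v1").sign({"sub": "alice"})

# Phase 1: ship v2 as verify-only everywhere; keep signing with v1.
phase1 = KeyRing({"v1": b"old", "v2": b"new"}, signing_kid="v1")
# Phase 2: once phase 1 is fully deployed, switch signing to v2.
phase2 = KeyRing({"v1": b"old", "v2": b"new"}, signing_kid="v2")
# Phase 3: after the longest token lifetime has passed, drop v1.
phase3 = KeyRing({"v2": b"new"}, signing_kid="v2")

# During a mixed deploy, a phase-1 server accepts tokens from a phase-2 server.
mixed_ok = phase1.verify(phase2.sign({"sub": "bob"}))
# Existing users stay logged in throughout phases 1 and 2.
still_valid = phase2.verify(old_token)
```

Skipping phase 1 is what produces the 3 AM incident: servers that have not yet received the new key reject every freshly signed token, and servers that already dropped the old key reject every existing one.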

Cause 2: Redis Session Store Flushed

# 🔴 Someone ran this (accidentally or maliciously):
redis-cli FLUSHALL
# All 10M sessions deleted instantly!

# Or Redis restarted without persistence:
# - No AOF
# - No RDB snapshot
# Sessions existed only in memory → gone

The Fix: Protected Redis with Persistence

# redis.conf (comments must be on their own line)
# Enable AOF persistence
appendonly yes
# Sync every second (balance of safety/perf)
appendfsync everysec
# Disable dangerous commands
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command DEBUG ""
# Require authentication
requirepass "strong-password"

# sentinel.conf: Sentinel for HA
sentinel monitor mymaster redis-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000

# Application: Handle Redis failures gracefully
import logging
from functools import wraps
import redis

logger = logging.getLogger(__name__)

def session_resilient(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except redis.RedisError as e:
            logger.error(f"Redis error: {e}")
            # Fallback: Create temporary session
            # Or return cached data if available
            return handle_session_failure()
    return wrapper

@session_resilient
def get_session(session_id):
    return redis_client.get(f"session:{session_id}")
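
Persistence settings can be lost silently when an instance is rebuilt or a config is overwritten, so it may be worth checking them at application startup. The sketch below assumes a client that exposes redis-py's `config_get(pattern)` method returning a dict such as `{"appendonly": "yes"}`. `persistence_problems` and `FakeClient` are hypothetical names, and the check fails with an error if the `CONFIG` command has been renamed or restricted on the server.

```python
def persistence_problems(client):
    """Return a list of problems; an empty list means persistence looks enabled.

    `client` is assumed to behave like redis.Redis with respect to
    config_get(pattern) -> dict.
    """
    problems = []
    aof = client.config_get("appendonly").get("appendonly")
    rdb = client.config_get("save").get("save", "")
    if aof != "yes" and not rdb.strip():
        problems.append("neither AOF nor RDB snapshots are enabled")
    elif aof != "yes":
        problems.append("AOF is disabled; a restart can lose writes since the last snapshot")
    return problems


class FakeClient:
    """Stand-in for a real client so the sketch runs without a Redis server."""

    def __init__(self, config):
        self.config = config

    def config_get(self, pattern):
        return {pattern: self.config[pattern]} if pattern in self.config else {}


memory_only = persistence_problems(FakeClient({"appendonly": "no", "save": ""}))
snapshot_only = persistence_problems(FakeClient({"appendonly": "no", "save": "3600 1"}))
aof_enabled = persistence_problems(FakeClient({"appendonly": "yes", "save": ""}))
```

In a real service the same function would be called with the application's Redis client, and a non-empty result would be logged or turned into a failing health check.
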

Cause 3: Cookie Domain or Path Changed

# 🔴 Deploy changed cookie settings:
# Before: domain=".example.com", path="/"
# After: domain="www.example.com", path="/app"
# All existing cookies no longer match!

# The fix: Be very careful with cookie settings
response.set_cookie(
    'session_id',
    value=session_id,
    domain='.example.com',  # Leading dot for subdomains
    path='/',               # Root path
    httponly=True,
    secure=True,
    samesite='Lax',
    max_age=86400 * 30      # 30 days
)

# Rollback: Immediately redeploy with old cookie settings
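
To see why a scope change is so disruptive, it helps to model how a browser decides whether to send a cookie. The sketch below follows the domain-match and path-match rules from RFC 6265 (sections 5.1.3 and 5.1.4) for a cookie with an explicit Domain attribute. The function names are hypothetical, and real browsers apply further checks (Secure, SameSite, expiry) that are left out here.

```python
import ipaddress


def domain_matches(cookie_domain, request_host):
    """RFC 6265 domain-match; a leading dot in the Domain attribute is ignored."""
    if cookie_domain.startswith("."):
        cookie_domain = cookie_domain[1:]
    cookie_domain = cookie_domain.lower()
    request_host = request_host.lower()
    if request_host == cookie_domain:
        return True
    try:
        ipaddress.ip_address(request_host)
        return False  # IP addresses only ever match exactly
    except ValueError:
        return request_host.endswith("." + cookie_domain)


def path_matches(cookie_path, request_path):
    """RFC 6265 path-match: equal, or a prefix that ends on a '/' boundary."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def cookie_sent(cookie_domain, cookie_path, request_host, request_path):
    """True if a cookie with this scope accompanies a request to host + path."""
    return domain_matches(cookie_domain, request_host) and path_matches(cookie_path, request_path)


# Old scope: visible to every subdomain and every path.
old_on_api = cookie_sent(".example.com", "/", "api.example.com", "/login")
# New scope: invisible to other subdomains and to paths outside /app.
new_on_api = cookie_sent("www.example.com", "/app", "api.example.com", "/app")
new_on_login = cookie_sent("www.example.com", "/app", "www.example.com", "/login")
new_on_app = cookie_sent("www.example.com", "/app", "www.example.com", "/app/settings")
```

With the narrower scope, any request to `api.example.com` or to a path outside `/app` arrives without the session cookie, so those endpoints treat the user as logged out.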

Cause 4: Time Synchronization Issue

# 🔴 Server clocks drifted, all tokens appear "expired"
# JWT validation:
def verify_token(token):
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])

    if payload['exp'] < time.time():  # Server time is 1 hour ahead!
        raise Unauthorized("Token expired")

    return payload

# The fix: Use NTP and add clock skew tolerance
def verify_token(token):
    return jwt.decode(
        token,
        SECRET_KEY,
        algorithms=['HS256'],
        leeway=timedelta(minutes=5)  # Allow 5 min clock skew
    )

# Ensure NTP is running on all servers
sudo systemctl enable chronyd
sudo systemctl start chronyd
chronyc tracking # Verify sync
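
Leeway and NTP solve different sizes of problem, which a small worked example makes concrete. The sketch below is a standalone expiry check with a skew allowance, written with the standard library only. `is_expired` is a hypothetical helper that mirrors the idea behind the `leeway` parameter rather than reproducing any library's implementation.

```python
import time


def is_expired(exp, now=None, leeway_seconds=0):
    """Return True if `exp` (Unix seconds) has passed, allowing for clock skew."""
    now = time.time() if now is None else now
    return exp <= now - leeway_seconds


token_exp = 10_000  # token is valid until t=10000 on a correct clock

# Real time is t=9900, so the token has 100 seconds left.
correct_clock = is_expired(token_exp, now=9_900)
# Server clock runs 2 minutes fast: the token looks expired without leeway...
fast_clock_strict = is_expired(token_exp, now=9_900 + 120)
# ...but a 5-minute leeway absorbs the drift.
fast_clock_lenient = is_expired(token_exp, now=9_900 + 120, leeway_seconds=300)
# A clock that is 1 hour ahead is beyond what leeway can cover.
hour_ahead = is_expired(token_exp, now=9_900 + 3_600, leeway_seconds=300)
```

A few minutes of leeway covers ordinary drift between servers. An hour-sized jump still expires every token, which is why the NTP check above is part of the fix rather than an optional extra.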

Cause 5: OAuth Provider Outage

# 🔴 Using external OAuth (Google, Facebook, Auth0)
# Provider went down → All token validations fail

# The fix: Cache valid tokens locally
class CachedTokenValidator:
    def __init__(self, oauth_provider, cache_ttl=300):
        self.provider = oauth_provider
        self.cache = {}
        self.cache_ttl = cache_ttl

    def validate(self, token):
        # Check local cache first
        cached = self.cache.get(token)
        if cached and cached['expires_at'] > time.time():
            return cached['user_info']

        try:
            # Validate with provider
            user_info = self.provider.validate(token)

            # Cache the result
            self.cache[token] = {
                'user_info': user_info,
                'expires_at': time.time() + self.cache_ttl
            }
            return user_info

        except ProviderUnavailableError:
            # Provider down - use cached data if available, even if stale
            if cached:
                logger.warning("Using cached token validation - provider down")
                return cached['user_info']

            # No cache, provider down - this version fails closed;
            # whether to fail open instead depends on risk
            raise Unauthorized("Unable to validate token")
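
One trade-off with serving cached validations during an outage is that a revoked token stays usable for as long as the stale entry is honored. A possible refinement is to cap how stale an entry may be. The sketch below is one way to do that under stated assumptions: `BoundedStaleCache`, `ProviderDown`, and the `max_stale` parameter are hypothetical names, and the clock is injectable only so the behavior can be demonstrated without waiting.

```python
import time


class ProviderDown(Exception):
    """Hypothetical error raised when the identity provider is unreachable."""


class BoundedStaleCache:
    def __init__(self, validate_remote, fresh_ttl=300, max_stale=3600, clock=time.time):
        self.validate_remote = validate_remote  # callable: token -> user_info
        self.fresh_ttl = fresh_ttl              # seconds an entry is served without re-checking
        self.max_stale = max_stale              # hard cap on entry age during an outage
        self.clock = clock
        self.entries = {}                       # token -> (user_info, validated_at)

    def validate(self, token):
        now = self.clock()
        entry = self.entries.get(token)
        if entry and now - entry[1] < self.fresh_ttl:
            return entry[0]
        try:
            user_info = self.validate_remote(token)
        except ProviderDown:
            if entry and now - entry[1] < self.max_stale:
                return entry[0]  # outage: serve stale, but only within the cap
            raise                # too old or never seen: fail closed
        self.entries[token] = (user_info, now)
        return user_info


# Demonstration with a fake provider and a controllable clock.
clock = {"now": 0.0}
provider = {"up": True}


def fake_remote(token):
    if not provider["up"]:
        raise ProviderDown()
    return {"user": token.upper()}


cache = BoundedStaleCache(fake_remote, clock=lambda: clock["now"])
first = cache.validate("abc")   # provider up: validated and cached at t=0
provider["up"] = False
clock["now"] = 600              # past fresh_ttl, still within max_stale
stale = cache.validate("abc")   # served from the stale entry
```

Once the clock passes `max_stale`, the same call raises `ProviderDown` instead of returning the old result. A production version would also bound the cache size and key entries by a hash of the token rather than the raw token.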

Cause 6: Database Migration Gone Wrong

-- 🔴 Migration dropped the sessions table
DROP TABLE IF EXISTS sessions; -- Oops, this was production!
CREATE TABLE sessions (...);

-- Or truncated instead of selective delete
TRUNCATE TABLE sessions; -- All 10M sessions gone!

The Fix: Safe Migrations

# Always use soft deletes for session data
class Session(Base):
    __tablename__ = 'sessions'

    id = Column(String, primary_key=True)
    user_id = Column(String, nullable=False)
    data = Column(JSON)
    created_at = Column(DateTime, default=datetime.utcnow)
    expires_at = Column(DateTime)
    deleted_at = Column(DateTime, nullable=True)  # Soft delete

    @classmethod
    def get_active(cls, session_id):
        return cls.query.filter_by(
            id=session_id,
            deleted_at=None
        ).filter(
            cls.expires_at > datetime.utcnow()
        ).first()

# Migration checklist
migration_safety:
  - [ ] Test migration on staging with production data copy
  - [ ] Run migration in transaction (rollback on error)
  - [ ] Never DROP or TRUNCATE in production migrations
  - [ ] Have immediate rollback plan ready
  - [ ] Schedule during low-traffic window
  - [ ] Monitor session counts during/after migration
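
The "never DROP or TRUNCATE" item is easy to enforce mechanically, for example as a CI step that scans migration files before they can merge. The sketch below is a deliberately simple regex-based check. `PROTECTED_TABLES` and `find_destructive_statements` are hypothetical names, and the pattern does not handle schema-qualified names or block comments.

```python
import re

# Assumption: tables that must never be dropped or truncated by a migration.
PROTECTED_TABLES = {"sessions"}

DESTRUCTIVE = re.compile(
    r'\b(DROP\s+TABLE(?:\s+IF\s+EXISTS)?|TRUNCATE(?:\s+TABLE)?)\s+[`"]?(\w+)[`"]?',
    re.IGNORECASE,
)


def find_destructive_statements(sql):
    """Return (statement_kind, table) pairs that touch a protected table."""
    hits = []
    for line in sql.splitlines():
        code = line.split("--", 1)[0]  # ignore SQL line comments
        for match in DESTRUCTIVE.finditer(code):
            kind = " ".join(match.group(1).upper().split())
            table = match.group(2).lower()
            if table in PROTECTED_TABLES:
                hits.append((kind, table))
    return hits


migration = """
DROP TABLE IF EXISTS sessions; -- Oops, this was production!
CREATE TABLE sessions (id TEXT);
TRUNCATE TABLE sessions;
DROP TABLE temp_import;
-- TRUNCATE sessions (only mentioned in a comment)
"""

violations = find_destructive_statements(migration)
```

A CI job could fail the build whenever `violations` is non-empty and require an explicit, reviewed override for the rare migration that genuinely needs a destructive statement.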

Emergency Recovery Procedure

# mass_logout_recovery.py

def recover_from_mass_logout():
    """
    Emergency recovery when all users logged out.
    """

    # 1. Identify and fix root cause first!
    # (JWT key, Redis, cookies, etc.)

    # 2. If sessions are lost, allow re-authentication with reduced friction
    @app.route('/login', methods=['POST'])
    def login_with_reduced_friction():
        # Skip MFA for next 24 hours if user has trusted device
        # Or use "remember this device" token

        user = authenticate_basic(request.form)

        if user.has_trusted_device(request.device_fingerprint):
            # Skip MFA - device was previously trusted
            return create_session(user)
        else:
            return require_mfa(user)

    # 3. Proactively notify users
    send_notification_to_all_users(
        "We experienced a brief authentication issue. "
        "You may need to log in again. We apologize for the inconvenience."
    )

    # 4. Monitor login success rate
    track_metric('login_success_rate')
    track_metric('support_tickets_auth')

Preventive Architecture


Monitoring & Alerts

# Alert rules for auth health
groups:
  - name: auth_alerts
    rules:
      - alert: MassLogoutDetected
        # active_sessions is a gauge, so use deriv(); rate() is for counters
        # and never returns a negative value
        expr: |
          deriv(active_sessions[5m]) < -1000  # Losing 1000+ sessions/sec
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Potential mass logout event"

      - alert: LoginSpikeAfterDrop
        expr: |
          rate(login_attempts[5m]) > 10 * avg_over_time(rate(login_attempts[1h])[1d:1h])
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Unusual login spike - possible auth issue"

      - alert: SessionStoreUnhealthy
        expr: redis_up == 0 or redis_connected_slaves < 1
        for: 30s
        labels:
          severity: critical
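
A session count is a gauge that can fall as well as rise, so a mass logout shows up as a steep negative slope. The sketch below computes a least-squares slope over recent samples, which is the same idea Prometheus's `deriv()` function applies to a gauge. `slope_per_second` and the sample values are illustrative assumptions, not output from a real system.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


# One sample per minute over five minutes; all sessions vanish after t=180.
incident = [
    (0, 10_000_000), (60, 10_000_000), (120, 10_000_000),
    (180, 10_000_000), (240, 0), (300, 0),
]
# Normal churn: the count drifts down slowly.
normal = [(0, 10_000_000), (60, 9_999_400), (120, 9_998_800)]

incident_slope = slope_per_second(incident)
normal_slope = slope_per_second(normal)

THRESHOLD = -1000  # sessions per second, matching the alert rule above
incident_alerts = incident_slope < THRESHOLD
normal_alerts = normal_slope < THRESHOLD
```

The incident series crosses the threshold by a wide margin while the normal series (about 10 sessions lost per second) does not, so the threshold has room to be tuned against real traffic before it pages anyone.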

Key Takeaways

  1. Never rotate secrets without a grace period - Support both old and new
  2. Protect session stores - Disable FLUSHALL, enable persistence
  3. Test auth changes extensively - Cookie changes are subtle
  4. Have fallbacks - Cache tokens locally for provider outages
  5. Monitor session counts - Alert on unusual drops
  6. Document recovery procedures - Know exactly what to do at 3 AM

Golden rule: Auth failures affect 100% of users instantly. Test more, deploy less, monitor constantly.