All Users Got Logged Out

The Interview Question

"At 3 AM, all 10 million users got logged out simultaneously. Customer support is flooded. What happened and how do we prevent this?"

Asked at: Auth0, Okta, large B2C companies

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐⭐ (Senior)

Clarifying Questions to Ask

"What authentication system?" → JWT, session-based, OAuth?
"Where are sessions stored?" → Redis, database, in-memory?
"Any deploys or config changes around that time?" → Root cause
"Are users able to log back in?" → Ongoing outage vs. one-time disruption
"Are some users unaffected?" → Pattern identification

Common Root Causes

Cause 1: JWT Secret Key Rotation

# 🔴 What went wrong:
# Old secret: "super-secret-key-v1"
# New secret: "super-secret-key-v2"  # Deployed without coordination

# All existing tokens instantly invalid!
def verify_token(token):
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    except jwt.InvalidSignatureError:
        # Every user's token fails verification!
        raise Unauthorized("Invalid token")

The Fix: Key Rotation with Grace Period

import jwt
from datetime import datetime, timedelta, timezone

class JWTValidator:
    def __init__(self):
        now = datetime.now(timezone.utc)
        # key id -> (secret, time the key stopped signing; None while current)
        self.secrets = {
            'v2': ('new-secret-key', None),
            'v1': ('old-secret-key', now),  # Retired at rotation time
        }
        self.current_key_id = 'v2'
        self.grace_period = timedelta(days=7)
    
    def verify(self, token):
        # Read the key version from the (still unverified) token header
        header = jwt.get_unverified_header(token)
        key_id = header.get('kid', 'v1')  # Tokens issued before rotation carry no kid
        
        entry = self.secrets.get(key_id)
        if entry is None:
            raise Unauthorized("Unknown key version")
        secret, _retired_at = entry
        
        # Verify with the appropriate key
        return jwt.decode(token, secret, algorithms=['HS256'])
    
    def create_token(self, payload):
        # Always sign with the newest key and record its id in the header
        secret = self.secrets[self.current_key_id][0]
        return jwt.encode(payload, secret, algorithm='HS256',
                          headers={'kid': self.current_key_id})
    
    def cleanup_old_keys(self):
        """Remove retired keys once their grace period has passed."""
        cutoff = datetime.now(timezone.utc) - self.grace_period
        self.secrets = {
            k: v for k, v in self.secrets.items()
            if v[1] is None or v[1] > cutoff
        }
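
To see the grace period working end to end without depending on a JWT library, here is a minimal sketch that uses only the standard library. The `KeyRing` class, the `<kid>.<payload>.<signature>` token layout, and the key values are illustrative assumptions, not a JWT implementation. The only point is that a token signed with the retired key keeps verifying for as long as that key stays in the ring.

```python
import base64
import hashlib
import hmac
import json


class KeyRing:
    """Toy signer: tokens look like '<kid>.<payload>.<signature>' (not real JWT)."""

    def __init__(self, keys, current):
        self.keys = dict(keys)   # key id -> secret bytes
        self.current = current   # key id used to sign new tokens

    def _signature(self, kid, body):
        # The key id is part of the signed data, so it cannot be swapped.
        mac = hmac.new(self.keys[kid], f"{kid}.{body}".encode(), hashlib.sha256)
        return base64.urlsafe_b64encode(mac.digest()).decode()

    def sign(self, payload):
        body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
        return f"{self.current}.{body}.{self._signature(self.current, body)}"

    def verify(self, token):
        kid, body, signature = token.split(".")
        if kid not in self.keys:
            raise ValueError("unknown key version")
        if not hmac.compare_digest(signature, self._signature(kid, body)):
            raise ValueError("bad signature")
        return json.loads(base64.urlsafe_b64decode(body))
```

A token issued by `KeyRing({"v1": b"old-secret"}, current="v1")` still verifies against `KeyRing({"v1": b"old-secret", "v2": b"new-secret"}, current="v2")`. Once `v1` is dropped from the ring, the same token is rejected as an unknown key version, which is exactly the mass-logout case when it happens on day zero.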

Cause 2: Redis Session Store Flushed

# 🔴 Someone ran this (accidentally or maliciously):
redis-cli FLUSHALL
# All 10M sessions deleted instantly!

# Or Redis restarted without persistence:
# - No AOF
# - No RDB snapshot
# Sessions existed only in memory → gone

The Fix: Protected Redis with Persistence

# redis.conf
# (Comments go on their own lines; Redis rejects trailing comments after a directive.)
# Enable AOF persistence
appendonly yes
# fsync once per second (balance of safety/perf)
appendfsync everysec
# Disable dangerous commands
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command DEBUG ""
# Require authentication
requirepass "strong-password"

# sentinel.conf (separate file, read by redis-sentinel): Sentinel for HA
sentinel monitor mymaster redis-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
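
A configuration like this tends to drift over time, so it can help to check it automatically before a deploy. The following is a minimal sketch of such a check, assuming the config is available as a string. `audit_redis_conf` and its warning messages are hypothetical names, and the parser only understands the handful of directives discussed here.

```python
def audit_redis_conf(text):
    """Return a list of warnings for a redis.conf passed in as a string."""
    settings = {}
    renamed = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        parts = line.split()
        name = parts[0].lower()
        if name == "rename-command" and len(parts) >= 3:
            renamed[parts[1].upper()] = parts[2].strip('"')
        elif len(parts) >= 2:
            settings[name] = parts[1].strip('"')

    warnings = []
    if settings.get("appendonly") != "yes":
        warnings.append("AOF persistence is off")
    if "requirepass" not in settings:
        warnings.append("no requirepass set")
    for command in ("FLUSHALL", "FLUSHDB"):
        # A command renamed to the empty string is disabled entirely.
        if renamed.get(command, command) != "":
            warnings.append(f"{command} is still callable")
    return warnings
```

An empty list means the four protections are in place. Anything else is worth failing a CI step over.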

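# Application: Handle Redis failures gracefully
# (redis_client and handle_session_failure are assumed to be defined elsewhere)
import logging
from functools import wraps
import redis

logger = logging.getLogger(__name__)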

def session_resilient(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except redis.RedisError as e:
            logger.error(f"Redis error: {e}")
            # Fallback: Create temporary session
            # Or return cached data if available
            return handle_session_failure()
    return wrapper

@session_resilient
def get_session(session_id):
    return redis_client.get(f"session:{session_id}")

Cause 3: Cookie Domain/Path Misconfiguration

# 🔴 Deploy changed cookie settings:
# Before: domain=".example.com", path="/"
# After:  domain="www.example.com", path="/app"
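# Browsers match cookies by domain and path, so a changed scope can leave
# existing session cookies unsent or shadowed by a new cookie of the same name!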

# The fix: Be very careful with cookie settings
response.set_cookie(
    'session_id',
    value=session_id,
    domain='.example.com',  # Leading dot for subdomains
    path='/',               # Root path
    httponly=True,
    secure=True,
    samesite='Lax',
    max_age=86400 * 30      # 30 days
)

# Rollback: Immediately redeploy with old cookie settings
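
Whether a browser sends a cookie depends on matching the request host and path against the cookie's Domain and Path attributes. The sketch below follows the domain-match and path-match rules of RFC 6265 in simplified form. It skips IP addresses, host-only cookies, and the public suffix list, and the function names are illustrative assumptions.

```python
def domain_matches(host, cookie_domain):
    """Simplified RFC 6265 domain-match: exact host or any subdomain of it."""
    host = host.lower()
    domain = cookie_domain.lower().lstrip(".")  # a leading dot is ignored
    return host == domain or host.endswith("." + domain)


def path_matches(request_path, cookie_path):
    """RFC 6265 path-match: equal, or a prefix that ends on a '/' boundary."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def cookie_is_sent(host, path, cookie_domain, cookie_path):
    """True if a cookie with this Domain/Path would accompany the request."""
    return domain_matches(host, cookie_domain) and path_matches(path, cookie_path)
```

For example, a cookie scoped to `www.example.com` and `/app` accompanies a request for `www.example.com/app/home`, but not one for `www.example.com/checkout` or `api.example.com/`. Running a handful of real URLs through a check like this before changing cookie scope is a cheap way to catch a regression.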

Cause 4: Time Synchronization Issue

# 🔴 Server clocks drifted, all tokens appear "expired"
# JWT validation:
def verify_token(token):
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    
    if payload['exp'] < time.time():  # Server time is 1 hour ahead!
        raise Unauthorized("Token expired")
    
    return payload

# The fix: Use NTP and add clock skew tolerance
def verify_token(token):
    return jwt.decode(
        token, 
        SECRET_KEY, 
        algorithms=['HS256'],
        leeway=timedelta(minutes=5)  # Allow 5 min clock skew
    )

# Ensure NTP is running on all servers
sudo systemctl enable chronyd
sudo systemctl start chronyd
chronyc tracking  # Verify sync
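
The effect of a leeway can be checked with plain arithmetic. The helper below is a hypothetical stand-in for a library's expiry check, not its actual code. Times are Unix timestamps in seconds.

```python
def is_expired(exp, now, leeway_seconds=0):
    """Return True once `now` is at or past `exp` plus the allowed leeway."""
    return now >= exp + leeway_seconds
```

With `exp=1000`, a server whose clock reads `1200` rejects the token without leeway but accepts it with `leeway_seconds=300`. A clock that is a full hour ahead (`now=4600`) still rejects it, so leeway only absorbs small drift and does not replace working time synchronization.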

Cause 5: OAuth Provider Outage

# 🔴 Using external OAuth (Google, Facebook, Auth0)
# Provider went down → All token validations fail

# The fix: Cache valid tokens locally
class CachedTokenValidator:
    def __init__(self, oauth_provider, cache_ttl=300):
        self.provider = oauth_provider
        self.cache = {}
        self.cache_ttl = cache_ttl
    
    def validate(self, token):
        # Check local cache first
        cached = self.cache.get(token)
        if cached and cached['expires_at'] > time.time():
            return cached['user_info']
        
        try:
            # Validate with provider
            user_info = self.provider.validate(token)
            
            # Cache the result
            self.cache[token] = {
                'user_info': user_info,
                'expires_at': time.time() + self.cache_ttl
            }
            return user_info
            
        except ProviderUnavailableError:
            # Provider down - use cached data if available
            if cached:
                logger.warning("Using cached token validation - provider down")
                return cached['user_info']
            
            # No cache, provider down - fail open or closed based on risk
            raise Unauthorized("Unable to validate token")

Cause 6: Database Migration Gone Wrong

-- 🔴 Migration dropped the sessions table
DROP TABLE IF EXISTS sessions;  -- Oops, this was production!
CREATE TABLE sessions (...);

-- Or truncated instead of selective delete
TRUNCATE TABLE sessions;  -- All 10M sessions gone!

The Fix: Safe Migrations

# Always use soft deletes for session data
class Session(Base):
    __tablename__ = 'sessions'
    
    id = Column(String, primary_key=True)
    user_id = Column(String, nullable=False)
    data = Column(JSON)
    created_at = Column(DateTime, default=datetime.utcnow)
    expires_at = Column(DateTime)
    deleted_at = Column(DateTime, nullable=True)  # Soft delete
    
    @classmethod
    def get_active(cls, session_id):
        return cls.query.filter_by(
            id=session_id,
            deleted_at=None
        ).filter(
            cls.expires_at > datetime.utcnow()
        ).first()

# Migration checklist
migration_safety:
  - [ ] Test migration on staging with production data copy
  - [ ] Run migration in transaction (rollback on error)
  - [ ] Never DROP or TRUNCATE in production migrations
  - [ ] Have immediate rollback plan ready
  - [ ] Schedule during low-traffic window
  - [ ] Monitor session counts during/after migration
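
The "never DROP or TRUNCATE" item can be partly enforced by a pre-merge check. The following is a rough sketch, assuming migrations are available as SQL text. `PROTECTED_TABLES` and `find_destructive_statements` are hypothetical names, and the regex is deliberately simple: it does not parse SQL, ignores quoted schema names, and will also flag statements inside SQL comments.

```python
import re

# Hypothetical list of tables a migration must never wipe.
PROTECTED_TABLES = {"sessions"}

DESTRUCTIVE = re.compile(
    r'\b(DROP\s+TABLE(?:\s+IF\s+EXISTS)?|TRUNCATE(?:\s+TABLE)?)'
    r'\s+(?:\w+\.)?[`"]?(\w+)',
    re.IGNORECASE,
)


def find_destructive_statements(sql, protected=PROTECTED_TABLES):
    """Return (operation, table) pairs for DROP/TRUNCATE on protected tables."""
    hits = []
    for match in DESTRUCTIVE.finditer(sql):
        table = match.group(2).lower()
        if table in protected:
            operation = match.group(1).split()[0].upper()
            hits.append((operation, table))
    return hits
```

Run against the two statements shown in this section, it reports one `DROP` and one `TRUNCATE` on `sessions`. A selective `DELETE FROM sessions WHERE ...` passes.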

Emergency Recovery Procedure

# mass_logout_recovery.py

def recover_from_mass_logout():
    """
    Emergency recovery when all users logged out.
    """
    
    # 1. Identify and fix root cause first!
    # (JWT key, Redis, cookies, etc.)
    
    # 2. If sessions are lost, allow re-authentication with reduced friction
    @app.route('/login', methods=['POST'])
    def login_with_reduced_friction():
        # Skip MFA for next 24 hours if user has trusted device
        # Or use "remember this device" token
        
        user = authenticate_basic(request.form)
        
        if user.has_trusted_device(request.device_fingerprint):
            # Skip MFA - device was previously trusted
            return create_session(user)
        else:
            return require_mfa(user)
    
    # 3. Proactively notify users
    send_notification_to_all_users(
        "We experienced a brief authentication issue. "
        "You may need to log in again. We apologize for the inconvenience."
    )
    
    # 4. Monitor login success rate
    track_metric('login_success_rate')
    track_metric('support_tickets_auth')

Preventive Architecture

Monitoring & Alerts

# Alert rules for auth health
groups:
- name: auth_alerts
  rules:
  - alert: MassLogoutDetected
    expr: |
      deriv(active_sessions[5m]) < -1000  # Gauge falling by 1000+ sessions/sec
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Potential mass logout event"
      
  - alert: LoginSpikeAfterDrop
    expr: |
      rate(login_attempts[5m]) > 10 * avg_over_time(rate(login_attempts[1h])[1d:1h])
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual login spike - possible auth issue"
      
  - alert: SessionStoreUnhealthy
    expr: redis_up == 0 or redis_connected_slaves < 1
    for: 30s
    labels:
      severity: critical
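
The MassLogoutDetected rule asks how fast the active-session gauge is falling. The same check can be sketched in a few lines for a smoke test or a runbook script. The function names and the sample format are assumptions, and this version compares only the first and last sample rather than fitting a trend over the whole window.

```python
def sessions_lost_per_second(samples):
    """samples: list of (unix_timestamp, active_session_count), oldest first."""
    (t0, count0), (t1, count1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (count0 - count1) / (t1 - t0)


def is_mass_logout(samples, threshold_per_second=1000):
    """True if sessions are disappearing faster than the threshold."""
    return sessions_lost_per_second(samples) > threshold_per_second
```

Losing one million of ten million sessions over five minutes is roughly 3,333 sessions per second and trips the check. Losing ten thousand over the same window is about 33 per second and does not.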

Key Takeaways

Never rotate secrets without a grace period - Accept both old and new keys during the transition
Protect session stores - Disable FLUSHALL, enable persistence
Test auth changes extensively - Cookie changes are subtle
Have fallbacks - Cache tokens locally for provider outages
Monitor session counts - Alert on unusual drops
Document recovery procedures - Know exactly what to do at 3 AM

Golden rule: Auth failures affect 100% of users instantly. Test more, deploy less, monitor constantly.
