DNS Change Gone Wrong

The Interview Question

"We changed our DNS from old load balancer to new one. Half our users can access the site, half can't. It's been 4 hours since the change. What's happening and how do we fix it?"

Asked at: Amazon, Cloudflare, any company with global infrastructure

Time to solve: 25-30 minutes

Difficulty: ⭐⭐⭐ (Senior SRE)


Clarifying Questions to Ask

  1. "What was the TTL on the old record?" → Determines cache duration
  2. "Did we test the new endpoint before switching?" → Might be misconfigured
  3. "Are affected users in specific regions?" → DNS propagation varies by region
  4. "Can users access via IP directly?" → Isolates DNS vs app issue
  5. "What kind of DNS change?" → A record, CNAME, nameserver change?

Why DNS Changes Take So Long

The problem: Even if you set TTL=300 (5 min), some resolvers ignore it!

Google DNS (8.8.8.8): Usually respects TTL
ISP resolvers: Often cache longer (hours to days)
Corporate proxies: May cache indefinitely
CDN edge caches: Have their own TTL
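One way to reason about this: the effective worst-case cache time is bounded by whichever is larger, your record's TTL or a resolver's own cache floor. A minimal sketch of that arithmetic (the floor values below are illustrative assumptions, not published figures):

```python
def worst_case_cache_seconds(record_ttl: int, resolver_floors: dict) -> int:
    """Worst-case cache duration across resolvers: each resolver holds the
    record for at least max(record TTL, its own cache floor)."""
    return max(max(record_ttl, floor) for floor in resolver_floors.values())

# Illustrative floors (assumptions, not measured values)
floors = {
    'google': 0,               # generally honors TTL
    'isp': 3600,               # often caches at least an hour
    'corporate_proxy': 86400,  # may cache a day or more
}
print(worst_case_cache_seconds(300, floors))  # 86400
```

Even with TTL=300, the misbehaving resolver dominates — which is why "half the users" can stay stuck for hours.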

Immediate Diagnosis

Step 1: Check DNS Propagation

# Check what different DNS servers see
dig example.com @8.8.8.8 # Google
dig example.com @1.1.1.1 # Cloudflare
dig example.com @208.67.222.222 # OpenDNS
dig example.com @9.9.9.9 # Quad9

# Check authoritative servers directly
dig example.com @ns1.your-dns-provider.com

# Check remaining TTL on cached record
dig example.com +ttlunits

# Full propagation check
for ns in 8.8.8.8 1.1.1.1 208.67.222.222; do
  echo "=== $ns ==="
  dig +short example.com @$ns
done

Step 2: Identify What Changed

# Compare old vs new records
# Old: 192.168.1.100 (old LB)
# New: 192.168.2.200 (new LB)

# Check WHOIS for recent registrar-level changes
# (note: the Updated Date reflects registrar changes such as nameserver
# updates, not individual DNS record edits)
whois example.com | grep -i "updated"

Step 3: Verify New Endpoint Works

# Bypass DNS, test directly
curl -H "Host: example.com" http://192.168.2.200/health

# For HTTPS, pin the IP while keeping SNI and certificate validation
curl --resolve example.com:443:192.168.2.200 https://example.com/health

# Check if SSL works with new IP
openssl s_client -connect 192.168.2.200:443 -servername example.com

Root Causes

Cause 1: High TTL on Old Records

# If old TTL was 86400 (24 hours)
# Users will keep using old IP for up to 24 hours!

# Solution: Should have lowered TTL BEFORE the change
# 1. Lower TTL to 300 (5 min)
# 2. Wait for old TTL to expire (24+ hours)
# 3. THEN make the change
# 4. Restore TTL after change is stable
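The waiting math above can be sketched as a quick helper (a hypothetical utility, not a standard tool): a resolver that cached the record just before you lowered the TTL may keep it for the full old TTL, so the cutover is only safe after that entire window has elapsed.

```python
from datetime import datetime, timedelta

def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Earliest time the record change is safe: a resolver that cached the
    record just before the TTL was lowered can serve the old answer for up
    to old_ttl_seconds after that moment."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

# Example: TTL lowered at noon on Jan 1, old TTL was 86400 (24 hours)
lowered = datetime(2024, 1, 1, 12, 0, 0)
print(earliest_safe_change(lowered, 86400))  # 2024-01-02 12:00:00
```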

Cause 2: New Load Balancer Not Ready

# Common issues:
# - SSL certificate not installed
# - Security group blocking traffic
# - Health checks failing
# - Wrong backend servers configured

# Verify:
curl -v https://192.168.2.200/
# Check for SSL errors, connection refused, etc.

Cause 3: Split-Horizon DNS / GeoDNS Issues

# If using GeoDNS, some regions might get different records.
# Querying public anycast resolvers only approximates a regional view;
# for a true GeoDNS check, use EDNS Client Subnet (dig +subnet=<client-ip>/24)

import dns.resolver

resolvers = {
    'US': '8.8.8.8',
    'EU': '1.1.1.1',
    'APAC': '9.9.9.9'
}

for region, resolver_ip in resolvers.items():
    r = dns.resolver.Resolver()
    r.nameservers = [resolver_ip]
    try:
        answer = r.resolve('example.com', 'A')
        print(f"{region}: {[rdata.address for rdata in answer]}")
    except Exception as e:
        print(f"{region}: Error - {e}")

Cause 4: CNAME Chain Broken

# If you changed a CNAME target:
www.example.com CNAME → lb.example.com → old-lb.example.com

# And old-lb.example.com was deleted or changed:
# The whole chain breaks!
# (Note: the zone apex example.com can't be a CNAME; use an ALIAS/A record there)

# Check the chain (the ANSWER section shows each CNAME hop):
dig www.example.com

# Trace the delegation from the root servers:
dig +trace www.example.com
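To pinpoint exactly where a chain breaks, a small chain walker helps. This sketch takes a lookup function as a parameter so it can be tested offline; in production you would pass a real DNS lookup (e.g. via dnspython). All names and the zone data are illustrative:

```python
def walk_cname_chain(name, lookup, max_hops=10):
    """Follow CNAMEs until an A record or a dead end is found.
    `lookup(name)` returns ('CNAME', target), ('A', ip), or None."""
    chain = [name]
    for _ in range(max_hops):
        result = lookup(name)
        if result is None:
            return chain, None        # broken: this name doesn't resolve
        rtype, value = result
        if rtype == 'A':
            return chain, value       # healthy chain ends at an IP
        name = value                  # CNAME: follow the target
        chain.append(name)
    raise RuntimeError("CNAME loop or chain too long")

# Offline example: a fake zone where the old LB record was deleted
zone = {
    'www.example.com': ('CNAME', 'old-lb.example.com'),
    # 'old-lb.example.com' is missing -- the deleted record
}
chain, ip = walk_cname_chain('www.example.com', zone.get)
print(chain, ip)  # ['www.example.com', 'old-lb.example.com'] None
```

The last name in the returned chain with a `None` result is the record you need to restore or repoint.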

Emergency Fixes

Fix 1: Rollback DNS

# Fastest fix: revert the DNS change
# Restore old A record pointing to old load balancer

# Using AWS Route 53 CLI:
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "192.168.1.100"}]
      }
    }]
  }'

Fix 2: Keep Both Endpoints Running

# Don't turn off old load balancer until DNS fully propagates
# Run both old and new for 48-72 hours

# Both should point to same backend servers
OLD_LB (192.168.1.100) → Backend Pool
NEW_LB (192.168.2.200) → Backend Pool
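While both load balancers run, it's worth confirming they serve identical content. A minimal sketch with injectable fetch functions so it can be tested offline; in practice each fetcher would be an HTTP call against one LB's IP with the Host header set (the paths and responses below are assumptions):

```python
def endpoints_consistent(fetch_old, fetch_new, paths):
    """Compare responses from the old and new LB for each path;
    return the list of paths whose responses differ."""
    return [p for p in paths if fetch_old(p) != fetch_new(p)]

# Offline example with stubbed responses
old_responses = {'/health': 'ok', '/': 'home-v1'}
new_responses = {'/health': 'ok', '/': 'home-v2'}  # mismatch on '/'
diffs = endpoints_consistent(old_responses.get, new_responses.get,
                             ['/health', '/'])
print(diffs)  # ['/']
```

Any path in the diff list means the two LBs aren't pointing at the same backend pool (or are serving different versions) and the migration should pause.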

Fix 3: Use DNS Failover

# Configure health-checked DNS failover
resource "aws_route53_health_check" "primary" {
  ip_address        = "192.168.2.200"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "primary" # Required for failover routing

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  records         = ["192.168.2.200"]
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  records = ["192.168.1.100"] # Old LB as fallback
}

Fix 4: Weighted DNS for Gradual Migration

# Route 10% to new, 90% to old initially
resource "aws_route53_record" "new" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "new-lb"

  weighted_routing_policy {
    weight = 10
  }

  records = ["192.168.2.200"]
}

resource "aws_route53_record" "old" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "old-lb"

  weighted_routing_policy {
    weight = 90
  }

  records = ["192.168.1.100"]
}

# Gradually shift: 10/90 → 50/50 → 90/10 → 100/0
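The gradual shift can be scripted: build the Route 53 change batch for each weight step and apply it (via boto3 or the CLI) after a soak period at each stage. This sketch only constructs the JSON bodies; the record name, set identifiers, and step schedule are illustrative:

```python
def weight_change_batch(name, set_id, ip, weight, ttl=60):
    """Build a Route 53 UPSERT change batch for one weighted A record."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            }
        }]
    }

# Shift schedule: (new_weight, old_weight)
schedule = [(10, 90), (50, 50), (90, 10), (100, 0)]
for new_w, old_w in schedule:
    batches = [
        weight_change_batch("example.com", "new-lb", "192.168.2.200", new_w),
        weight_change_batch("example.com", "old-lb", "192.168.1.100", old_w),
    ]
    # Here you'd submit each batch with
    # route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch),
    # then watch error rates before moving to the next step.
```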

Safe DNS Change Procedure

dns_change_checklist:

  before_change:
    - [ ] Lower TTL to 60-300 seconds
    - [ ] Wait 2x old TTL for propagation
    - [ ] Verify new endpoint is healthy
    - [ ] Prepare rollback plan
    - [ ] Schedule during low traffic
    - [ ] Alert team of maintenance window

  during_change:
    - [ ] Make DNS change
    - [ ] Monitor both old and new endpoints
    - [ ] Watch error rates in monitoring
    - [ ] Check propagation with multiple DNS servers

  after_change:
    - [ ] Keep old endpoint running 48-72 hours
    - [ ] Verify traffic shifting to new endpoint
    - [ ] Check for errors from both endpoints
    - [ ] Restore TTL to normal (3600+) after stable
    - [ ] Decommission old endpoint only when traffic = 0

Monitoring DNS Changes

# dns_monitor.py
import time

import dns.resolver
from prometheus_client import Gauge, start_http_server

dns_propagation = Gauge('dns_propagation_complete',
                        'DNS propagation status by resolver',
                        ['resolver'])

RESOLVERS = {
    'google': '8.8.8.8',
    'cloudflare': '1.1.1.1',
    'opendns': '208.67.222.222'
}

EXPECTED_IP = '192.168.2.200'  # New IP we're migrating to

def check_dns(domain, resolver_ip):
    r = dns.resolver.Resolver()
    r.nameservers = [resolver_ip]
    try:
        answers = r.resolve(domain, 'A')
        return any(rdata.address == EXPECTED_IP for rdata in answers)
    except Exception:
        return False

def monitor_propagation(domain):
    while True:
        # Track results locally instead of reading back gauge internals
        results = {}
        for name, resolver_ip in RESOLVERS.items():
            results[name] = check_dns(domain, resolver_ip)
            dns_propagation.labels(resolver=name).set(1 if results[name] else 0)
            print(f"{name}: {'✅' if results[name] else '❌'}")

        if all(results.values()):
            print("🎉 DNS fully propagated!")

        time.sleep(60)

if __name__ == '__main__':
    start_http_server(8000)
    monitor_propagation('example.com')

Key Takeaways

  1. Lower TTL first - Days before the actual change
  2. Keep old running - For 48-72 hours minimum
  3. Test new endpoint - Before switching DNS
  4. Use weighted routing - For gradual migration
  5. Monitor propagation - Check multiple resolvers
  6. Have rollback ready - Can revert in seconds

Golden rule: DNS changes are "eventually consistent" - plan for worst-case TTL caching.