DNS Change Gone Wrong

The Interview Question

"We changed our DNS from old load balancer to new one. Half our users can access the site, half can't. It's been 4 hours since the change. What's happening and how do we fix it?"

Asked at: Amazon, Cloudflare, any company with global infrastructure

Time to solve: 25-30 minutes

Difficulty: ⭐⭐⭐ (Senior SRE)


Clarifying Questions to Ask

  1. "What was the TTL on the old record?" → Determines cache duration
  2. "Did we test the new endpoint before switching?" → Might be misconfigured
  3. "Are affected users in specific regions?" → DNS propagation varies by region
  4. "Can users access via IP directly?" → Isolates DNS vs app issue
  5. "What kind of DNS change?" → A record, CNAME, nameserver change?

Why DNS Changes Take So Long

The problem: Even if you set TTL=300 (5 min), some resolvers ignore it!

Google DNS (8.8.8.8): Usually respects TTL
ISP resolvers: Often cache longer (hours to days)
Corporate proxies: May cache indefinitely
CDN edge caches: Have their own TTL
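One way to reason about this: the effective worst-case cache time is bounded by whichever is larger, your record's TTL or a resolver's own cache floor. A minimal sketch of that arithmetic (the floor values below are illustrative assumptions, not published figures):

```python
def worst_case_cache_seconds(record_ttl: int, resolver_floors: dict) -> int:
    """Worst-case cache duration across resolvers: each resolver holds the
    record for at least max(record TTL, its own cache floor)."""
    return max(max(record_ttl, floor) for floor in resolver_floors.values())

# Illustrative floors (assumptions, not measured values)
floors = {
    'google': 0,               # generally honors TTL
    'isp': 3600,               # often caches at least an hour
    'corporate_proxy': 86400,  # may cache a day or more
}
print(worst_case_cache_seconds(300, floors))  # 86400
```

Even with TTL=300, the misbehaving resolver dominates — which is why "half the users" can stay stuck for hours.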

Immediate Diagnosis

Step 1: Check DNS Propagation

# Check what different DNS servers see
dig example.com @8.8.8.8 # Google
dig example.com @1.1.1.1 # Cloudflare
dig example.com @208.67.222.222 # OpenDNS
dig example.com @9.9.9.9 # Quad9

# Check authoritative servers directly
dig example.com @ns1.your-dns-provider.com

# Check remaining TTL on cached record
dig example.com +ttlunits

# Full propagation check
for ns in 8.8.8.8 1.1.1.1 208.67.222.222; do
  echo "=== $ns ==="
  dig +short example.com @$ns
done

Step 2: Identify What Changed

# Compare old vs new records
# Old: 192.168.1.100 (old LB)
# New: 192.168.2.200 (new LB)

# Check WHOIS for recent registrar-level changes
# (note: the Updated Date reflects registrar changes such as nameserver
# updates, not individual DNS record edits)
whois example.com | grep -i "updated"

Step 3: Verify New Endpoint Works

# Bypass DNS, test directly
curl -H "Host: example.com" http://192.168.2.200/health

# For HTTPS, pin the IP while keeping SNI and certificate validation
curl --resolve example.com:443:192.168.2.200 https://example.com/health

# Check if SSL works with new IP
openssl s_client -connect 192.168.2.200:443 -servername example.com

Root Causes

Cause 1: High TTL on Old Records

# If old TTL was 86400 (24 hours)
# Users will keep using old IP for up to 24 hours!

# Solution: Should have lowered TTL BEFORE the change
# 1. Lower TTL to 300 (5 min)
# 2. Wait for old TTL to expire (24+ hours)
# 3. THEN make the change
# 4. Restore TTL after change is stable
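The waiting math above can be sketched as a quick helper (a hypothetical utility, not a standard tool): a resolver that cached the record just before you lowered the TTL may keep it for the full old TTL, so the cutover is only safe after that entire window has elapsed.

```python
from datetime import datetime, timedelta

def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Earliest time the record change is safe: a resolver that cached the
    record just before the TTL was lowered can serve the old answer for up
    to old_ttl_seconds after that moment."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

# Example: TTL lowered at noon on Jan 1, old TTL was 86400 (24 hours)
lowered = datetime(2024, 1, 1, 12, 0, 0)
print(earliest_safe_change(lowered, 86400))  # 2024-01-02 12:00:00
```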

Cause 2: New Load Balancer Not Ready

# Common issues:
# - SSL certificate not installed
# - Security group blocking traffic
# - Health checks failing
# - Wrong backend servers configured

# Verify:
curl -v https://192.168.2.200/
# Check for SSL errors, connection refused, etc.

Cause 3: Split-Horizon DNS / GeoDNS Issues

# If using GeoDNS, some regions might get different records.
# Querying public anycast resolvers only approximates a regional view;
# for a true GeoDNS check, use EDNS Client Subnet (dig +subnet=<client-ip>/24)

import dns.resolver

resolvers = {
    'US': '8.8.8.8',
    'EU': '1.1.1.1',
    'APAC': '9.9.9.9'
}

for region, resolver_ip in resolvers.items():
    r = dns.resolver.Resolver()
    r.nameservers = [resolver_ip]
    try:
        answer = r.resolve('example.com', 'A')
        print(f"{region}: {[rdata.address for rdata in answer]}")
    except Exception as e:
        print(f"{region}: Error - {e}")

Cause 4: CNAME Chain Broken

# If you changed a CNAME target:
www.example.com CNAME → lb.example.com → old-lb.example.com

# And old-lb.example.com was deleted or changed:
# The whole chain breaks!
# (Note: the zone apex example.com can't be a CNAME; use an ALIAS/A record there)

# Check the chain (the ANSWER section shows each CNAME hop):
dig www.example.com

# Trace the delegation from the root servers:
dig +trace www.example.com
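To pinpoint exactly where a chain breaks, a small chain walker helps. This sketch takes a lookup function as a parameter so it can be tested offline; in production you would pass a real DNS lookup (e.g. via dnspython). All names and the zone data are illustrative:

```python
def walk_cname_chain(name, lookup, max_hops=10):
    """Follow CNAMEs until an A record or a dead end is found.
    `lookup(name)` returns ('CNAME', target), ('A', ip), or None."""
    chain = [name]
    for _ in range(max_hops):
        result = lookup(name)
        if result is None:
            return chain, None        # broken: this name doesn't resolve
        rtype, value = result
        if rtype == 'A':
            return chain, value       # healthy chain ends at an IP
        name = value                  # CNAME: follow the target
        chain.append(name)
    raise RuntimeError("CNAME loop or chain too long")

# Offline example: a fake zone where the old LB record was deleted
zone = {
    'www.example.com': ('CNAME', 'old-lb.example.com'),
    # 'old-lb.example.com' is missing -- the deleted record
}
chain, ip = walk_cname_chain('www.example.com', zone.get)
print(chain, ip)  # ['www.example.com', 'old-lb.example.com'] None
```

The last name in the returned chain with a `None` result is the record you need to restore or repoint.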

Emergency Fixes

Fix 1: Rollback DNS

# Fastest fix: revert the DNS change
# Restore old A record pointing to old load balancer

# Using AWS Route 53 CLI:
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "192.168.1.100"}]
      }
    }]
  }'

Fix 2: Keep Both Endpoints Running

# Don't turn off old load balancer until DNS fully propagates
# Run both old and new for 48-72 hours

# Both should point to same backend servers
OLD_LB (192.168.1.100) → Backend Pool
NEW_LB (192.168.2.200) → Backend Pool
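While both load balancers run, it's worth confirming they serve identical content. A minimal sketch with injectable fetch functions so it can be tested offline; in practice each fetcher would be an HTTP call against one LB's IP with the Host header set (the paths and responses below are assumptions):

```python
def endpoints_consistent(fetch_old, fetch_new, paths):
    """Compare responses from the old and new LB for each path;
    return the list of paths whose responses differ."""
    return [p for p in paths if fetch_old(p) != fetch_new(p)]

# Offline example with stubbed responses
old_responses = {'/health': 'ok', '/': 'home-v1'}
new_responses = {'/health': 'ok', '/': 'home-v2'}  # mismatch on '/'
diffs = endpoints_consistent(old_responses.get, new_responses.get,
                             ['/health', '/'])
print(diffs)  # ['/']
```

Any path in the diff list means the two LBs aren't pointing at the same backend pool (or are serving different versions) and the migration should pause.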

Fix 3: Use DNS Failover

# Configure health-checked DNS failover
resource "aws_route53_health_check" "primary" {
  ip_address        = "192.168.2.200"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "primary" # Required for failover routing

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  records         = ["192.168.2.200"]
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  records = ["192.168.1.100"] # Old LB as fallback
}

Fix 4: Weighted DNS for Gradual Migration

# Route 10% to new, 90% to old initially
resource "aws_route53_record" "new" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "new-lb"

  weighted_routing_policy {
    weight = 10
  }

  records = ["192.168.2.200"]
}

resource "aws_route53_record" "old" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "old-lb"

  weighted_routing_policy {
    weight = 90
  }

  records = ["192.168.1.100"]
}

# Gradually shift: 10/90 → 50/50 → 90/10 → 100/0
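The gradual shift can be scripted: build the Route 53 change batch for each weight step and apply it (via boto3 or the CLI) after a soak period at each stage. This sketch only constructs the JSON bodies; the record name, set identifiers, and step schedule are illustrative:

```python
def weight_change_batch(name, set_id, ip, weight, ttl=60):
    """Build a Route 53 UPSERT change batch for one weighted A record."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            }
        }]
    }

# Shift schedule: (new_weight, old_weight)
schedule = [(10, 90), (50, 50), (90, 10), (100, 0)]
for new_w, old_w in schedule:
    batches = [
        weight_change_batch("example.com", "new-lb", "192.168.2.200", new_w),
        weight_change_batch("example.com", "old-lb", "192.168.1.100", old_w),
    ]
    # Here you'd submit each batch with
    # route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch),
    # then watch error rates before moving to the next step.
```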

Safe DNS Change Procedure

dns_change_checklist:

  before_change:
    - [ ] Lower TTL to 60-300 seconds
    - [ ] Wait 2x old TTL for propagation
    - [ ] Verify new endpoint is healthy
    - [ ] Prepare rollback plan
    - [ ] Schedule during low traffic
    - [ ] Alert team of maintenance window

  during_change:
    - [ ] Make DNS change
    - [ ] Monitor both old and new endpoints
    - [ ] Watch error rates in monitoring
    - [ ] Check propagation with multiple DNS servers

  after_change:
    - [ ] Keep old endpoint running 48-72 hours
    - [ ] Verify traffic shifting to new endpoint
    - [ ] Check for errors from both endpoints
    - [ ] Restore TTL to normal (3600+) after stable
    - [ ] Decommission old endpoint only when traffic = 0

Monitoring DNS Changes

# dns_monitor.py
import time

import dns.resolver
from prometheus_client import Gauge, start_http_server

dns_propagation = Gauge('dns_propagation_complete',
                        'DNS propagation status by resolver',
                        ['resolver'])

RESOLVERS = {
    'google': '8.8.8.8',
    'cloudflare': '1.1.1.1',
    'opendns': '208.67.222.222'
}

EXPECTED_IP = '192.168.2.200'  # New IP we're migrating to

def check_dns(domain, resolver_ip):
    r = dns.resolver.Resolver()
    r.nameservers = [resolver_ip]
    try:
        answers = r.resolve(domain, 'A')
        return any(rdata.address == EXPECTED_IP for rdata in answers)
    except Exception:
        return False

def monitor_propagation(domain):
    while True:
        # Track results locally instead of reading back gauge internals
        results = {}
        for name, resolver_ip in RESOLVERS.items():
            results[name] = check_dns(domain, resolver_ip)
            dns_propagation.labels(resolver=name).set(1 if results[name] else 0)
            print(f"{name}: {'✅' if results[name] else '❌'}")

        if all(results.values()):
            print("🎉 DNS fully propagated!")

        time.sleep(60)

if __name__ == '__main__':
    start_http_server(8000)
    monitor_propagation('example.com')

Key Takeaways

  1. Lower TTL first - Days before the actual change
  2. Keep old running - For 48-72 hours minimum
  3. Test new endpoint - Before switching DNS
  4. Use weighted routing - For gradual migration
  5. Monitor propagation - Check multiple resolvers
  6. Have rollback ready - Can revert in seconds

Golden rule: DNS changes are "eventually consistent" - plan for worst-case TTL caching.