DNS Change Gone Wrong
The Interview Question
"We switched our DNS record from the old load balancer to the new one. Half our users can reach the site, half can't, and it's been 4 hours since the change. What's happening, and how do we fix it?"
Asked at: Amazon, Cloudflare, any company with global infrastructure
Time to solve: 25-30 minutes
Difficulty: ⭐⭐⭐ (Senior SRE)
Clarifying Questions to Ask
- "What was the TTL on the old record?" → Determines cache duration
- "Did we test the new endpoint before switching?" → Might be misconfigured
- "Are affected users in specific regions?" → DNS propagation varies by region
- "Can users access via IP directly?" → Isolates DNS vs app issue
- "What kind of DNS change?" → A record, CNAME, nameserver change?
Why DNS Changes Take So Long
The problem: even if you set TTL=300 (5 minutes), some resolvers ignore it!
- Google Public DNS (8.8.8.8): usually respects TTL
- ISP resolvers: often cache longer (hours to days)
- Corporate proxies: may cache indefinitely
- CDN edge caches: have their own TTLs
Immediate Diagnosis
Step 1: Check DNS Propagation
```bash
# Check what different public resolvers see
dig example.com @8.8.8.8        # Google
dig example.com @1.1.1.1        # Cloudflare
dig example.com @208.67.222.222 # OpenDNS
dig example.com @9.9.9.9        # Quad9

# Ask the authoritative servers directly (no caching)
dig example.com @ns1.your-dns-provider.com

# Against a recursive resolver, the TTL column shows time left in its cache
dig example.com +ttlunits

# Quick propagation check across resolvers
for ns in 8.8.8.8 1.1.1.1 208.67.222.222; do
  echo "=== $ns ==="
  dig +short example.com @$ns
done
```
Step 2: Identify What Changed
```bash
# Compare old vs new records
# Old: 192.168.1.100 (old LB)
# New: 192.168.2.200 (new LB)

# Note: whois only reflects registrar-level changes (e.g. nameserver moves),
# not individual record edits -- check your DNS provider's change log as well
whois example.com | grep -i "updated"
```
Step 3: Verify New Endpoint Works
```bash
# Bypass DNS: hit the new LB's IP directly with the production Host header
curl -H "Host: example.com" http://192.168.2.200/health

# Check that TLS works with the new IP (SNI set via -servername)
openssl s_client -connect 192.168.2.200:443 -servername example.com
```
Root Causes
Cause 1: High TTL on Old Records
```bash
# If the old TTL was 86400 (24 hours), a resolver that cached the record
# just before the change can keep serving the old IP for up to 24 hours!

# Correct procedure (should have happened BEFORE the change):
# 1. Lower TTL to 300 (5 min)
# 2. Wait at least the OLD TTL (24+ hours here) so the low TTL propagates
# 3. THEN make the change
# 4. Restore the TTL once the change is stable
```
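As a back-of-the-envelope check, here is a sketch of that timeline math (the `safe_cutover_plan` helper is illustrative, not part of any real tooling): the worst case is a resolver that cached the record with the old TTL one second before you lowered it, so the earliest safe cutover is a full old-TTL after the lowering.

```python
from datetime import datetime, timedelta

def safe_cutover_plan(old_ttl_s, low_ttl_s=300, now=None):
    """Worst case: a resolver cached the record with the OLD TTL just before
    you lowered it, so wait the full old TTL before cutting over."""
    now = now or datetime.utcnow()
    return {
        "lower_ttl_to": low_ttl_s,
        "earliest_safe_cutover": now + timedelta(seconds=old_ttl_s),
        # After cutover, remaining stale caches drain within the lowered TTL
        "stale_window_after_cutover_s": low_ttl_s,
    }

plan = safe_cutover_plan(old_ttl_s=86400, now=datetime(2024, 1, 1))
print(plan["earliest_safe_cutover"])  # 2024-01-02 00:00:00
```

With a 24-hour old TTL, lowering it on Monday means the earliest clean cutover is Tuesday, and stale caches after the switch clear within 5 minutes.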
Cause 2: New Load Balancer Not Ready
```bash
# Common issues:
# - SSL certificate not installed
# - Security group blocking traffic
# - Health checks failing
# - Wrong backend servers configured

# Verify:
curl -v https://192.168.2.200/
# Check for SSL errors, connection refused, etc.
```
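The same check can be scripted for repeated polling during the migration. A minimal stdlib-only sketch (`check_endpoint` is a hypothetical helper name): it connects to the load balancer by IP and presents the real Host header, so virtual-host routing behaves as it would for production traffic while DNS is bypassed entirely.

```python
import http.client

def check_endpoint(ip, host, path="/health", port=80, timeout=5):
    """GET `path` from the LB at `ip`, sending `host` as the Host header
    so name-based routing on the LB sees the production hostname."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": host})
        return conn.getresponse().status
    finally:
        conn.close()

# e.g. check_endpoint("192.168.2.200", "example.com")  -> expect 200
```

For HTTPS with SNI you would use `http.client.HTTPSConnection` plus an `ssl` context instead; the structure is the same.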
Cause 3: Split-Horizon DNS / GeoDNS Issues
If you use GeoDNS, different regions may legitimately receive different records. Note that public resolvers are anycast, so querying them from one location only shows what your nearest PoP of each resolver sees; a true per-region view requires probes inside each region.

```python
import dns.resolver  # pip install dnspython

resolvers = {
    'Google': '8.8.8.8',
    'Cloudflare': '1.1.1.1',
    'Quad9': '9.9.9.9',
}

for name, ip in resolvers.items():
    r = dns.resolver.Resolver()
    r.nameservers = [ip]
    try:
        answer = r.resolve('example.com', 'A')
        print(f"{name}: {[rdata.address for rdata in answer]}")
    except Exception as e:
        print(f"{name}: Error - {e}")
```
Cause 4: CNAME Chain Broken
```bash
# If you changed a CNAME target:
#   www.example.com  CNAME  lb.example.com
#   lb.example.com   CNAME  old-lb.example.com
# and old-lb.example.com was deleted or changed, the whole chain breaks!
# (Note: the zone apex example.com cannot itself be a CNAME.)

# A recursive query returns the full chain in the ANSWER section:
dig www.example.com
# +trace walks the delegation from the root if you suspect NS-level issues:
dig +trace www.example.com
```
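To see exactly how a deleted intermediate record kills resolution, here is a toy in-memory resolver (the record table and `follow_cname` helper are illustrative, not a real DNS client):

```python
def follow_cname(name, records, max_hops=8):
    """Follow CNAMEs in `records` ({name: (type, value)}) until an A record
    or a dead end; returns (chain, final_ip_or_None)."""
    chain = [name]
    while len(chain) <= max_hops:
        rec = records.get(name)
        if rec is None:
            return chain, None          # broken chain: target record missing
        rtype, value = rec
        if rtype == "A":
            return chain, value         # resolved to a final address
        name = value                    # CNAME: hop to the target name
        chain.append(name)
    raise RuntimeError("CNAME loop or chain too long")

records = {
    "www.example.com": ("CNAME", "lb.example.com"),
    "lb.example.com": ("CNAME", "old-lb.example.com"),
    # "old-lb.example.com" was deleted -> resolution dies here
}
print(follow_cname("www.example.com", records))
# (['www.example.com', 'lb.example.com', 'old-lb.example.com'], None)
```

Re-adding the missing `old-lb.example.com` A record (or repointing `lb.example.com`) restores the chain.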
Emergency Fixes
Fix 1: Rollback DNS
```bash
# Fastest fix: revert the DNS change.
# Restore the old A record pointing at the old load balancer (AWS Route 53):
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "192.168.1.100"}]
      }
    }]
  }'
```
Fix 2: Keep Both Endpoints Running
Don't turn off the old load balancer until DNS has fully propagated. Run both old and new for 48-72 hours, pointing at the same backend servers:

```
OLD_LB (192.168.1.100) ──► Backend Pool
NEW_LB (192.168.2.200) ──► Backend Pool
```
Fix 3: Use DNS Failover
Configure health-checked DNS failover (note: any non-simple routing policy in Route 53, including failover, requires a `set_identifier` on each record):

```hcl
resource "aws_route53_health_check" "primary" {
  ip_address        = "192.168.2.200"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "primary"
  failover_routing_policy {
    type = "PRIMARY"
  }
  health_check_id = aws_route53_health_check.primary.id
  records         = ["192.168.2.200"]
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }
  records = ["192.168.1.100"] # Old LB as fallback
}
```
Fix 4: Weighted DNS for Gradual Migration
```hcl
# Route 10% to new, 90% to old initially
resource "aws_route53_record" "new" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "example.com"
  type    = "A"
  ttl     = 60
  weighted_routing_policy {
    weight = 10
  }
  set_identifier = "new-lb"
  records        = ["192.168.2.200"]
}

resource "aws_route53_record" "old" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "example.com"
  type    = "A"
  ttl     = 60
  weighted_routing_policy {
    weight = 90
  }
  set_identifier = "old-lb"
  records        = ["192.168.1.100"]
}

# Gradually shift: 10/90 → 50/50 → 90/10 → 100/0
```
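The weight shifts can also be driven from a script. A sketch that builds the Route 53 ChangeBatch for a given new-LB percentage (the dict shape matches what `change-resource-record-sets` expects; the `weighted_change_batch` helper name and the 100-point weight split are our assumptions):

```python
def weighted_change_batch(name, new_ip, old_ip, new_weight, ttl=60):
    """Build a ChangeBatch splitting traffic new_weight / (100 - new_weight)."""
    assert 0 <= new_weight <= 100

    def rrset(identifier, ip, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name, "Type": "A", "TTL": ttl,
                "SetIdentifier": identifier, "Weight": weight,
                "ResourceRecords": [{"Value": ip}],
            },
        }

    return {"Changes": [rrset("new-lb", new_ip, new_weight),
                        rrset("old-lb", old_ip, 100 - new_weight)]}

# Shift gradually: 10 -> 50 -> 90 -> 100
batch = weighted_change_batch("example.com", "192.168.2.200",
                              "192.168.1.100", new_weight=10)
# then apply with boto3, e.g.:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z1234567890", ChangeBatch=batch)
```

Driving the shift from code makes each step reviewable and trivially reversible: rollback is just `new_weight=0`.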
Safe DNS Change Procedure
```
dns_change_checklist:
  before_change:
    - [ ] Lower TTL to 60-300 seconds
    - [ ] Wait 2x old TTL for propagation
    - [ ] Verify new endpoint is healthy
    - [ ] Prepare rollback plan
    - [ ] Schedule during low traffic
    - [ ] Alert team of maintenance window
  during_change:
    - [ ] Make DNS change
    - [ ] Monitor both old and new endpoints
    - [ ] Watch error rates in monitoring
    - [ ] Check propagation with multiple DNS servers
  after_change:
    - [ ] Keep old endpoint running 48-72 hours
    - [ ] Verify traffic shifting to new endpoint
    - [ ] Check for errors from both endpoints
    - [ ] Restore TTL to normal (3600+) after stable
    - [ ] Decommission old endpoint only when traffic = 0
```
Monitoring DNS Changes
```python
# dns_monitor.py -- requires: pip install dnspython prometheus_client
import time

import dns.resolver
from prometheus_client import Gauge, start_http_server

dns_propagation = Gauge('dns_propagation_complete',
                        'DNS propagation status by resolver',
                        ['resolver'])

RESOLVERS = {
    'google': '8.8.8.8',
    'cloudflare': '1.1.1.1',
    'opendns': '208.67.222.222',
}
EXPECTED_IP = '192.168.2.200'  # New IP we're migrating to

def check_dns(domain, resolver_ip):
    r = dns.resolver.Resolver()
    r.nameservers = [resolver_ip]
    try:
        answers = r.resolve(domain, 'A')
        return any(rdata.address == EXPECTED_IP for rdata in answers)
    except Exception:
        return False

def monitor_propagation(domain):
    while True:
        # Track results locally instead of reading back private Gauge state
        results = {}
        for name, resolver_ip in RESOLVERS.items():
            results[name] = check_dns(domain, resolver_ip)
            dns_propagation.labels(resolver=name).set(1 if results[name] else 0)
            print(f"{name}: {'✅' if results[name] else '❌'}")
        if all(results.values()):
            print("🎉 DNS fully propagated!")
        time.sleep(60)

if __name__ == '__main__':
    start_http_server(8000)
    monitor_propagation('example.com')
```
Key Takeaways
- Lower TTL first - Days before the actual change
- Keep old running - For 48-72 hours minimum
- Test new endpoint - Before switching DNS
- Use weighted routing - For gradual migration
- Monitor propagation - Check multiple resolvers
- Have rollback ready - Can revert in seconds
Golden rule: DNS changes are "eventually consistent" - plan for worst-case TTL caching.