DNS Failover Incident Handling Guide

DNS failover is a critical mechanism that automatically redirects traffic to a backup server when the primary becomes unavailable. This guide covers common DNS failover incidents, troubleshooting steps, and best practices for maintaining high availability.

Overview

DNS failover ensures service continuity by automatically switching traffic to healthy endpoints when primary services fail. It's essential for maintaining uptime and providing seamless user experiences during infrastructure failures.
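
From a client's point of view, a failover is simply a change in the answer a resolver returns. Using the example names and addresses assumed throughout this guide, it can be observed directly with dig:

# Before failover: the record points at the primary
dig +short api.example.com
# -> 192.168.1.10

# After the health check marks the primary down and the record is updated
dig +short api.example.com
# -> 192.168.1.11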

Common DNS Failover Scenarios

1. Primary Server Failure

Scenario: Primary server becomes completely unavailable

Primary: web1.example.com (192.168.1.10) - DOWN
Backup: web2.example.com (192.168.1.11) - UP
Result: Traffic automatically routed to backup

2. Partial Service Degradation

Scenario: Primary server responds but with high latency or errors

Primary: web1.example.com - 5s response time, 50% error rate
Backup: web2.example.com - 200ms response time, 0% error rate
Result: Health checks fail, traffic switches to backup
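
A health check can catch this kind of degradation by measuring response time and status codes rather than bare connectivity. A minimal sketch using curl (the 2-second timeout is illustrative; curl reports 000 when the request fails or times out):

#!/bin/bash
# degradation-check.sh - flag an endpoint as unhealthy if it is slow or erroring
ENDPOINT="http://192.168.1.10/health"
MAX_SECONDS=2

code=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$MAX_SECONDS" "$ENDPOINT")

if [ "$code" = "200" ]; then
    echo "healthy"
else
    # Non-200 response, slow response, or timeout
    echo "unhealthy (status: $code)"
    exit 1
fi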

3. Geographic Failover

Scenario: Regional data center failure

US-East: api-us.example.com   - DOWN (data center outage)
US-West: api-west.example.com - UP
Europe:  api-eu.example.com   - UP
Result:  Traffic distributed to healthy regions

DNS Failover Types

1. Active-Passive Failover

# Active-passive: only one record is published at a time
# Normal operation (primary healthy)
api.example.com. 300 IN A 192.168.1.10

# After failover (primary fails health checks, the record is swapped)
api.example.com. 300 IN A 192.168.1.11

# Health Check Configuration
health_check:
  primary: 192.168.1.10
  backup: 192.168.1.11
  interval: 30s
  timeout: 5s
  failure_threshold: 3

2. Active-Active Failover

# Load Balanced DNS Records
api.example.com. 300 IN A 192.168.1.10
api.example.com. 300 IN A 192.168.1.11
api.example.com. 300 IN A 192.168.1.12

# Health Check Configuration
health_check:
  endpoints:
    - 192.168.1.10
    - 192.168.1.11
    - 192.168.1.12
  interval: 30s
  timeout: 5s
  failure_threshold: 2

3. Weighted Round Robin with Failover

# Weighted DNS Records (weights are applied by the DNS provider, not stored in standard A records)
api.example.com. 300 IN A 192.168.1.10 ; Weight: 70%
api.example.com. 300 IN A 192.168.1.11 ; Weight: 30%

# Health Check Configuration
health_check:
  primary: 192.168.1.10
  backup: 192.168.1.11
  weights:
    primary: 70
    backup: 30
  interval: 30s
  timeout: 5s

Common DNS Failover Incidents

1. False Positive Failover

Symptoms:

  • Traffic switches to backup unnecessarily
  • Primary server is actually healthy
  • Users experience temporary service disruption

Root Causes:

  • Network connectivity issues between health checker and primary
  • Health check endpoint misconfiguration
  • DNS propagation delays
  • Firewall blocking health check traffic

Troubleshooting Steps:

# Check primary server health
curl -I http://192.168.1.10/health
ping 192.168.1.10

# Verify DNS resolution
dig api.example.com
nslookup api.example.com

# Check health check logs
tail -f /var/log/health-checker.log

# Verify network connectivity
traceroute 192.168.1.10
telnet 192.168.1.10 80

Resolution:

  1. Fix network connectivity issues
  2. Adjust health check parameters (for example, require several consecutive failures, as sketched below)
  3. Implement more robust health check endpoints
  4. Add monitoring for health check failures
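
One common adjustment is to require several consecutive failures before acting, so that a single dropped packet or transient network blip does not trigger an unnecessary failover. A minimal sketch (the threshold and interval are illustrative):

# Require N consecutive failures before declaring the primary unhealthy
FAILURE_THRESHOLD=3
failures=0

while true; do
    if curl -sf --max-time 5 "http://192.168.1.10/health" > /dev/null; then
        failures=0
    else
        failures=$((failures + 1))
    fi

    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
        echo "Primary failed $failures consecutive checks - triggering failover"
        # trigger failover here (e.g., update the DNS record)
        break
    fi
    sleep 30
done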

2. Failover Not Triggering

Symptoms:

  • Primary server is down but traffic still routes to it
  • Users cannot access the service
  • Backup server remains unused

Root Causes:

  • Health check service is down
  • DNS TTL too high
  • Health check configuration errors
  • DNS provider issues

Troubleshooting Steps:

# Check health check service status
systemctl status health-checker
ps aux | grep health-checker

# Verify health check configuration
cat /etc/health-checker/config.yaml

# Test health check manually
./health-checker --test --endpoint 192.168.1.10

# Check DNS provider status
curl -s https://api.dns-provider.com/status

# Verify DNS propagation
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

Resolution:

  1. Restart health check service
  2. Fix health check configuration
  3. Reduce DNS TTL for faster propagation (and verify the served TTL, as shown below)
  4. Contact DNS provider if needed
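
After lowering the TTL, confirm what resolvers are actually serving, since the old value keeps being returned from caches until it expires:

# Show the TTL currently returned for the record (second column)
dig +noall +answer api.example.com

# Example output - a 300s TTL still being served despite a lower configured value:
# api.example.com.  300  IN  A  192.168.1.10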

3. Split-Brain Scenario

Symptoms:

  • Traffic distributed between primary and backup
  • Inconsistent user experiences
  • Data synchronization issues

Root Causes:

  • Health check inconsistencies
  • DNS propagation delays
  • Multiple health check systems
  • Network partitioning

Troubleshooting Steps:

# Check health status from multiple locations
curl -I http://192.168.1.10/health
curl -I http://192.168.1.11/health

# Verify DNS resolution from different locations
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com
dig @208.67.222.222 api.example.com

# Check health check logs
tail -f /var/log/health-checker-primary.log
tail -f /var/log/health-checker-backup.log

# Monitor traffic distribution
netstat -an | grep :80 | wc -l

Resolution:

  1. Implement consistent health check logic
  2. Use single source of truth for health status
  3. Add coordination between health check systems
  4. Implement proper quorum mechanisms (see the sketch below)
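
A simple form of quorum is to fail over only when a majority of independent health checkers agree that the primary is down. A minimal sketch, assuming each checker exposes its latest verdict over HTTP at a hypothetical /verdict endpoint:

# Fail over only if a majority of health checkers report the primary as down
CHECKERS=("checker1.example.com" "checker2.example.com" "checker3.example.com")
down_votes=0

for checker in "${CHECKERS[@]}"; do
    # Each checker is assumed to return the string "down" or "up" for the target
    verdict=$(curl -s --max-time 5 "http://${checker}/verdict?target=192.168.1.10")
    if [ "$verdict" = "down" ]; then
        down_votes=$((down_votes + 1))
    fi
done

if [ "$down_votes" -gt $(( ${#CHECKERS[@]} / 2 )) ]; then
    echo "Quorum reached (${down_votes}/${#CHECKERS[@]}) - failing over"
else
    echo "No quorum - keeping primary"
fi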

DNS Failover Implementation

1. Cloud Provider Solutions

AWS Route 53 Health Checks

# Route 53 Health Check Configuration
health_check:
  type: HTTP
  resource_path: /health
  port: 80
  protocol: HTTP
  request_interval: 30
  failure_threshold: 3
  measure_latency: true
  enable_sni: false
  regions:
    - us-east-1
    - us-west-2
    - eu-west-1

# DNS Record with Failover
record:
  name: api.example.com
  type: A
  failover: PRIMARY
  health_check_id: "12345678-1234-1234-1234-123456789012"
  value: 192.168.1.10
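
If you manage Route 53 from the CLI, an equivalent health check can be created roughly as follows (the values are illustrative; consult the current aws route53 documentation for the full set of configuration fields):

# Create an HTTP health check against the primary endpoint
aws route53 create-health-check \
  --caller-reference "api-health-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "192.168.1.10",
    "Port": 80,
    "Type": "HTTP",
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'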

Google Cloud DNS Health Checks

# Cloud DNS Health Check
health_check:
  name: api-health-check
  type: HTTP
  port: 80
  path: /health
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2

# DNS Policy with Failover
dns_policy:
  name: api-failover-policy
  primary:
    - target: 192.168.1.10
      health_check: api-health-check
  backup:
    - target: 192.168.1.11

2. Self-Hosted Solutions

PowerDNS with Health Checks

# PowerDNS Configuration
pdns:
  health_check:
    enabled: true
    interval: 30
    timeout: 5
    retries: 3
    backend: "pipe"
    command: "/usr/local/bin/health-checker"

  records:
    - name: api.example.com
      type: A
      ttl: 300
      primary: 192.168.1.10
      backup: 192.168.1.11

Custom Health Check Script

#!/bin/bash
# health-checker.sh
# Simple active-passive failover: point the A record at whichever server
# passes its HTTP health check, preferring the primary.

PRIMARY_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
ZONE="example.com"
RECORD="api"          # api.example.com
TTL=300

check_health() {
    local ip=$1
    local response
    response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "http://${ip}/health")

    [ "$response" = "200" ]
}

update_dns() {
    local active_ip=$1

    # Point the A record at the healthy server. The command below assumes a
    # PowerDNS authoritative server managed with pdnsutil; substitute the
    # update command used by your own DNS tooling.
    pdnsutil replace-rrset "$ZONE" "$RECORD" A "$TTL" "$active_ip"
}

# Main health check logic: prefer the primary, fall back to the backup
if check_health "$PRIMARY_IP"; then
    echo "Primary server is healthy"
    update_dns "$PRIMARY_IP"
else
    echo "Primary server is unhealthy, checking backup"
    if check_health "$BACKUP_IP"; then
        echo "Switching to backup server"
        update_dns "$BACKUP_IP"
    else
        echo "Both servers are unhealthy"
        exit 1
    fi
fi
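
To keep the record current, the script can be run on a schedule, for example from cron on the health-checker host (the path, user, and one-minute interval are assumptions to adapt to your environment):

# /etc/cron.d/health-checker - run the failover check every minute
* * * * * root /usr/local/bin/health-checker.sh >> /var/log/health-checker.log 2>&1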

Monitoring and Alerting

1. Key Metrics to Monitor

# Prometheus Metrics
metrics:
  - name: dns_failover_events_total
    type: counter
    description: "Total number of DNS failover events"
    labels: [domain, reason]

  - name: dns_health_check_duration_seconds
    type: histogram
    description: "Duration of DNS health checks"
    labels: [endpoint, status]

  - name: dns_resolution_time_seconds
    type: histogram
    description: "DNS resolution time"
    labels: [domain, resolver]

  - name: dns_failover_duration_seconds
    type: histogram
    description: "Time taken to complete failover"
    labels: [domain]
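
One lightweight way to emit these metrics from a shell-based health checker is the node_exporter textfile collector, which scrapes Prometheus-format .prom files from a directory. A minimal sketch (the directory and state-file paths are assumptions and must match your node_exporter --collector.textfile.directory setting):

# Write the running total of failover events in Prometheus text format
METRICS_DIR="/var/lib/node_exporter/textfile_collector"   # assumed path
COUNT_FILE="/var/lib/health-checker/failover_count"       # assumed state file

emit_failover_metric() {
    local domain=$1
    local reason=$2
    local count
    count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$COUNT_FILE"

    cat > "${METRICS_DIR}/dns_failover.prom" <<EOF
# HELP dns_failover_events_total Total number of DNS failover events
# TYPE dns_failover_events_total counter
dns_failover_events_total{domain="${domain}",reason="${reason}"} ${count}
EOF
}

emit_failover_metric "api.example.com" "health_check_failure"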

2. Alerting Rules

# Prometheus Alerting Rules (routed via Alertmanager)
groups:
  - name: dns-failover
    rules:
      - alert: DNSFailoverTriggered
        expr: increase(dns_failover_events_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "DNS failover triggered for {{ $labels.domain }}"
          description: "DNS failover event detected for {{ $labels.domain }} due to {{ $labels.reason }}"

      - alert: DNSHealthCheckFailure
        expr: dns_health_check_duration_seconds{status="failure"} > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DNS health check failing for {{ $labels.endpoint }}"
          description: "Health check for {{ $labels.endpoint }} has been failing for more than 2 minutes"

      - alert: DNSResolutionSlow
        expr: dns_resolution_time_seconds > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution is slow for {{ $labels.domain }}"
          description: "DNS resolution time for {{ $labels.domain }} is {{ $value }}s"

3. Dashboard Configuration

{
  "dashboard": {
    "title": "DNS Failover Monitoring",
    "panels": [
      {
        "title": "Failover Events",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(dns_failover_events_total[5m])",
            "legendFormat": "{{ domain }} - {{ reason }}"
          }
        ]
      },
      {
        "title": "Health Check Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=\"dns-health-check\"}",
            "legendFormat": "{{ endpoint }}"
          }
        ]
      },
      {
        "title": "DNS Resolution Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(dns_resolution_time_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      }
    ]
  }
}

Best Practices

1. Health Check Design

# Comprehensive Health Check Endpoint
health_check:
  endpoint: /health
  checks:
    - name: database
      type: connection
      timeout: 2s
    - name: cache
      type: connection
      timeout: 1s
    - name: external_api
      type: http
      url: "https://api.external.com/status"
      timeout: 3s
    - name: disk_space
      type: system
      threshold: 90%
    - name: memory
      type: system
      threshold: 85%

response_format:
  status: "healthy|unhealthy"
  checks:
    - name: "check_name"
      status: "pass|fail"
      duration_ms: 123
      error: "error_message"
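
During an incident, a structured response like this makes it easy to see which dependency is failing. For example, assuming the endpoint returns the JSON shape above and jq is installed:

# Overall status
curl -s http://192.168.1.10/health | jq -r '.status'

# List only the failing checks
curl -s http://192.168.1.10/health | jq '.checks[] | select(.status == "fail")'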

2. DNS TTL Management

# TTL Strategy
ttl_strategy:
  normal: 300     # 5 minutes for normal operation
  failover: 60    # 1 minute during failover
  emergency: 30   # 30 seconds for emergency situations

# Dynamic TTL based on health status
dynamic_ttl:
  healthy: 300
  degraded: 120
  unhealthy: 60
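
The TTL is also the lower bound on how long clients may keep resolving to the old address after a failover, because cached answers count down until they expire. You can watch this directly against a public resolver:

# The TTL column decreases on repeated queries while the answer is cached
dig @8.8.8.8 +noall +answer api.example.com
sleep 10
dig @8.8.8.8 +noall +answer api.example.com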

3. Failover Testing

#!/bin/bash
# failover-test.sh
# Test script for DNS failover scenarios.
# Run as root on the health-checker host: the iptables rule below drops
# packets arriving from the primary, so health checks against it time out.

test_failover() {
    local domain=$1
    local primary_ip=$2
    local backup_ip=$3

    echo "Testing failover for $domain"

    # 1. Verify initial state
    echo "Initial DNS resolution:"
    dig +short "$domain"

    # 2. Simulate primary failure by dropping its traffic locally
    echo "Simulating primary failure..."
    iptables -A INPUT -s "$primary_ip" -j DROP

    # 3. Wait for failover (health check interval x failure threshold, plus TTL)
    echo "Waiting for failover..."
    sleep 60

    # 4. Check DNS resolution
    echo "DNS resolution after failover:"
    dig +short "$domain"

    # 5. Restore primary
    echo "Restoring primary..."
    iptables -D INPUT -s "$primary_ip" -j DROP

    # 6. Wait for recovery
    echo "Waiting for recovery..."
    sleep 60

    # 7. Final check
    echo "Final DNS resolution:"
    dig +short "$domain"
}

# Run tests
test_failover "api.example.com" "192.168.1.10" "192.168.1.11"

Troubleshooting Checklist

1. Pre-Incident Preparation

  • Document all DNS records and their purposes
  • Maintain updated contact information for DNS providers
  • Test failover procedures regularly
  • Monitor DNS resolution from multiple locations (see the loop after this list)
  • Keep health check endpoints simple and reliable
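
A simple way to monitor resolution from multiple vantage points is to query several public resolvers and compare their answers (the resolver list is illustrative):

# Compare answers for the domain across several resolvers
DOMAIN="api.example.com"
for resolver in 8.8.8.8 1.1.1.1 208.67.222.222; do
    answer=$(dig @"$resolver" +short "$DOMAIN" | tr '\n' ' ')
    echo "${resolver}: ${answer:-NO ANSWER}"
done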

2. During Incident Response

  • Verify the scope of the DNS issue
  • Check health of all endpoints
  • Review health check logs and metrics
  • Test DNS resolution from multiple locations
  • Communicate with stakeholders about the issue
  • Document all actions taken

3. Post-Incident Review

  • Analyze root cause of the incident
  • Review failover timing and effectiveness
  • Update monitoring and alerting rules
  • Improve health check endpoints if needed
  • Update runbooks and procedures
  • Conduct lessons learned session

Common Tools and Commands

1. DNS Testing Tools

# Basic DNS resolution
dig api.example.com
nslookup api.example.com

# DNS resolution from specific server
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

# Trace DNS resolution path
dig +trace api.example.com

# Check DNS record types (note: many servers now refuse or minimize ANY queries)
dig api.example.com ANY
dig api.example.com A
dig api.example.com AAAA

# Monitor DNS changes
watch -n 5 'dig +short api.example.com'

2. Health Check Tools

# HTTP health check
curl -I http://192.168.1.10/health
curl -f http://192.168.1.10/health

# TCP connectivity check
telnet 192.168.1.10 80
nc -zv 192.168.1.10 80

# Ping test
ping -c 4 192.168.1.10

# Port scan
nmap -p 80,443 192.168.1.10

3. Monitoring Commands

# Check DNS resolution time
time dig api.example.com

# Monitor network connectivity
mtr 192.168.1.10

# Check routing
traceroute 192.168.1.10

# Monitor DNS queries
tcpdump -i any port 53

# Check system resources
top
htop
iostat

Conclusion

DNS failover is a critical component of high-availability infrastructure. Proper implementation, monitoring, and testing of DNS failover mechanisms can significantly reduce downtime and improve user experience. Regular testing, comprehensive monitoring, and well-documented procedures are essential for maintaining reliable DNS failover systems.

Remember to:

  • Keep health checks simple and reliable
  • Monitor from multiple locations
  • Test failover procedures regularly
  • Maintain clear documentation
  • Have proper alerting in place
  • Review and improve based on incidents