DNS Failover Incident Handling Guide

DNS failover is a critical mechanism that automatically redirects traffic to a backup server when the primary becomes unavailable. This guide covers common DNS failover incidents, troubleshooting steps, and best practices for maintaining high availability.

Overview

DNS failover ensures service continuity by automatically switching traffic to healthy endpoints when primary services fail. It's essential for maintaining uptime and providing seamless user experiences during infrastructure failures.
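
From a client's point of view, a failover is simply a change in the answer a resolver returns. Using the example names and addresses assumed throughout this guide, it can be observed directly with dig:

# Before failover: the record points at the primary
dig +short api.example.com
# -> 192.168.1.10

# After the health check marks the primary down and the record is updated
dig +short api.example.com
# -> 192.168.1.11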

Common DNS Failover Scenarios

1. Primary Server Failure

Scenario: Primary server becomes completely unavailable

Primary: web1.example.com (192.168.1.10) - DOWN
Backup: web2.example.com (192.168.1.11) - UP
Result: Traffic automatically routed to backup

2. Partial Service Degradation

Scenario: Primary server responds but with high latency or errors

Primary: web1.example.com - 5s response time, 50% error rate
Backup: web2.example.com - 200ms response time, 0% error rate
Result: Health checks fail, traffic switches to backup
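
A health check can catch this kind of degradation by measuring response time and status codes rather than bare connectivity. A minimal sketch using curl (the 2-second timeout is illustrative; curl reports 000 when the request fails or times out):

#!/bin/bash
# degradation-check.sh - flag an endpoint as unhealthy if it is slow or erroring
ENDPOINT="http://192.168.1.10/health"
MAX_SECONDS=2

code=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$MAX_SECONDS" "$ENDPOINT")

if [ "$code" = "200" ]; then
    echo "healthy"
else
    # Non-200 response, slow response, or timeout
    echo "unhealthy (status: $code)"
    exit 1
fi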

3. Geographic Failover

Scenario: Regional data center failure

US-East: api-us.example.com   - DOWN (data center outage)
US-West: api-west.example.com - UP
Europe:  api-eu.example.com   - UP
Result:  Traffic distributed to healthy regions

DNS Failover Types

1. Active-Passive Failover

# Active-passive: only one record is published at a time
# Normal operation (primary healthy)
api.example.com. 300 IN A 192.168.1.10

# After failover (primary fails health checks, the record is swapped)
api.example.com. 300 IN A 192.168.1.11

# Health Check Configuration
health_check:
  primary: 192.168.1.10
  backup: 192.168.1.11
  interval: 30s
  timeout: 5s
  failure_threshold: 3

2. Active-Active Failover

# Load Balanced DNS Records
api.example.com. 300 IN A 192.168.1.10
api.example.com. 300 IN A 192.168.1.11
api.example.com. 300 IN A 192.168.1.12

# Health Check Configuration
health_check:
  endpoints:
    - 192.168.1.10
    - 192.168.1.11
    - 192.168.1.12
  interval: 30s
  timeout: 5s
  failure_threshold: 2

3. Weighted Round Robin with Failover

# Weighted DNS Records (weights are applied by the DNS provider, not stored in standard A records)
api.example.com. 300 IN A 192.168.1.10 ; Weight: 70%
api.example.com. 300 IN A 192.168.1.11 ; Weight: 30%

# Health Check Configuration
health_check:
  primary: 192.168.1.10
  backup: 192.168.1.11
  weights:
    primary: 70
    backup: 30
  interval: 30s
  timeout: 5s

Common DNS Failover Incidents

1. False Positive Failover

Symptoms:

  • Traffic switches to backup unnecessarily
  • Primary server is actually healthy
  • Users experience temporary service disruption

Root Causes:

  • Network connectivity issues between health checker and primary
  • Health check endpoint misconfiguration
  • DNS propagation delays
  • Firewall blocking health check traffic

Troubleshooting Steps:

# Check primary server health
curl -I http://192.168.1.10/health
ping 192.168.1.10

# Verify DNS resolution
dig api.example.com
nslookup api.example.com

# Check health check logs
tail -f /var/log/health-checker.log

# Verify network connectivity
traceroute 192.168.1.10
telnet 192.168.1.10 80

Resolution:

  1. Fix network connectivity issues
  2. Adjust health check parameters (for example, require several consecutive failures, as sketched below)
  3. Implement more robust health check endpoints
  4. Add monitoring for health check failures
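
One common adjustment is to require several consecutive failures before acting, so that a single dropped packet or transient network blip does not trigger an unnecessary failover. A minimal sketch (the threshold and interval are illustrative):

# Require N consecutive failures before declaring the primary unhealthy
FAILURE_THRESHOLD=3
failures=0

while true; do
    if curl -sf --max-time 5 "http://192.168.1.10/health" > /dev/null; then
        failures=0
    else
        failures=$((failures + 1))
    fi

    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
        echo "Primary failed $failures consecutive checks - triggering failover"
        # trigger failover here (e.g., update the DNS record)
        break
    fi
    sleep 30
done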

2. Failover Not Triggering

Symptoms:

  • Primary server is down but traffic still routes to it
  • Users cannot access the service
  • Backup server remains unused

Root Causes:

  • Health check service is down
  • DNS TTL too high
  • Health check configuration errors
  • DNS provider issues

Troubleshooting Steps:

# Check health check service status
systemctl status health-checker
ps aux | grep health-checker

# Verify health check configuration
cat /etc/health-checker/config.yaml

# Test health check manually
./health-checker --test --endpoint 192.168.1.10

# Check DNS provider status
curl -s https://api.dns-provider.com/status

# Verify DNS propagation
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

Resolution:

  1. Restart health check service
  2. Fix health check configuration
  3. Reduce DNS TTL for faster propagation (and verify the served TTL, as shown below)
  4. Contact DNS provider if needed
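
After lowering the TTL, confirm what resolvers are actually serving, since the old value keeps being returned from caches until it expires:

# Show the TTL currently returned for the record (second column)
dig +noall +answer api.example.com

# Example output - a 300s TTL still being served despite a lower configured value:
# api.example.com.  300  IN  A  192.168.1.10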

3. Split-Brain Scenario

Symptoms:

  • Traffic distributed between primary and backup
  • Inconsistent user experiences
  • Data synchronization issues

Root Causes:

  • Health check inconsistencies
  • DNS propagation delays
  • Multiple health check systems
  • Network partitioning

Troubleshooting Steps:

# Check health status from multiple locations
curl -I http://192.168.1.10/health
curl -I http://192.168.1.11/health

# Verify DNS resolution from different locations
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com
dig @208.67.222.222 api.example.com

# Check health check logs
tail -f /var/log/health-checker-primary.log
tail -f /var/log/health-checker-backup.log

# Monitor traffic distribution
netstat -an | grep :80 | wc -l

Resolution:

  1. Implement consistent health check logic
  2. Use single source of truth for health status
  3. Add coordination between health check systems
  4. Implement proper quorum mechanisms (see the sketch below)
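
A simple form of quorum is to fail over only when a majority of independent health checkers agree that the primary is down. A minimal sketch, assuming each checker exposes its latest verdict over HTTP at a hypothetical /verdict endpoint:

# Fail over only if a majority of health checkers report the primary as down
CHECKERS=("checker1.example.com" "checker2.example.com" "checker3.example.com")
down_votes=0

for checker in "${CHECKERS[@]}"; do
    # Each checker is assumed to return the string "down" or "up" for the target
    verdict=$(curl -s --max-time 5 "http://${checker}/verdict?target=192.168.1.10")
    if [ "$verdict" = "down" ]; then
        down_votes=$((down_votes + 1))
    fi
done

if [ "$down_votes" -gt $(( ${#CHECKERS[@]} / 2 )) ]; then
    echo "Quorum reached (${down_votes}/${#CHECKERS[@]}) - failing over"
else
    echo "No quorum - keeping primary"
fi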

DNS Failover Implementation

1. Cloud Provider Solutions

AWS Route 53 Health Checks

# Route 53 Health Check Configuration
health_check:
  type: HTTP
  resource_path: /health
  port: 80
  protocol: HTTP
  request_interval: 30
  failure_threshold: 3
  measure_latency: true
  enable_sni: false
  regions:
    - us-east-1
    - us-west-2
    - eu-west-1

# DNS Record with Failover
record:
  name: api.example.com
  type: A
  failover: PRIMARY
  health_check_id: "12345678-1234-1234-1234-123456789012"
  value: 192.168.1.10
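
If you manage Route 53 from the CLI, an equivalent health check can be created roughly as follows (the values are illustrative; consult the current aws route53 documentation for the full set of configuration fields):

# Create an HTTP health check against the primary endpoint
aws route53 create-health-check \
  --caller-reference "api-health-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "192.168.1.10",
    "Port": 80,
    "Type": "HTTP",
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'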

Google Cloud DNS Health Checks

# Cloud DNS Health Check
health_check:
  name: api-health-check
  type: HTTP
  port: 80
  path: /health
  interval: 30s
  timeout: 5s
  unhealthy_threshold: 3
  healthy_threshold: 2

# DNS Policy with Failover
dns_policy:
  name: api-failover-policy
  primary:
    - target: 192.168.1.10
      health_check: api-health-check
  backup:
    - target: 192.168.1.11

2. Self-Hosted Solutions

PowerDNS with Health Checks

# PowerDNS Configuration
pdns:
  health_check:
    enabled: true
    interval: 30
    timeout: 5
    retries: 3
    backend: "pipe"
    command: "/usr/local/bin/health-checker"

  records:
    - name: api.example.com
      type: A
      ttl: 300
      primary: 192.168.1.10
      backup: 192.168.1.11

Custom Health Check Script

#!/bin/bash
# health-checker.sh
# Simple active-passive failover: point the A record at whichever server
# passes its HTTP health check, preferring the primary.

PRIMARY_IP="192.168.1.10"
BACKUP_IP="192.168.1.11"
ZONE="example.com"
RECORD="api"          # api.example.com
TTL=300

check_health() {
    local ip=$1
    local response
    response=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "http://${ip}/health")

    [ "$response" = "200" ]
}

update_dns() {
    local active_ip=$1

    # Point the A record at the healthy server. The command below assumes a
    # PowerDNS authoritative server managed with pdnsutil; substitute the
    # update command used by your own DNS tooling.
    pdnsutil replace-rrset "$ZONE" "$RECORD" A "$TTL" "$active_ip"
}

# Main health check logic: prefer the primary, fall back to the backup
if check_health "$PRIMARY_IP"; then
    echo "Primary server is healthy"
    update_dns "$PRIMARY_IP"
else
    echo "Primary server is unhealthy, checking backup"
    if check_health "$BACKUP_IP"; then
        echo "Switching to backup server"
        update_dns "$BACKUP_IP"
    else
        echo "Both servers are unhealthy"
        exit 1
    fi
fi
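
To keep the record current, the script can be run on a schedule, for example from cron on the health-checker host (the path, user, and one-minute interval are assumptions to adapt to your environment):

# /etc/cron.d/health-checker - run the failover check every minute
* * * * * root /usr/local/bin/health-checker.sh >> /var/log/health-checker.log 2>&1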

Monitoring and Alerting

1. Key Metrics to Monitor

# Prometheus Metrics
metrics:
  - name: dns_failover_events_total
    type: counter
    description: "Total number of DNS failover events"
    labels: [domain, reason]

  - name: dns_health_check_duration_seconds
    type: histogram
    description: "Duration of DNS health checks"
    labels: [endpoint, status]

  - name: dns_resolution_time_seconds
    type: histogram
    description: "DNS resolution time"
    labels: [domain, resolver]

  - name: dns_failover_duration_seconds
    type: histogram
    description: "Time taken to complete failover"
    labels: [domain]
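
One lightweight way to emit these metrics from a shell-based health checker is the node_exporter textfile collector, which scrapes Prometheus-format .prom files from a directory. A minimal sketch (the directory and state-file paths are assumptions and must match your node_exporter --collector.textfile.directory setting):

# Write the running total of failover events in Prometheus text format
METRICS_DIR="/var/lib/node_exporter/textfile_collector"   # assumed path
COUNT_FILE="/var/lib/health-checker/failover_count"       # assumed state file

emit_failover_metric() {
    local domain=$1
    local reason=$2
    local count
    count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$COUNT_FILE"

    cat > "${METRICS_DIR}/dns_failover.prom" <<EOF
# HELP dns_failover_events_total Total number of DNS failover events
# TYPE dns_failover_events_total counter
dns_failover_events_total{domain="${domain}",reason="${reason}"} ${count}
EOF
}

emit_failover_metric "api.example.com" "health_check_failure"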

2. Alerting Rules

# Prometheus Alerting Rules (routed via Alertmanager)
groups:
  - name: dns-failover
    rules:
      - alert: DNSFailoverTriggered
        expr: increase(dns_failover_events_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "DNS failover triggered for {{ $labels.domain }}"
          description: "DNS failover event detected for {{ $labels.domain }} due to {{ $labels.reason }}"

      - alert: DNSHealthCheckFailure
        expr: dns_health_check_duration_seconds{status="failure"} > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DNS health check failing for {{ $labels.endpoint }}"
          description: "Health check for {{ $labels.endpoint }} has been failing for more than 2 minutes"

      - alert: DNSResolutionSlow
        expr: dns_resolution_time_seconds > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution is slow for {{ $labels.domain }}"
          description: "DNS resolution time for {{ $labels.domain }} is {{ $value }}s"

3. Dashboard Configuration

{
  "dashboard": {
    "title": "DNS Failover Monitoring",
    "panels": [
      {
        "title": "Failover Events",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(dns_failover_events_total[5m])",
            "legendFormat": "{{ domain }} - {{ reason }}"
          }
        ]
      },
      {
        "title": "Health Check Status",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=\"dns-health-check\"}",
            "legendFormat": "{{ endpoint }}"
          }
        ]
      },
      {
        "title": "DNS Resolution Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(dns_resolution_time_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      }
    ]
  }
}

Best Practices

1. Health Check Design

# Comprehensive Health Check Endpoint
health_check:
  endpoint: /health
  checks:
    - name: database
      type: connection
      timeout: 2s
    - name: cache
      type: connection
      timeout: 1s
    - name: external_api
      type: http
      url: "https://api.external.com/status"
      timeout: 3s
    - name: disk_space
      type: system
      threshold: 90%
    - name: memory
      type: system
      threshold: 85%

response_format:
  status: "healthy|unhealthy"
  checks:
    - name: "check_name"
      status: "pass|fail"
      duration_ms: 123
      error: "error_message"
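
During an incident, a structured response like this makes it easy to see which dependency is failing. For example, assuming the endpoint returns the JSON shape above and jq is installed:

# Overall status
curl -s http://192.168.1.10/health | jq -r '.status'

# List only the failing checks
curl -s http://192.168.1.10/health | jq '.checks[] | select(.status == "fail")'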

2. DNS TTL Management

# TTL Strategy
ttl_strategy:
  normal: 300     # 5 minutes for normal operation
  failover: 60    # 1 minute during failover
  emergency: 30   # 30 seconds for emergency situations

# Dynamic TTL based on health status
dynamic_ttl:
  healthy: 300
  degraded: 120
  unhealthy: 60
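
The TTL is also the lower bound on how long clients may keep resolving to the old address after a failover, because cached answers count down until they expire. You can watch this directly against a public resolver:

# The TTL column decreases on repeated queries while the answer is cached
dig @8.8.8.8 +noall +answer api.example.com
sleep 10
dig @8.8.8.8 +noall +answer api.example.com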

3. Failover Testing

#!/bin/bash
# failover-test.sh
# Test script for DNS failover scenarios.
# Run as root on the health-checker host: the iptables rule below drops
# packets arriving from the primary, so health checks against it time out.

test_failover() {
    local domain=$1
    local primary_ip=$2
    local backup_ip=$3

    echo "Testing failover for $domain"

    # 1. Verify initial state
    echo "Initial DNS resolution:"
    dig +short "$domain"

    # 2. Simulate primary failure by dropping its traffic locally
    echo "Simulating primary failure..."
    iptables -A INPUT -s "$primary_ip" -j DROP

    # 3. Wait for failover (health check interval x failure threshold, plus TTL)
    echo "Waiting for failover..."
    sleep 60

    # 4. Check DNS resolution
    echo "DNS resolution after failover:"
    dig +short "$domain"

    # 5. Restore primary
    echo "Restoring primary..."
    iptables -D INPUT -s "$primary_ip" -j DROP

    # 6. Wait for recovery
    echo "Waiting for recovery..."
    sleep 60

    # 7. Final check
    echo "Final DNS resolution:"
    dig +short "$domain"
}

# Run tests
test_failover "api.example.com" "192.168.1.10" "192.168.1.11"

Troubleshooting Checklist

1. Pre-Incident Preparation

  • Document all DNS records and their purposes
  • Maintain updated contact information for DNS providers
  • Test failover procedures regularly
  • Monitor DNS resolution from multiple locations (see the loop after this list)
  • Keep health check endpoints simple and reliable
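
A simple way to monitor resolution from multiple vantage points is to query several public resolvers and compare their answers (the resolver list is illustrative):

# Compare answers for the domain across several resolvers
DOMAIN="api.example.com"
for resolver in 8.8.8.8 1.1.1.1 208.67.222.222; do
    answer=$(dig @"$resolver" +short "$DOMAIN" | tr '\n' ' ')
    echo "${resolver}: ${answer:-NO ANSWER}"
done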

2. During Incident Response

  • Verify the scope of the DNS issue
  • Check health of all endpoints
  • Review health check logs and metrics
  • Test DNS resolution from multiple locations
  • Communicate with stakeholders about the issue
  • Document all actions taken

3. Post-Incident Review

  • Analyze root cause of the incident
  • Review failover timing and effectiveness
  • Update monitoring and alerting rules
  • Improve health check endpoints if needed
  • Update runbooks and procedures
  • Conduct lessons learned session

Common Tools and Commands

1. DNS Testing Tools

# Basic DNS resolution
dig api.example.com
nslookup api.example.com

# DNS resolution from specific server
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

# Trace DNS resolution path
dig +trace api.example.com

# Check DNS record types (note: many servers now refuse or minimize ANY queries)
dig api.example.com ANY
dig api.example.com A
dig api.example.com AAAA

# Monitor DNS changes
watch -n 5 'dig +short api.example.com'

2. Health Check Tools

# HTTP health check
curl -I http://192.168.1.10/health
curl -f http://192.168.1.10/health

# TCP connectivity check
telnet 192.168.1.10 80
nc -zv 192.168.1.10 80

# Ping test
ping -c 4 192.168.1.10

# Port scan
nmap -p 80,443 192.168.1.10

3. Monitoring Commands

# Check DNS resolution time
time dig api.example.com

# Monitor network connectivity
mtr 192.168.1.10

# Check routing
traceroute 192.168.1.10

# Monitor DNS queries
tcpdump -i any port 53

# Check system resources
top
htop
iostat

Conclusion

DNS failover is a critical component of high-availability infrastructure. Proper implementation, monitoring, and testing of DNS failover mechanisms can significantly reduce downtime and improve user experience. Regular testing, comprehensive monitoring, and well-documented procedures are essential for maintaining reliable DNS failover systems.

Remember to:

  • Keep health checks simple and reliable
  • Monitor from multiple locations
  • Test failover procedures regularly
  • Maintain clear documentation
  • Have proper alerting in place
  • Review and improve based on incidents