Database Downtime Incidents & DR Plan: Master-Slave Switch
Overview
Database downtime incidents are among the most critical production issues, often causing complete service unavailability. This document covers database downtime scenarios, incident response procedures, and disaster recovery plans specifically for master-slave database architectures.
Common Database Downtime Scenarios
1. Hardware Failures
- Disk failures: Storage corruption, disk space exhaustion
- Memory issues: RAM failures, memory leaks
- Network problems: Network interface failures, connectivity issues
- Power outages: UPS failures, power supply issues
2. Software Issues
- Database crashes: Segmentation faults, assertion failures
- Deadlocks: Long-running transactions blocking the system
- Resource exhaustion: CPU, memory, or I/O saturation
- Configuration errors: Incorrect settings causing instability
3. Operational Issues
- Maintenance gone wrong: Failed upgrades, schema changes
- Human errors: Accidental data deletion, wrong commands
- Backup failures: Corrupted backups, failed restore operations
- Security incidents: Unauthorized access, data breaches
4. External Dependencies
- Cloud provider outages: AWS RDS, Azure SQL, GCP Cloud SQL
- Network connectivity: ISP issues, DNS problems
- Third-party services: Monitoring, backup services down
Master-Slave Architecture Overview
Architecture Components
┌─────────────────┐       ┌─────────────────┐
│   Application   │       │   Application   │
│      Layer      │       │      Layer      │
└────────┬────────┘       └────────┬────────┘
         │                         │
┌────────▼────────┐       ┌────────▼────────┐
│  Load Balancer  │       │  Load Balancer  │
│     / Proxy     │       │     / Proxy     │
└────────┬────────┘       └────────┬────────┘
         │                         │
         └────────────┬────────────┘
                      │
             ┌────────▼────────┐
             │    Master DB    │
             │  (Read/Write)   │
             └────────┬────────┘
                      │
             ┌────────▼────────┐
             │    Slave DB     │
             │   (Read Only)   │
             └─────────────────┘
Replication Types
- Synchronous Replication: The master waits for the slave to confirm the write before acknowledging the commit (strongest consistency, highest latency)
- Asynchronous Replication: The master commits immediately and ships changes afterwards (lowest latency, but writes can be lost on failover)
- Semi-synchronous Replication: The master waits for at least one slave to acknowledge receipt of the change; a middle ground (a setup sketch follows below)
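For MySQL, semi-synchronous replication is enabled through the semisync plugins. A minimal sketch using the classic plugin and variable names (MySQL 5.7 / early 8.0); the hostnames and the 10-second timeout are assumptions:
# On the master: load the plugin, enable it, and cap how long a commit waits for a slave ack
mysql -h db-master -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';"
mysql -h db-master -e "SET GLOBAL rpl_semi_sync_master_enabled = 1;"
mysql -h db-master -e "SET GLOBAL rpl_semi_sync_master_timeout = 10000;"  # milliseconds
# On the slave: load and enable the plugin, then restart the IO thread so it takes effect
mysql -h db-slave -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';"
mysql -h db-slave -e "SET GLOBAL rpl_semi_sync_slave_enabled = 1;"
mysql -h db-slave -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"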
Incident Detection and Alerting
1. Health Check Endpoints
# Database connectivity check (MySQL does not serve HTTP on 3306; use mysqladmin ping)
mysqladmin -h db-master ping || echo "Master DB down"
# Replication lag check (run against the slave, which reports Seconds_Behind_Master)
mysql -h db-slave -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
# Application-level checks
curl -f http://app:8080/health/db || echo "DB health check failed"
2. Monitoring Metrics
# Key metrics to monitor
database_metrics:
- connection_count
- query_response_time
- replication_lag
- disk_usage
- memory_usage
- cpu_usage
- deadlock_count
- slow_query_count
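If these metrics are scraped with Prometheus mysqld_exporter, they can also be spot-checked directly from the exporter's endpoint. A rough sketch; the default exporter port 9104 and the exact metric names depend on the exporter version deployed:
# Availability, connection count, and slow queries on the master
curl -s http://db-master:9104/metrics | grep -E "mysql_up|mysql_global_status_threads_connected|mysql_global_status_slow_queries"
# Replication lag as seen by the slave's exporter
curl -s http://db-slave:9104/metrics | grep mysql_slave_status_seconds_behind_master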
3. Alerting Rules
# Example Prometheus alerting rules
groups:
  - name: database.rules
    rules:
      - alert: DatabaseDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database instance is down"
      - alert: HighReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 60  # metric name as exposed by mysqld_exporter
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag detected"
Disaster Recovery Plan: Master-Slave Switch
Phase 1: Immediate Response (0-5 minutes)
1.1 Incident Confirmation
# Verify master database status
mysql -h db-master -e "SELECT 1" || echo "Master DB confirmed down"
# Check application impact
curl -f http://app:8080/health || echo "Application affected"
# Verify slave database status
mysql -h db-slave -e "SELECT 1" || echo "Slave DB also unreachable"
1.2 Communication
- Internal: Notify on-call team, engineering leads
- External: Update status page, notify customers if needed
- Documentation: Start incident timeline
1.3 Initial Assessment
# Check replication status
mysql -h db-slave -e "SHOW SLAVE STATUS\G"
# Verify data consistency
mysql -h db-slave -e "SELECT COUNT(*) FROM critical_table"
# Check application logs
tail -f /var/log/app/error.log | grep -i "database\|connection"
Phase 2: Failover Decision (5-15 minutes)
2.1 Failover Criteria
Immediate Failover Required:
- Master database completely unreachable
- Data corruption detected
- Security breach confirmed
- Hardware failure confirmed
Delayed Failover (Investigate First):
- Network connectivity issues
- High load causing timeouts
- Replication lag issues
- Performance degradation
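A minimal triage sketch that turns these criteria into quick checks before deciding; the hostnames are the same assumed ones used throughout, and the output is informational rather than a policy engine:
#!/bin/bash
# triage.sh - gather the facts needed for the failover decision
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
if mysqladmin -h "$MASTER_HOST" ping > /dev/null 2>&1; then
  echo "Master is reachable - investigate load/network/replication before failing over"
else
  echo "Master is unreachable - candidate for immediate failover"
fi
# Seconds_Behind_Master = NULL means replication is not running at all
mysql -h "$SLAVE_HOST" -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"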
2.2 Pre-Failover Checklist
# Verify slave is up-to-date
mysql -h db-slave -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
# Check for any running transactions
mysql -h db-slave -e "SHOW PROCESSLIST"
# Verify backup availability
ls -la /backup/latest/
# Confirm application can connect to slave
mysql -h db-slave -e "SELECT 1"
Phase 3: Failover Execution (15-30 minutes)
3.1 Stop Application Traffic
# Update load balancer to stop routing to master
# Method depends on your load balancer (HAProxy, Nginx, AWS ALB, etc.)
# Example for HAProxy
echo "disable server db-master/db-master" | socat stdio /var/run/haproxy/admin.sock
# Example for Nginx
# Comment out master in upstream configuration
# Reload nginx: nginx -s reload
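To confirm the load balancer really stopped routing writes to the master, the HAProxy stats socket can be queried after the disable command; the socket path and server name mirror the HAProxy example above and are assumptions:
# db-master should now report MAINT in the status column
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep db-master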
3.2 Promote Slave to Master
-- Stop replication on the slave
STOP SLAVE;
-- Clear the replication configuration
RESET SLAVE ALL;
-- Allow writes (on MySQL 5.7+ super_read_only may also be enabled)
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;
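A quick confirmation that the promotion took effect on the newly writable server (super_read_only requires MySQL 5.7+, and SHOW MASTER STATUS assumes binary logging is enabled there, which it must be if a new slave will later replicate from it):
# Both read-only flags should be 0 and the server should report its own binlog position
mysql -h db-slave -e "SELECT @@global.read_only, @@global.super_read_only;"
mysql -h db-slave -e "SHOW MASTER STATUS\G"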
3.3 Update Application Configuration
# Update database connection strings
# Method depends on your configuration management
# Example: update environment variables (note: exports only affect the current shell;
# a durable approach is sketched below)
export DB_HOST=db-slave
export DB_MASTER_HOST=db-slave
# Restart application services
systemctl restart app-service
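Because exported variables only live in the current shell, a more durable sketch is to rewrite the service's environment file before restarting; the path /etc/app/app.env and the DB_HOST/DB_MASTER_HOST keys are hypothetical placeholders for whatever your configuration management uses:
# Persist the new database host, then restart so the service picks it up
sed -i 's/^DB_HOST=.*/DB_HOST=db-slave/' /etc/app/app.env          # placeholder path
sed -i 's/^DB_MASTER_HOST=.*/DB_MASTER_HOST=db-slave/' /etc/app/app.env
systemctl restart app-service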
3.4 Verify Failover
# Test database connectivity
mysql -h db-slave -e "SELECT 1"
# Test application functionality
curl -f http://app:8080/health/db
# Verify critical operations
curl -X POST http://app:8080/api/test-write
curl -f http://app:8080/api/test-read
Phase 4: Post-Failover (30-60 minutes)
4.1 Monitor System Health
# Monitor application performance
watch -n 5 'curl -s http://app:8080/metrics | grep db_'
# Check database performance
mysql -h db-slave -e "SHOW PROCESSLIST"
mysql -h db-slave -e "SHOW ENGINE INNODB STATUS\G"
4.2 Data Consistency Check
-- Compare critical table counts
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM transactions;
-- Check for any data integrity issues
SELECT * FROM users WHERE created_at > NOW() - INTERVAL 1 HOUR;
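Row counts only catch gross divergence; MySQL's CHECKSUM TABLE gives a slightly stronger spot check when the same statement is run against both the new master and a restored copy of the old one (table names follow the examples above):
# Compare the checksums reported here against the same command run on the other copy
mysql -h db-slave -e "CHECKSUM TABLE users, orders, transactions;"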
4.3 Communication Updates
- Update status page with progress
- Notify stakeholders of successful failover
- Document lessons learned
Phase 5: Recovery Planning (1-24 hours)
5.1 Root Cause Analysis
# Collect logs from failed master
scp db-master:/var/log/mysql/error.log ./master-error.log
# Analyze system logs
journalctl -u mysql --since "1 hour ago" > mysql-system.log
# Check hardware logs
dmesg | grep -i error > hardware-errors.log
5.2 Recovery Options
Option 1: Repair and Rebuild Master
# If hardware issue, replace hardware
# If software issue, reinstall and restore from backup
# Reconfigure as new slave
Option 2: Keep Current Setup
# If failover was successful and stable
# Keep current slave as new master
# Set up new slave from current master
5.3 Setup New Slave
# Create backup of current master (former slave), recording binlog coordinates for replication
mysqldump --single-transaction --master-data=2 --routines --triggers --all-databases > master-backup.sql
# Set up new slave server
# Restore backup to new slave
mysql < master-backup.sql
# Configure replication (take MASTER_LOG_FILE / MASTER_LOG_POS from the commented
# CHANGE MASTER TO line that --master-data=2 wrote into master-backup.sql)
CHANGE MASTER TO
MASTER_HOST='current-master',
MASTER_USER='replication_user',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=12345;
START SLAVE;
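If GTID-based replication is in use (gtid_mode=ON on both servers), explicit binlog coordinates can be replaced with auto-positioning; a sketch assuming the same replication account as above:
# GTID auto-positioning removes the need to copy file/position from the dump
mysql -e "CHANGE MASTER TO MASTER_HOST='current-master', MASTER_USER='replication_user', MASTER_PASSWORD='password', MASTER_AUTO_POSITION=1; START SLAVE;"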
Automation Scripts
1. Failover Script
#!/bin/bash
# failover.sh - Automated master-slave failover
set -e
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
APP_HOST="app-server"
echo "Starting failover process..."
# Check slave status
echo "Checking slave status..."
mysql -h $SLAVE_HOST -e "SHOW SLAVE STATUS\G" | grep -E "(Slave_IO_Running|Slave_SQL_Running)"
# Stop replication
echo "Stopping replication..."
mysql -h $SLAVE_HOST -e "STOP SLAVE;"
# Promote slave to master
echo "Promoting slave to master..."
mysql -h $SLAVE_HOST -e "RESET SLAVE ALL;"
mysql -h $SLAVE_HOST -e "SET GLOBAL read_only = OFF;"
# Update application configuration
echo "Updating application configuration..."
# NOTE: a plain "export" over ssh does not affect a systemd-managed service; persist
# the new DB host via the service's environment file or config management first
ssh $APP_HOST "sudo sed -i 's/^DB_HOST=.*/DB_HOST=$SLAVE_HOST/' /etc/app/app.env && sudo systemctl restart app-service"  # /etc/app/app.env is a placeholder path
# Verify failover
echo "Verifying failover..."
sleep 10
curl -f http://$APP_HOST:8080/health/db || exit 1
echo "Failover completed successfully!"
2. Health Check Script
#!/bin/bash
# health-check.sh - Database health monitoring
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
check_database() {
  local host=$1
  local name=$2
  echo "Checking $name ($host)..."
  if mysql -h "$host" -e "SELECT 1" > /dev/null 2>&1; then
    echo "✓ $name is accessible"
    # Check replication lag for the slave
    if [ "$name" = "Slave" ]; then
      lag=$(mysql -h "$host" -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master | awk '{print $2}')
      if [ "$lag" = "NULL" ] || [ -z "$lag" ]; then
        echo "⚠ Replication is not running (Seconds_Behind_Master is NULL)"
      elif [ "$lag" -lt 60 ]; then
        echo "✓ Replication lag: ${lag}s"
      else
        echo "⚠ High replication lag: ${lag}s"
      fi
    fi
  else
    echo "✗ $name is not accessible"
    return 1
  fi
}
check_database $MASTER_HOST "Master"
check_database $SLAVE_HOST "Slave"
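A common way to run this check continuously is from cron; the schedule, script path, and log file below are assumptions, and a non-zero exit can be wired into the alerting path:
# /etc/cron.d/db-health-check - run every minute and keep a log for the on-call engineer
* * * * * root /usr/local/bin/health-check.sh >> /var/log/db-health-check.log 2>&1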
Testing and Validation
1. Failover Testing
# Schedule regular failover tests during maintenance windows
# Simulate master failure (run on the master host to block MySQL traffic)
iptables -A INPUT -p tcp --dport 3306 -j DROP
# Execute failover procedure
./failover.sh
# Verify application functionality
./test-application.sh
# Restore master connectivity (remove the blocking rule)
iptables -D INPUT -p tcp --dport 3306 -j DROP
2. Load Testing
# Test application under load after failover
ab -n 10000 -c 100 http://app:8080/api/read-heavy-endpoint
ab -n 1000 -c 10 -p data.json -T application/json http://app:8080/api/write-endpoint
3. Data Integrity Testing
-- Compare data between master and slave
-- (assumes the slave's tables are reachable from this session, e.g. via a FEDERATED
--  schema named "slave"; otherwise run the counts on each host separately and compare)
SELECT
  'users' AS table_name,
  (SELECT COUNT(*) FROM users) AS master_count,
  (SELECT COUNT(*) FROM slave.users) AS slave_count
UNION ALL
SELECT
  'orders' AS table_name,
  (SELECT COUNT(*) FROM orders) AS master_count,
  (SELECT COUNT(*) FROM slave.orders) AS slave_count;
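For verification deeper than row counts, pt-table-checksum from Percona Toolkit checksums tables chunk-by-chunk on the master and compares the results on connected slaves; a hedged sketch, where the database name, checksum user, and password are placeholders:
# Writes per-chunk checksums into percona.checksums on the master and reports any DIFFS found on slaves
pt-table-checksum --replicate=percona.checksums --databases=app_db h=db-master,u=checksum_user,p=secret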
Monitoring and Alerting
1. Key Metrics
# Database metrics to monitor
database_health:
- connection_count
- query_response_time
- replication_lag
- disk_usage
- memory_usage
- deadlock_count
- slow_query_count
- backup_status
- failover_readiness
2. Alerting Rules
# Critical alerts
critical_alerts:
- database_down
- replication_stopped
- high_replication_lag
- disk_space_critical
- memory_exhaustion
# Warning alerts
warning_alerts:
- slow_queries_increase
- connection_count_high
- backup_failed
- disk_usage_high
Best Practices
1. Prevention
- Regular backups: Automated, tested backups
- Monitoring: Comprehensive database monitoring
- Capacity planning: Proactive resource management
- Security: Regular security updates and audits
- Documentation: Keep runbooks updated
2. Preparation
- Regular testing: Monthly failover drills
- Team training: Ensure team knows procedures
- Communication plans: Clear escalation procedures
- Recovery procedures: Documented and tested
3. During Incident
- Stay calm: Follow established procedures
- Communicate: Keep stakeholders informed
- Document: Record all actions taken
- Verify: Test thoroughly before declaring resolved
4. Post-Incident
- Root cause analysis: Understand what happened
- Process improvement: Update procedures based on learnings
- Team review: Conduct post-mortem meeting
- Prevention: Implement measures to prevent recurrence
Tools and Technologies
1. Database Management
- MySQL: Native replication, MySQL Router
- PostgreSQL: Streaming replication, pgpool-II
- MongoDB: Replica sets, sharding
- Redis: Sentinel, Cluster mode
2. Monitoring Tools
- Prometheus + Grafana: Metrics and visualization
- Datadog: APM and infrastructure monitoring
- New Relic: Application performance monitoring
- PMM: Percona Monitoring and Management
3. Automation Tools
- Ansible: Configuration management
- Terraform: Infrastructure as code
- Kubernetes: Container orchestration
- Consul: Service discovery and health checks
Conclusion
Database downtime incidents require swift, coordinated response to minimize business impact. A well-prepared disaster recovery plan with automated failover procedures can significantly reduce recovery time and ensure business continuity.
Key success factors:
- Preparation: Regular testing and team training
- Automation: Automated failover and health checks
- Monitoring: Comprehensive observability
- Communication: Clear incident response procedures
- Continuous improvement: Learning from each incident
Remember: The goal is not just to recover from incidents, but to prevent them and minimize their impact when they do occur.