Database Downtime Incidents & DR Plan: Master-Slave Switch
Overview
Database downtime incidents are among the most critical production issues, often causing complete service unavailability. This document covers database downtime scenarios, incident response procedures, and disaster recovery plans specifically for master-slave database architectures.
Common Database Downtime Scenarios
1. Hardware Failures
- Disk failures: Storage corruption, disk space exhaustion
- Memory issues: RAM failures, memory leaks
- Network problems: Network interface failures, connectivity issues
- Power outages: UPS failures, power supply issues
2. Software Issues
- Database crashes: Segmentation faults, assertion failures
- Deadlocks: Long-running transactions blocking the system
- Resource exhaustion: CPU, memory, or I/O saturation
- Configuration errors: Incorrect settings causing instability
3. Operational Issues
- Maintenance gone wrong: Failed upgrades, schema changes
- Human errors: Accidental data deletion, wrong commands
- Backup failures: Corrupted backups, failed restore operations
- Security incidents: Unauthorized access, data breaches
4. External Dependencies
- Cloud provider outages: AWS RDS, Azure SQL, GCP Cloud SQL
- Network connectivity: ISP issues, DNS problems
- Third-party services: Monitoring, backup services down
Master-Slave Architecture Overview
Architecture Components
┌─────────────────┐       ┌─────────────────┐
│   Application   │       │   Application   │
│      Layer      │       │      Layer      │
└────────┬────────┘       └────────┬────────┘
         │                         │
┌────────▼────────┐       ┌────────▼────────┐
│  Load Balancer  │       │  Load Balancer  │
│     / Proxy     │       │     / Proxy     │
└────────┬────────┘       └────────┬────────┘
         │                         │
         └────────────┬────────────┘
                      │
             ┌────────▼────────┐
             │    Master DB    │
             │  (Read/Write)   │
             └────────┬────────┘
                      │
             ┌────────▼────────┐
             │    Slave DB     │
             │   (Read Only)   │
             └─────────────────┘
Replication Types
- Synchronous Replication: The master waits for the slave to confirm the write before acknowledging the commit (strongest consistency, highest latency)
- Asynchronous Replication: The master commits immediately and ships changes afterwards (lowest latency, but writes can be lost on failover)
- Semi-synchronous Replication: The master waits for at least one slave to acknowledge receipt of the change; a middle ground (a setup sketch follows below)
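For MySQL, semi-synchronous replication is enabled through the semisync plugins. A minimal sketch using the classic plugin and variable names (MySQL 5.7 / early 8.0); the hostnames and the 10-second timeout are assumptions:
# On the master: load the plugin, enable it, and cap how long a commit waits for a slave ack
mysql -h db-master -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';"
mysql -h db-master -e "SET GLOBAL rpl_semi_sync_master_enabled = 1;"
mysql -h db-master -e "SET GLOBAL rpl_semi_sync_master_timeout = 10000;"  # milliseconds
# On the slave: load and enable the plugin, then restart the IO thread so it takes effect
mysql -h db-slave -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';"
mysql -h db-slave -e "SET GLOBAL rpl_semi_sync_slave_enabled = 1;"
mysql -h db-slave -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"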
Incident Detection and Alerting
1. Health Check Endpoints
# Database connectivity check (MySQL does not serve HTTP on 3306; use mysqladmin ping)
mysqladmin -h db-master ping || echo "Master DB down"
# Replication lag check (run against the slave, which reports Seconds_Behind_Master)
mysql -h db-slave -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
# Application-level checks
curl -f http://app:8080/health/db || echo "DB health check failed"
2. Monitoring Metrics
# Key metrics to monitor
database_metrics:
- connection_count
- query_response_time
- replication_lag
- disk_usage
- memory_usage
- cpu_usage
- deadlock_count
- slow_query_count
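If these metrics are scraped with Prometheus mysqld_exporter, they can also be spot-checked directly from the exporter's endpoint. A rough sketch; the default exporter port 9104 and the exact metric names depend on the exporter version deployed:
# Availability, connection count, and slow queries on the master
curl -s http://db-master:9104/metrics | grep -E "mysql_up|mysql_global_status_threads_connected|mysql_global_status_slow_queries"
# Replication lag as seen by the slave's exporter
curl -s http://db-slave:9104/metrics | grep mysql_slave_status_seconds_behind_master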
3. Alerting Rules
# Example Prometheus alerting rules
groups:
  - name: database.rules
    rules:
      - alert: DatabaseDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database instance is down"
      - alert: HighReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 60  # metric name as exposed by mysqld_exporter
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag detected"
Disaster Recovery Plan: Master-Slave Switch
Phase 1: Immediate Response (0-5 minutes)
1.1 Incident Confirmation
# Verify master database status
mysql -h db-master -e "SELECT 1" || echo "Master DB confirmed down"
# Check application impact
curl -f http://app:8080/health || echo "Application affected"
# Verify slave database status
mysql -h db-slave -e "SELECT 1" || echo "Slave DB also unreachable"
1.2 Communication
- Internal: Notify on-call team, engineering leads
- External: Update status page, notify customers if needed
- Documentation: Start incident timeline
1.3 Initial Assessment
# Check replication status
mysql -h db-slave -e "SHOW SLAVE STATUS\G"
# Verify data consistency
mysql -h db-slave -e "SELECT COUNT(*) FROM critical_table"
# Check application logs
tail -f /var/log/app/error.log | grep -i "database\|connection"
Phase 2: Failover Decision (5-15 minutes)
2.1 Failover Criteria
Immediate Failover Required:
- Master database completely unreachable
- Data corruption detected
- Security breach confirmed
- Hardware failure confirmed
Delayed Failover (Investigate First):
- Network connectivity issues
- High load causing timeouts
- Replication lag issues
- Performance degradation
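A minimal triage sketch that turns these criteria into quick checks before deciding; the hostnames are the same assumed ones used throughout, and the output is informational rather than a policy engine:
#!/bin/bash
# triage.sh - gather the facts needed for the failover decision
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
if mysqladmin -h "$MASTER_HOST" ping > /dev/null 2>&1; then
  echo "Master is reachable - investigate load/network/replication before failing over"
else
  echo "Master is unreachable - candidate for immediate failover"
fi
# Seconds_Behind_Master = NULL means replication is not running at all
mysql -h "$SLAVE_HOST" -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"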
2.2 Pre-Failover Checklist
# Verify slave is up-to-date
mysql -h db-slave -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
# Check for any running transactions
mysql -h db-slave -e "SHOW PROCESSLIST"
# Verify backup availability
ls -la /backup/latest/
# Confirm application can connect to slave
mysql -h db-slave -e "SELECT 1"
Phase 3: Failover Execution (15-30 minutes)
3.1 Stop Application Traffic
# Update load balancer to stop routing to master
# Method depends on your load balancer (HAProxy, Nginx, AWS ALB, etc.)
# Example for HAProxy
echo "disable server db-master/db-master" | socat stdio /var/run/haproxy/admin.sock
# Example for Nginx
# Comment out master in upstream configuration
# Reload nginx: nginx -s reload
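To confirm the load balancer really stopped routing writes to the master, the HAProxy stats socket can be queried after the disable command; the socket path and server name mirror the HAProxy example above and are assumptions:
# db-master should now report MAINT in the status column
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep db-master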
3.2 Promote Slave to Master
-- Stop replication on the slave
STOP SLAVE;
-- Clear the replication configuration
RESET SLAVE ALL;
-- Allow writes (on MySQL 5.7+ super_read_only may also be enabled)
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;
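A quick confirmation that the promotion took effect on the newly writable server (super_read_only requires MySQL 5.7+, and SHOW MASTER STATUS assumes binary logging is enabled there, which it must be if a new slave will later replicate from it):
# Both read-only flags should be 0 and the server should report its own binlog position
mysql -h db-slave -e "SELECT @@global.read_only, @@global.super_read_only;"
mysql -h db-slave -e "SHOW MASTER STATUS\G"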
3.3 Update Application Configuration
# Update database connection strings
# Method depends on your configuration management
# Example: update environment variables (note: exports only affect the current shell;
# a durable approach is sketched below)
export DB_HOST=db-slave
export DB_MASTER_HOST=db-slave
# Restart application services
systemctl restart app-service
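Because exported variables only live in the current shell, a more durable sketch is to rewrite the service's environment file before restarting; the path /etc/app/app.env and the DB_HOST/DB_MASTER_HOST keys are hypothetical placeholders for whatever your configuration management uses:
# Persist the new database host, then restart so the service picks it up
sed -i 's/^DB_HOST=.*/DB_HOST=db-slave/' /etc/app/app.env          # placeholder path
sed -i 's/^DB_MASTER_HOST=.*/DB_MASTER_HOST=db-slave/' /etc/app/app.env
systemctl restart app-service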
3.4 Verify Failover
# Test database connectivity
mysql -h db-slave -e "SELECT 1"
# Test application functionality
curl -f http://app:8080/health/db
# Verify critical operations
curl -X POST http://app:8080/api/test-write
curl -f http://app:8080/api/test-read
Phase 4: Post-Failover (30-60 minutes)
4.1 Monitor System Health
# Monitor application performance
watch -n 5 'curl -s http://app:8080/metrics | grep db_'
# Check database performance
mysql -h db-slave -e "SHOW PROCESSLIST"
mysql -h db-slave -e "SHOW ENGINE INNODB STATUS\G"
4.2 Data Consistency Check
-- Compare critical table counts
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM transactions;
-- Check for any data integrity issues
SELECT * FROM users WHERE created_at > NOW() - INTERVAL 1 HOUR;
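Row counts only catch gross divergence; MySQL's CHECKSUM TABLE gives a slightly stronger spot check when the same statement is run against both the new master and a restored copy of the old one (table names follow the examples above):
# Compare the checksums reported here against the same command run on the other copy
mysql -h db-slave -e "CHECKSUM TABLE users, orders, transactions;"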
4.3 Communication Updates
- Update status page with progress
- Notify stakeholders of successful failover
- Document lessons learned
Phase 5: Recovery Planning (1-24 hours)
5.1 Root Cause Analysis
# Collect logs from failed master
scp db-master:/var/log/mysql/error.log ./master-error.log
# Analyze system logs
journalctl -u mysql --since "1 hour ago" > mysql-system.log
# Check hardware logs
dmesg | grep -i error > hardware-errors.log
5.2 Recovery Options
Option 1: Repair and Rebuild Master
# If hardware issue, replace hardware
# If software issue, reinstall and restore from backup
# Reconfigure as new slave
Option 2: Keep Current Setup
# If failover was successful and stable
# Keep current slave as new master
# Set up new slave from current master
5.3 Setup New Slave
# Create backup of current master (former slave), recording binlog coordinates for replication
mysqldump --single-transaction --master-data=2 --routines --triggers --all-databases > master-backup.sql
# Set up new slave server
# Restore backup to new slave
mysql < master-backup.sql
# Configure replication (take MASTER_LOG_FILE / MASTER_LOG_POS from the commented
# CHANGE MASTER TO line that --master-data=2 wrote into master-backup.sql)
CHANGE MASTER TO
MASTER_HOST='current-master',
MASTER_USER='replication_user',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=12345;
START SLAVE;
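If GTID-based replication is in use (gtid_mode=ON on both servers), explicit binlog coordinates can be replaced with auto-positioning; a sketch assuming the same replication account as above:
# GTID auto-positioning removes the need to copy file/position from the dump
mysql -e "CHANGE MASTER TO MASTER_HOST='current-master', MASTER_USER='replication_user', MASTER_PASSWORD='password', MASTER_AUTO_POSITION=1; START SLAVE;"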
Automation Scripts
1. Failover Script
#!/bin/bash
# failover.sh - Automated master-slave failover
set -e
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
APP_HOST="app-server"
echo "Starting failover process..."
# Check slave status
echo "Checking slave status..."
mysql -h $SLAVE_HOST -e "SHOW SLAVE STATUS\G" | grep -E "(Slave_IO_Running|Slave_SQL_Running)"
# Stop replication
echo "Stopping replication..."
mysql -h $SLAVE_HOST -e "STOP SLAVE;"
# Promote slave to master
echo "Promoting slave to master..."
mysql -h $SLAVE_HOST -e "RESET SLAVE ALL;"
mysql -h $SLAVE_HOST -e "SET GLOBAL read_only = OFF;"
# Update application configuration
echo "Updating application configuration..."
# NOTE: a plain "export" over ssh does not affect a systemd-managed service; persist
# the new DB host via the service's environment file or config management first
ssh $APP_HOST "sudo sed -i 's/^DB_HOST=.*/DB_HOST=$SLAVE_HOST/' /etc/app/app.env && sudo systemctl restart app-service"  # /etc/app/app.env is a placeholder path
# Verify failover
echo "Verifying failover..."
sleep 10
curl -f http://$APP_HOST:8080/health/db || exit 1
echo "Failover completed successfully!"
2. Health Check Script
#!/bin/bash
# health-check.sh - Database health monitoring
MASTER_HOST="db-master"
SLAVE_HOST="db-slave"
check_database() {
  local host=$1
  local name=$2
  echo "Checking $name ($host)..."
  if mysql -h "$host" -e "SELECT 1" > /dev/null 2>&1; then
    echo "✓ $name is accessible"
    # Check replication lag for the slave
    if [ "$name" = "Slave" ]; then
      lag=$(mysql -h "$host" -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master | awk '{print $2}')
      if [ "$lag" = "NULL" ] || [ -z "$lag" ]; then
        echo "⚠ Replication is not running (Seconds_Behind_Master is NULL)"
      elif [ "$lag" -lt 60 ]; then
        echo "✓ Replication lag: ${lag}s"
      else
        echo "⚠ High replication lag: ${lag}s"
      fi
    fi
  else
    echo "✗ $name is not accessible"
    return 1
  fi
}
check_database $MASTER_HOST "Master"
check_database $SLAVE_HOST "Slave"
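A common way to run this check continuously is from cron; the schedule, script path, and log file below are assumptions, and a non-zero exit can be wired into the alerting path:
# /etc/cron.d/db-health-check - run every minute and keep a log for the on-call engineer
* * * * * root /usr/local/bin/health-check.sh >> /var/log/db-health-check.log 2>&1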
Testing and Validation
1. Failover Testing
# Schedule regular failover tests during maintenance windows
# Simulate master failure (run on the master host to block MySQL traffic)
iptables -A INPUT -p tcp --dport 3306 -j DROP
# Execute failover procedure
./failover.sh
# Verify application functionality
./test-application.sh
# Restore master connectivity (remove the blocking rule)
iptables -D INPUT -p tcp --dport 3306 -j DROP
2. Load Testing
# Test application under load after failover
ab -n 10000 -c 100 http://app:8080/api/read-heavy-endpoint
ab -n 1000 -c 10 -p data.json -T application/json http://app:8080/api/write-endpoint
3. Data Integrity Testing
-- Compare data between master and slave
-- (assumes the slave's tables are reachable from this session, e.g. via a FEDERATED
--  schema named "slave"; otherwise run the counts on each host separately and compare)
SELECT
  'users' AS table_name,
  (SELECT COUNT(*) FROM users) AS master_count,
  (SELECT COUNT(*) FROM slave.users) AS slave_count
UNION ALL
SELECT
  'orders' AS table_name,
  (SELECT COUNT(*) FROM orders) AS master_count,
  (SELECT COUNT(*) FROM slave.orders) AS slave_count;
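For verification deeper than row counts, pt-table-checksum from Percona Toolkit checksums tables chunk-by-chunk on the master and compares the results on connected slaves; a hedged sketch, where the database name, checksum user, and password are placeholders:
# Writes per-chunk checksums into percona.checksums on the master and reports any DIFFS found on slaves
pt-table-checksum --replicate=percona.checksums --databases=app_db h=db-master,u=checksum_user,p=secret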
Monitoring and Alerting
1. Key Metrics
# Database metrics to monitor
database_health:
- connection_count
- query_response_time
- replication_lag
- disk_usage
- memory_usage
- deadlock_count
- slow_query_count
- backup_status
- failover_readiness
2. Alerting Rules
# Critical alerts
critical_alerts:
- database_down
- replication_stopped
- high_replication_lag
- disk_space_critical
- memory_exhaustion
# Warning alerts
warning_alerts:
- slow_queries_increase
- connection_count_high
- backup_failed
- disk_usage_high
Best Practices
1. Prevention
- Regular backups: Automated, tested backups
- Monitoring: Comprehensive database monitoring
- Capacity planning: Proactive resource management
- Security: Regular security updates and audits
- Documentation: Keep runbooks updated
2. Preparation
- Regular testing: Monthly failover drills
- Team training: Ensure team knows procedures
- Communication plans: Clear escalation procedures
- Recovery procedures: Documented and tested
3. During Incident
- Stay calm: Follow established procedures
- Communicate: Keep stakeholders informed
- Document: Record all actions taken
- Verify: Test thoroughly before declaring resolved
4. Post-Incident
- Root cause analysis: Understand what happened
- Process improvement: Update procedures based on learnings
- Team review: Conduct post-mortem meeting
- Prevention: Implement measures to prevent recurrence
Tools and Technologies
1. Database Management
- MySQL: Native replication, MySQL Router
- PostgreSQL: Streaming replication, pgpool-II
- MongoDB: Replica sets, sharding
- Redis: Sentinel, Cluster mode
2. Monitoring Tools
- Prometheus + Grafana: Metrics and visualization
- Datadog: APM and infrastructure monitoring
- New Relic: Application performance monitoring
- PMM: Percona Monitoring and Management
3. Automation Tools
- Ansible: Configuration management
- Terraform: Infrastructure as code
- Kubernetes: Container orchestration
- Consul: Service discovery and health checks
Conclusion
Database downtime incidents require swift, coordinated response to minimize business impact. A well-prepared disaster recovery plan with automated failover procedures can significantly reduce recovery time and ensure business continuity.
Key success factors:
- Preparation: Regular testing and team training
- Automation: Automated failover and health checks
- Monitoring: Comprehensive observability
- Communication: Clear incident response procedures
- Continuous improvement: Learning from each incident
Remember: The goal is not just to recover from incidents, but to prevent them and minimize their impact when they do occur.