Consensus Algorithms: Deep Dive
Consensus algorithms are fundamental to distributed systems, enabling multiple nodes to agree on a single value or state even in the presence of failures. This deep dive explores the most important consensus algorithms, their mechanisms, trade-offs, and real-world applications.
The Consensus Problem
Definition
The consensus problem requires that all non-faulty processes in a distributed system agree on a single value, even when some processes may fail or behave maliciously.
Properties of Consensus
Any consensus algorithm must satisfy these properties:
- Agreement: All non-faulty processes decide on the same value
- Validity: If all processes propose the same value, that value is decided
- Termination: All non-faulty processes eventually decide on some value
Why Consensus is Hard
- Network Partitions: Messages can be delayed, lost, or reordered
- Process Failures: Nodes can crash, become unresponsive, or behave maliciously
- FLP Impossibility: In a fully asynchronous network, no deterministic algorithm can guarantee consensus termination if even one process may crash
- Timing Assumptions: Real systems require timeouts and failure detectors
Raft Consensus Algorithm
Overview
Raft is designed to be understandable while providing the same guarantees as Paxos. It separates the key elements of consensus: leader election, log replication, and safety.
Key Concepts
1. Server States
Every server in a Raft cluster is in one of three states:
- Follower: Passive, responds to requests from leaders and candidates
- Candidate: Actively seeking votes to become leader
- Leader: Handles all client requests and manages log replication
2. Terms
- Term: A logical time period; term numbers increase monotonically
- Each term begins with an election; at most one leader is elected per term (a split vote ends the term with no leader)
- Each server persists its current term number
- Terms act as logical clocks to detect stale leaders and stale information, as sketched below
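As a minimal sketch (using the hypothetical RaftNode type from the implementation example later on this page, not any particular library), the term rule every RPC handler applies looks like this:

// handleTerm applies the standard Raft term rule: an RPC carrying a higher
// term forces this server back to follower and updates its term; an RPC
// carrying a lower term is stale and should be rejected by the caller.
func (rn *RaftNode) handleTerm(rpcTerm int) (stale bool) {
    if rpcTerm > rn.currentTerm {
        rn.currentTerm = rpcTerm
        rn.votedFor = ""
        rn.state = Follower
    }
    return rpcTerm < rn.currentTerm
}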
3. Log Structure
Index:   1   2   3   4   5   6   7   8   9  10  11
Term:    1   1   1   2   2   2   3   3   3   3   3
Entry:  [x] [y] [z] [a] [b] [c] [d] [e] [f] [g] [h]
Each log entry contains:
- Command: The state machine command
- Term: When entry was created
- Index: Position in the log
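In code, an entry can be modeled directly from these three fields. The sketch below is one possible Go shape (field names are illustrative) and matches how the RaftNode example later on this page stores its log:

// LogEntry is one slot in the replicated log.
type LogEntry struct {
    Index   int    // position in the log (1-based in the Raft paper)
    Term    int    // term in which a leader created the entry
    Command string // opaque state machine command, e.g. "set x=5"
}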
Raft Algorithm Components
1. Leader Election
Process:
- Server starts as follower
- If no heartbeat arrives from a leader before the election timeout expires, it becomes a candidate
- Increments term, votes for self
- Requests votes from other servers
- Becomes leader if receives majority votes
Election Rules:
- A candidate wins if it receives votes from a majority of servers
- If the vote splits and no candidate wins, the election times out and a new election starts in a higher term
- Servers vote for at most one candidate per term (first-come, first-served)
- A candidate receives a vote only if its log is at least as up-to-date as the voter's (a candidate-side sketch follows the example below)
Example Election Scenario:
Initial State: All followers
Server A timeout → becomes candidate (term 2)
Server A → Server B: RequestVote(term=2, candidateId=A)
Server A → Server C: RequestVote(term=2, candidateId=A)
Server B votes for A (first candidate in term 2)
Server C votes for A
Server A receives majority → becomes leader
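A sketch of the candidate side of this flow, under simplifying assumptions: rn.id and sendRequestVote are hypothetical helpers, RPCs are issued sequentially, and the LastLogIndex/LastLogTerm fields needed for the up-to-date check are omitted for brevity:

// startElection is run by a follower whose election timeout expired.
func (rn *RaftNode) startElection() {
    rn.state = Candidate
    rn.currentTerm++    // start a new term
    rn.votedFor = rn.id // vote for ourselves
    votes := 1

    for _, peer := range rn.peers {
        reply := rn.sendRequestVote(peer, RequestVoteArgs{
            Term:        rn.currentTerm,
            CandidateId: rn.id,
        })
        if reply.Term > rn.currentTerm {
            // A newer term exists somewhere: step down immediately.
            rn.currentTerm = reply.Term
            rn.state = Follower
            return
        }
        if reply.VoteGranted {
            votes++
        }
    }

    if votes > (len(rn.peers)+1)/2 {
        rn.state = Leader // received votes from a majority of the cluster
    }
    // Otherwise remain a candidate; a new election starts after another timeout.
}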
2. Log Replication
Process:
- Client sends command to leader
- Leader appends entry to local log
- Leader sends AppendEntries to followers
- Leader waits for majority acknowledgment
- Leader commits entry and responds to client
- Leader notifies followers of commitment
AppendEntries RPC:
{
  "term": 3,
  "leaderId": "server-1",
  "prevLogIndex": 5,
  "prevLogTerm": 2,
  "entries": [
    { "term": 3, "command": "set x=5" },
    { "term": 3, "command": "set y=10" }
  ],
  "leaderCommit": 4
}
Consistency Check:
- Follower checks if prevLogIndex and prevLogTerm match
- If mismatch, follower rejects AppendEntries
- Leader decrements nextIndex and retries
- Eventually finds point where logs match
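A simplified follower-side sketch of this check (same hypothetical RaftNode type as above; the reply's Success field and Go 1.21's built-in min are assumed, and conflict handling is reduced to truncate-and-append):

func (rn *RaftNode) AppendEntries(args AppendEntriesArgs) AppendEntriesReply {
    reply := AppendEntriesReply{Term: rn.currentTerm}

    // Reject stale leaders.
    if args.Term < rn.currentTerm {
        return reply
    }
    rn.resetElectionTimeout() // valid leader contact

    // Consistency check: the entry at prevLogIndex must exist and carry prevLogTerm.
    if args.PrevLogIndex > 0 &&
        (args.PrevLogIndex > len(rn.log) ||
            rn.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
        return reply // leader will decrement nextIndex and retry
    }

    // Append the new entries, overwriting any conflicting suffix,
    // then advance commitIndex as far as the leader allows.
    rn.log = append(rn.log[:args.PrevLogIndex], args.Entries...)
    if args.LeaderCommit > rn.commitIndex {
        rn.commitIndex = min(args.LeaderCommit, len(rn.log))
    }
    reply.Success = true
    return reply
}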
3. Safety Properties
- Election Safety: at most one leader can be elected in a given term
- Leader Append-Only: a leader never overwrites or deletes entries in its own log
- Log Matching: if two logs contain an entry with the same index and term, the logs are identical up through that entry
- Leader Completeness: if an entry is committed in a given term, it is present in the logs of the leaders of all higher terms
- State Machine Safety: if a server has applied an entry at a given index, no other server ever applies a different entry at that index
Raft Implementation Example
type RaftNode struct {
    // Persistent state
    currentTerm int
    votedFor    string
    log         []LogEntry

    // Volatile state
    commitIndex int
    lastApplied int

    // Leader state
    nextIndex  map[string]int
    matchIndex map[string]int

    state NodeState // Follower, Candidate, Leader
    peers []string

    electionTimeout  time.Duration
    heartbeatTimeout time.Duration
}
func (rn *RaftNode) RequestVote(args RequestVoteArgs) RequestVoteReply {
    reply := RequestVoteReply{Term: rn.currentTerm}

    // Reply false if the candidate's term is outdated
    if args.Term < rn.currentTerm {
        reply.VoteGranted = false
        return reply
    }

    // Step down and clear our vote if the candidate's term is newer
    if args.Term > rn.currentTerm {
        rn.currentTerm = args.Term
        rn.votedFor = ""
        rn.state = Follower
    }
    reply.Term = rn.currentTerm

    // Grant the vote only if we haven't already voted for someone else this term
    // and the candidate's log is at least as up-to-date as ours
    if (rn.votedFor == "" || rn.votedFor == args.CandidateId) &&
        rn.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
        rn.votedFor = args.CandidateId
        reply.VoteGranted = true
        rn.resetElectionTimeout()
    }
    return reply
}
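The isLogUpToDate check used above follows the Raft paper's rule: the log whose last entry has the higher term is more up-to-date; if the last terms are equal, the longer log is more up-to-date. A sketch:

// isLogUpToDate reports whether a candidate log described by
// (lastLogIndex, lastLogTerm) is at least as up-to-date as ours.
func (rn *RaftNode) isLogUpToDate(lastLogIndex, lastLogTerm int) bool {
    ourLastIndex := len(rn.log)
    ourLastTerm := 0
    if ourLastIndex > 0 {
        ourLastTerm = rn.log[ourLastIndex-1].Term
    }
    if lastLogTerm != ourLastTerm {
        return lastLogTerm > ourLastTerm
    }
    return lastLogIndex >= ourLastIndex
}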
Raft Optimizations
1. Log Compaction (Snapshotting)
- Periodically create snapshots of state machine
- Discard log entries before snapshot
- Reduces storage and lets slow or new followers catch up from a snapshot instead of replaying the entire log
2. Batch Operations
- Batch multiple client requests
- Reduces number of consensus rounds
- Improves throughput
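One way to batch on the leader, sketched with a Go channel feeding the proposal path (proposeCh, appendToLog, and replicateBatch are assumed names, not part of any library):

// batchLoop drains pending client commands and replicates them as a single
// AppendEntries round instead of paying one consensus round per command.
func (rn *RaftNode) batchLoop(proposeCh <-chan string, maxBatch int) {
    for cmd := range proposeCh {
        batch := []string{cmd}
        // Greedily pick up whatever else is already queued, up to maxBatch.
    drain:
        for len(batch) < maxBatch {
            select {
            case next := <-proposeCh:
                batch = append(batch, next)
            default:
                break drain
            }
        }
        entries := rn.appendToLog(batch) // one local append for the whole batch
        rn.replicateBatch(entries)       // one replication round for the whole batch
    }
}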
3. Pipeline Replication
- Send multiple AppendEntries without waiting
- Improves latency for log replication
- Requires careful ordering
Paxos Algorithm
Overview
Paxos is a family of protocols for solving consensus in unreliable networks. It's proven to be correct but notoriously difficult to understand and implement.
Basic Paxos
Roles
- Proposer: Proposes values
- Acceptor: Votes on proposed values
- Learner: Learns chosen values
Phases
Phase 1: Prepare
- Proposer chooses proposal number n
- Sends Prepare(n) to majority of acceptors
- Acceptor responds with a promise not to accept any proposal numbered less than n
- If the acceptor has already accepted a proposal, it includes that proposal's number and value in its response
Phase 2: Accept
- If proposer receives majority promises, sends Accept(n, v)
- Value v is either from highest-numbered accepted proposal or proposer's value
- Acceptor accepts if n >= highest promised number
- If majority accepts, value is chosen
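The acceptor's side of both phases fits in a few lines. A minimal single-node, in-memory sketch (types and method names are illustrative; a real deployment adds networking and durable storage of the promised/accepted state):

type Acceptor struct {
    promised    int    // highest proposal number promised so far
    acceptedNum int    // highest proposal number accepted (0 = none yet)
    acceptedVal string // value of that accepted proposal
}

// Prepare handles Phase 1: promise to ignore proposals numbered below n and
// report any proposal already accepted so the proposer can adopt its value.
func (a *Acceptor) Prepare(n int) (ok bool, accNum int, accVal string) {
    if n > a.promised {
        a.promised = n
        return true, a.acceptedNum, a.acceptedVal
    }
    return false, 0, ""
}

// Accept handles Phase 2: accept (n, v) unless a higher number was promised.
func (a *Acceptor) Accept(n int, v string) bool {
    if n >= a.promised {
        a.promised = n
        a.acceptedNum = n
        a.acceptedVal = v
        return true
    }
    return false
}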
Example Paxos Execution
Proposer P1 wants to propose value "A"
Proposer P2 wants to propose value "B"
Phase 1:
P1 → Acceptors: Prepare(1)
A1: Promise(1) - haven't accepted anything
A2: Promise(1) - haven't accepted anything
A3: Promise(1) - haven't accepted anything
P2 → Acceptors: Prepare(2)
A1: Promise(2) - haven't accepted anything
A2: Promise(2) - haven't accepted anything
A3: Promise(2) - haven't accepted anything
Phase 2:
P1 → Acceptors: Accept(1, "A") - rejected (promised higher number 2)
P2 → Acceptors: Accept(2, "B") - accepted by majority
Value "B" is chosen
Multi-Paxos
Basic Paxos chooses a single value. Multi-Paxos extends this to choose an ordered sequence of values:
- Leader Election: Choose distinguished proposer
- Skip Phase 1: Leader can skip prepare phase for subsequent proposals
- Log Ordering: Maintain ordered sequence of chosen values
Paxos Variants
1. Fast Paxos
- Allows clients to send values directly to acceptors
- Reduces latency from 4 to 3 message delays
- Requires larger quorums (3/4 instead of 1/2)
2. Cheap Paxos
- Uses fewer acceptors during normal operation
- Reconfigures to use more acceptors during failures
- Reduces cost while maintaining fault tolerance
3. Byzantine Paxos
- Handles Byzantine (malicious) failures
- Requires 3f+1 replicas to tolerate f Byzantine failures
- Uses digital signatures for authentication
Paxos vs Raft Comparison
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex, hard to understand | Simple, designed for clarity |
| Implementation | Difficult, many edge cases | Straightforward |
| Leader Election | No built-in leader election | Strong leader model |
| Log Structure | Flexible, allows gaps | Contiguous, no gaps |
| Performance | Can be optimized for throughput | Good balance of performance/simplicity |
| Reconfiguration | Complex | Simpler with joint consensus |
Byzantine Fault Tolerance (BFT)
Problem Definition
Byzantine failures are the most general type of failure, where nodes can:
- Crash or become unresponsive
- Send incorrect or contradictory messages
- Behave maliciously or be compromised
- Collude with other Byzantine nodes
Byzantine Generals Problem
Scenario: Byzantine army with generals surrounding enemy city
- Generals must coordinate attack or retreat
- Some generals may be traitors
- Communication only through messengers
- Need majority agreement despite traitors
Solution Requirements:
- All loyal generals decide on same action
- Small number of traitors cannot cause loyal generals to adopt bad plan
PBFT (Practical Byzantine Fault Tolerance)
System Model
- Assumptions: Partial synchrony, authenticated messages
- Fault Model: Up to f Byzantine failures out of 3f+1 total replicas
- Safety: Never violates safety properties
- Liveness: Eventually makes progress
PBFT Algorithm Phases
Phase 1: Pre-prepare
Primary → Backups: PRE-PREPARE(v, n, m)
v = view number
n = sequence number
m = client request
Phase 2: Prepare
Backup i → All: PREPARE(v, n, d, i)
d = digest of request m
Phase 3: Commit
Replica i → All: COMMIT(v, n, d, i)
After receiving 2f matching PREPARE messages
Execution
After receiving 2f+1 matching COMMIT messages:
- Execute request
- Send reply to client
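The quorum arithmetic behind these phases, made concrete in a short sketch (real PBFT counts messages per (view, sequence number, digest) tuple and includes the replica's own messages):

// pbftQuorums returns the thresholds used in a cluster of n = 3f+1 replicas.
// For n = 4: f = 1, prepareQuorum = 2, commitQuorum = 3.
func pbftQuorums(n int) (f, prepareQuorum, commitQuorum int) {
    f = (n - 1) / 3        // maximum number of Byzantine replicas tolerated
    prepareQuorum = 2 * f  // matching PREPAREs (plus the PRE-PREPARE) to become "prepared"
    commitQuorum = 2*f + 1 // matching COMMITs required before executing the request
    return f, prepareQuorum, commitQuorum
}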
PBFT Example Execution
Client Request: "transfer $100 from A to B"
Phase 1 - Pre-prepare:
Primary → All Backups: PRE-PREPARE(view=1, seq=5, request)
Phase 2 - Prepare:
Backup1 → All: PREPARE(1, 5, hash(request), 1)
Backup2 → All: PREPARE(1, 5, hash(request), 2)
Backup3 → All: PREPARE(1, 5, hash(request), 3)
Phase 3 - Commit:
Each replica after receiving 2f PREPARE messages:
Replica1 → All: COMMIT(1, 5, hash(request), 1)
Replica2 → All: COMMIT(1, 5, hash(request), 2)
Replica3 → All: COMMIT(1, 5, hash(request), 3)
Execution:
Each replica after receiving 2f+1 COMMIT messages:
- Executes "transfer $100 from A to B"
- Sends reply to client
View Changes
When primary fails:
- Replicas detect primary failure via timeout
- Replicas broadcast VIEW-CHANGE messages
- New primary collects 2f+1 VIEW-CHANGE messages
- New primary broadcasts NEW-VIEW message
- System continues with new primary
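A sketch of the trigger side of a view change, under assumptions: Replica, executed, and broadcastViewChange are illustrative names, and the new primary is chosen deterministically as replica (view+1) mod n as in PBFT:

type Replica struct {
    view     int
    executed chan struct{} // signaled whenever a client request is executed
}

// broadcastViewChange would send VIEW-CHANGE(v, last stable checkpoint,
// prepared proofs) to all replicas; the network send is omitted here.
func (r *Replica) broadcastViewChange(v int) { /* stub */ }

// monitorPrimary suspects the primary if no request completes within the
// timeout and votes to move from the current view to the next one.
func (r *Replica) monitorPrimary(requestTimeout time.Duration) {
    for {
        select {
        case <-r.executed: // progress observed; keep the current primary
        case <-time.After(requestTimeout):
            r.view++
            r.broadcastViewChange(r.view)
        }
    }
}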
Modern BFT Algorithms
1. HotStuff
- Linear Communication: O(n) message complexity
- Responsiveness: Fast during good periods
- Simplicity: Easier to understand than PBFT
- Used By: LibraBFT (Diem blockchain)
2. Tendermint
- Immediate Finality: No forks, immediate transaction finality
- Application Agnostic: Separates consensus from application logic
- Used By: Cosmos blockchain ecosystem
3. SBFT (Scalable Byzantine Fault Tolerance)
- High Throughput: Optimized for performance
- Scalability: Better scaling than PBFT
- Batching: Efficient request batching
Consensus in Practice
etcd (Uses Raft)
# etcd cluster configuration
name: node1
data-dir: /var/lib/etcd
listen-client-urls: http://localhost:2379
advertise-client-urls: http://localhost:2379
listen-peer-urls: http://localhost:2380
initial-advertise-peer-urls: http://localhost:2380
initial-cluster: node1=http://localhost:2380,node2=http://node2:2380,node3=http://node3:2380
initial-cluster-state: new
Key Features:
- Strongly consistent key-value store
- Used by Kubernetes for cluster state
- Raft consensus for leader election and log replication
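For illustration, a small client program against such a cluster, assuming the official Go client (go.etcd.io/etcd/client/v3); the endpoint and key are placeholders:

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Connect to the cluster; any member can serve requests, writes go through the Raft leader.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // Put is committed through Raft before it returns.
    if _, err := cli.Put(ctx, "/config/feature-x", "enabled"); err != nil {
        panic(err)
    }

    // Reads are linearizable by default.
    resp, err := cli.Get(ctx, "/config/feature-x")
    if err != nil {
        panic(err)
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s = %s\n", kv.Key, kv.Value)
    }
}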
Apache Cassandra (Uses Paxos for Lightweight Transactions)
-- Lightweight transaction using Paxos
INSERT INTO users (id, name, email)
VALUES (123, 'John Doe', 'john@example.com')
IF NOT EXISTS;
-- Conditional update using Paxos
UPDATE users
SET email = 'newemail@example.com'
WHERE id = 123
IF email = 'john@example.com';
Blockchain Consensus
1. Proof of Work (Bitcoin)
- Miners compete to solve cryptographic puzzle
- Longest chain rule for consensus
- Energy intensive but secure
2. Proof of Stake (Ethereum 2.0)
- Validators chosen based on stake
- More energy efficient
- Economic penalties for misbehavior
3. Practical Byzantine Fault Tolerance (permissioned blockchains, e.g. early Hyperledger Fabric releases)
- Permissioned network
- Fast finality
- Suitable for enterprise applications
Performance Characteristics
Latency Comparison
| Algorithm | Message Rounds | Latency (ms) | Throughput (ops/sec) |
|---|---|---|---|
| Raft | 2 | 1-5 | 10,000-100,000 |
| Multi-Paxos | 2 | 1-5 | 10,000-100,000 |
| PBFT | 3 | 5-20 | 1,000-10,000 |
| HotStuff | 3 | 3-15 | 5,000-50,000 |
Note: Performance varies significantly based on network conditions, hardware, and implementation quality.
Scalability Limits
Raft/Paxos
- Practical Limit: 5-7 nodes for strong consistency
- Bottleneck: Leader becomes bottleneck
- Network: O(n²) messages for some operations
Byzantine Fault Tolerance
- Practical Limit: 10-20 nodes
- Bottleneck: O(n²) or O(n³) message complexity
- Overhead: Cryptographic operations
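These limits follow from the replica counts each fault model requires; a small sketch (majority quorums for crash faults, intersecting 2f+1 quorums out of 3f+1 replicas for Byzantine faults):

// replicasNeeded returns the minimum cluster size to tolerate f failed nodes.
func replicasNeeded(f int, byzantine bool) int {
    if byzantine {
        // Two quorums of size 2f+1 out of 3f+1 intersect in at least f+1
        // replicas, so at least one honest replica is in the intersection.
        return 3*f + 1
    }
    // Crash faults: in a cluster of 2f+1, any two majorities (f+1 nodes)
    // intersect in at least one replica, which carries the committed state forward.
    return 2*f + 1
}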
Choosing the Right Consensus Algorithm
Decision Matrix
Use Raft When:
- ✅ Need strong consistency
- ✅ Simplicity is important
- ✅ Small cluster size (3-7 nodes)
- ✅ Crash failures only
- ✅ Examples: etcd, HashiCorp Consul, CockroachDB
Use Paxos When:
- ✅ Need proven correctness
- ✅ Complex deployment scenarios
- ✅ Can handle implementation complexity
- ✅ Examples: Google Spanner, Apache Cassandra LWT
Use BFT When:
- ✅ Need Byzantine fault tolerance
- ✅ Untrusted or adversarial environment
- ✅ Can tolerate higher latency/lower throughput
- ✅ Examples: Blockchain, critical infrastructure
Use Eventual Consistency When:
- ✅ High availability is critical
- ✅ Can tolerate temporary inconsistency
- ✅ Global scale deployment
- ✅ Examples: DNS, CDNs, social media feeds
Implementation Considerations
1. Network Partitions
// Partition detection and handling
func (node *RaftNode) handlePartition() {
    if node.isPartitioned() {
        if node.hasQuorum() {
            // Continue as leader/follower
            node.continueOperation()
        } else {
            // Step down and become follower
            node.stepDown()
        }
    }
}
2. Failure Detection
// Heartbeat mechanism
func (node *RaftNode) sendHeartbeats() {
    for node.isLeader() {
        for _, peer := range node.peers {
            go node.sendAppendEntries(peer, true) // empty heartbeat
        }
        time.Sleep(node.heartbeatInterval)
    }
}
3. Log Compaction
// Snapshot creation
// (assumes node.log is indexed so that node.log[i] holds the entry at Raft index i)
func (node *RaftNode) createSnapshot() {
    snapshot := Snapshot{
        LastIncludedIndex: node.lastApplied,
        LastIncludedTerm:  node.log[node.lastApplied].Term,
        Data:              node.stateMachine.serialize(),
    }
    node.persistSnapshot(snapshot)
    node.trimLog(snapshot.LastIncludedIndex) // discard entries covered by the snapshot
}
Testing Consensus Implementations
1. Jepsen Testing
; Jepsen test for a consensus system
(deftest raft-test
  (let [test (assoc tests/noop-test
                    :name    "raft-consensus"
                    :client  (raft-client)
                    :nemesis (nemesis/partition-random-halves)
                    :checker (checker/linearizable))]
    (is (:valid? (:results (jepsen/run! test))))))
2. Chaos Engineering
- Network partitions
- Node crashes
- Clock skew
- Message delays/reordering
- Byzantine failures
3. Property-Based Testing
# Property: Safety - all nodes agree on committed values
# (RaftCluster, random_key, random_value, and all_equal are assumed test helpers)
from hypothesis import given
from hypothesis.strategies import lists, sampled_from

@given(operations=lists(sampled_from(['read', 'write', 'partition'])))
def test_consensus_safety(operations):
    cluster = RaftCluster(nodes=5)
    for op in operations:
        if op == 'write':
            cluster.write(random_key(), random_value())
        elif op == 'read':
            values = cluster.read_all(random_key())
            assert all_equal(values), "Nodes disagree on committed value"
        elif op == 'partition':
            cluster.partition_random()
Future Directions
1. Flexible Paxos
- Decouples leader election from replication
- Allows different quorum sizes for different phases
- Better performance in some scenarios
2. EPaxos (Egalitarian Paxos)
- No distinguished leader
- Commands can be committed in any order
- Better load distribution
3. Raft Extensions
- Parallel Raft: Multiple independent Raft groups
- Multi-Raft: Sharding with multiple Raft groups
- Learner Nodes: Read-only replicas for scaling reads
4. Quantum-Safe Consensus
- Preparing for quantum computing threats
- Post-quantum cryptography integration
- New security models
Conclusion
Consensus algorithms are fundamental building blocks of distributed systems. Understanding their trade-offs helps in:
- Choosing appropriate algorithms for specific use cases
- Implementing reliable distributed systems
- Debugging consensus-related issues
- Optimizing performance based on workload characteristics
The key is matching the algorithm to your specific requirements for consistency, availability, partition tolerance, and performance.
References
Foundational Papers
- Raft: "In Search of an Understandable Consensus Algorithm" - Ongaro & Ousterhout
- Paxos: "The Part-Time Parliament" - Leslie Lamport
- PBFT: "Practical Byzantine Fault Tolerance" - Castro & Liskov
- FLP: "Impossibility of Distributed Consensus with One Faulty Process" - Fischer, Lynch, Paterson
Implementations
- etcd: github.com/etcd-io/etcd
- HashiCorp Raft: github.com/hashicorp/raft
- BFT-SMaRt: github.com/bft-smart/library
Related Wiki Pages
- Distributed Systems Concepts
- etcd Configuration
- ZooKeeper