
Consensus Algorithms: Deep Dive

Consensus algorithms are fundamental to distributed systems, enabling multiple nodes to agree on a single value or state even in the presence of failures. This deep dive explores the most important consensus algorithms, their mechanisms, trade-offs, and real-world applications.

The Consensus Problem

Definition

The consensus problem requires that all non-faulty processes in a distributed system agree on a single value, even when some processes may fail or behave maliciously.

Properties of Consensus

Any consensus algorithm must satisfy these properties:

  1. Agreement: All non-faulty processes decide on the same value
  2. Validity: If all processes propose the same value, that value is decided
  3. Termination: All non-faulty processes eventually decide on some value

Why Consensus is Hard

  • Network Partitions: Messages can be delayed, lost, or reordered
  • Process Failures: Nodes can crash, become unresponsive, or behave maliciously
  • FLP Impossibility: In a fully asynchronous system, no deterministic consensus algorithm can guarantee termination if even one process may crash
  • Timing Assumptions: Real systems require timeouts and failure detectors

Raft Consensus Algorithm

Overview

Raft is designed to be understandable while providing the same guarantees as Paxos. It separates the key elements of consensus: leader election, log replication, and safety.

Key Concepts

1. Server States

Every server in a Raft cluster is in one of three states:

  • Follower: Passive, responds to requests from leaders and candidates
  • Candidate: Actively seeking votes to become leader
  • Leader: Handles all client requests and manages log replication

2. Terms

  • Term: A logical time period identified by a monotonically increasing number
  • Each term begins with an election; if a candidate wins, it serves as leader for the remainder of that term
  • Each server persists its current term number
  • Terms act as logical clocks, letting servers detect stale leaders and outdated information

3. Log Structure

Index:   1   2   3   4   5   6   7   8   9  10  11
Term:    1   1   1   2   2   2   3   3   3   3   3
Entry:  [x] [y] [z] [a] [b] [c] [d] [e] [f] [g] [h]

Each log entry contains:

  • Command: The state machine command
  • Term: When entry was created
  • Index: Position in the log
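
In Go, an entry can be modeled as a small struct. This is a minimal sketch with illustrative field names, matching the log slice used in the implementation example later in this section.

// LogEntry is one slot in the replicated log. Storing the index explicitly
// keeps entries self-describing once snapshots truncate the head of the log.
type LogEntry struct {
    Index   int    // position in the log (1-based, as in the Raft paper)
    Term    int    // term in which the leader created the entry
    Command []byte // opaque state machine command, e.g. "set x=5"
}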

Raft Algorithm Components

1. Leader Election

Process:

  1. Server starts as follower
  2. If no heartbeat from leader, becomes candidate
  3. Increments term, votes for self
  4. Requests votes from other servers
  5. Becomes leader if receives majority votes

Election Rules:

  • Candidate wins if receives votes from majority of servers
  • If vote split, new election starts
  • Servers vote for at most one candidate per term
  • Candidate must have up-to-date log to receive vote
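
The rules above translate into a small amount of state-machine code. A minimal sketch, assuming the RaftNode fields shown later plus hypothetical heartbeatCh, id, and sendRequestVote helpers (imports of "time" and "math/rand" are also assumed):

// runElectionTimer waits for a randomized timeout; if no heartbeat arrives,
// the follower becomes a candidate. Randomization reduces split votes.
func (rn *RaftNode) runElectionTimer() {
    timeout := rn.electionTimeout + time.Duration(rand.Intn(150))*time.Millisecond
    select {
    case <-rn.heartbeatCh: // leader is alive, remain a follower
    case <-time.After(timeout):
        rn.startElection()
    }
}

func (rn *RaftNode) startElection() {
    rn.state = Candidate
    rn.currentTerm++    // step 3: increment term...
    rn.votedFor = rn.id // ...and vote for self
    votes := 1
    for _, peer := range rn.peers {
        if rn.sendRequestVote(peer).VoteGranted { // step 4: request votes
            votes++
        }
    }
    if votes > (len(rn.peers)+1)/2 { // step 5: majority of the full cluster
        rn.state = Leader
    }
}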

Example Election Scenario:

Initial State: All followers
Server A timeout → becomes candidate (term 2)
Server A → Server B: RequestVote(term=2, candidateId=A)
Server A → Server C: RequestVote(term=2, candidateId=A)

Server B votes for A (first candidate in term 2)
Server C votes for A
Server A receives majority → becomes leader

2. Log Replication

Process:

  1. Client sends command to leader
  2. Leader appends entry to local log
  3. Leader sends AppendEntries to followers
  4. Leader waits for majority acknowledgment
  5. Leader commits entry and responds to client
  6. Leader notifies followers of commitment
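
Steps 4 and 5 correspond to the leader advancing its commit index once a majority of matchIndex values have caught up. A minimal sketch, assuming 1-based log indices stored in a 0-based slice and a hypothetical lastLogIndex helper:

// advanceCommitIndex sketches the leader's commit rule: an entry from the
// current term is committed once a majority of servers store it.
func (rn *RaftNode) advanceCommitIndex() {
    for n := rn.commitIndex + 1; n <= rn.lastLogIndex(); n++ {
        if rn.log[n-1].Term != rn.currentTerm {
            continue // only current-term entries are committed by counting replicas
        }
        replicas := 1 // the leader's own copy
        for _, idx := range rn.matchIndex {
            if idx >= n {
                replicas++
            }
        }
        if replicas > (len(rn.peers)+1)/2 {
            rn.commitIndex = n // step 5: commit, apply, and reply to the client
        }
    }
}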

AppendEntries RPC:

{
  "term": 3,
  "leaderId": "server-1",
  "prevLogIndex": 5,
  "prevLogTerm": 2,
  "entries": [
    { "term": 3, "command": "set x=5" },
    { "term": 3, "command": "set y=10" }
  ],
  "leaderCommit": 4
}

Consistency Check:

  • Follower checks if prevLogIndex and prevLogTerm match
  • If mismatch, follower rejects AppendEntries
  • Leader decrements nextIndex and retries
  • Eventually finds point where logs match
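
A sketch of the follower side of this check; the AppendEntriesArgs and AppendEntriesReply types are assumed to mirror the RPC fields shown above, and log indices are treated as 1-based:

// handleAppendEntries performs the consistency check before appending entries.
func (rn *RaftNode) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
    reply := AppendEntriesReply{Term: rn.currentTerm}

    if args.Term < rn.currentTerm {
        return reply // stale leader, reject
    }

    // Reject if our log has no entry at prevLogIndex with prevLogTerm;
    // the leader will decrement nextIndex for this follower and retry.
    if args.PrevLogIndex > len(rn.log) ||
        (args.PrevLogIndex > 0 && rn.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
        return reply
    }

    // Logs match up to prevLogIndex: overwrite any conflicting suffix.
    rn.log = append(rn.log[:args.PrevLogIndex], args.Entries...)

    // Advance commitIndex, but never past the end of our own log.
    if args.LeaderCommit > rn.commitIndex {
        rn.commitIndex = args.LeaderCommit
        if last := len(rn.log); rn.commitIndex > last {
            rn.commitIndex = last
        }
    }
    reply.Success = true
    return reply
}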

3. Safety Properties

  • Election Safety: At most one leader can be elected in a given term
  • Leader Append-Only: A leader never overwrites or deletes entries in its own log
  • Log Matching: If two logs contain an entry with the same index and term, the logs are identical up to that entry
  • Leader Completeness: If an entry is committed in a given term, it is present in the logs of the leaders of all higher terms
  • State Machine Safety: If a server has applied an entry at a given index, no other server will apply a different entry at that index

Raft Implementation Example

type RaftNode struct {
    // Persistent state
    currentTerm int
    votedFor    string
    log         []LogEntry

    // Volatile state
    commitIndex int
    lastApplied int

    // Leader state
    nextIndex  map[string]int
    matchIndex map[string]int

    state NodeState // Follower, Candidate, Leader
    peers []string

    electionTimeout  time.Duration
    heartbeatTimeout time.Duration
}

func (rn *RaftNode) RequestVote(args RequestVoteArgs) RequestVoteReply {
    reply := RequestVoteReply{Term: rn.currentTerm}

    // Reply false if candidate's term is outdated
    if args.Term < rn.currentTerm {
        reply.VoteGranted = false
        return reply
    }

    // Update term if candidate's term is newer
    if args.Term > rn.currentTerm {
        rn.currentTerm = args.Term
        rn.votedFor = ""
        rn.state = Follower
        reply.Term = rn.currentTerm // report the updated term back to the candidate
    }

    // Check if we can vote for this candidate
    if (rn.votedFor == "" || rn.votedFor == args.CandidateId) &&
        rn.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
        rn.votedFor = args.CandidateId
        reply.VoteGranted = true
        rn.resetElectionTimeout()
    }

    return reply
}

Raft Optimizations

1. Log Compaction (Snapshotting)

  • Periodically create snapshots of state machine
  • Discard log entries before snapshot
  • Reduces storage and speeds up slow followers

2. Batch Operations

  • Batch multiple client requests
  • Reduces number of consensus rounds
  • Improves throughput

3. Pipeline Replication

  • Send multiple AppendEntries without waiting
  • Improves latency for log replication
  • Requires careful ordering

Paxos Algorithm

Overview

Paxos is a family of protocols for solving consensus in unreliable networks. It's proven to be correct but notoriously difficult to understand and implement.

Basic Paxos

Roles

  • Proposer: Proposes values
  • Acceptor: Votes on proposed values
  • Learner: Learns chosen values

Phases

Phase 1: Prepare

  1. Proposer chooses proposal number n
  2. Sends Prepare(n) to majority of acceptors
  3. Acceptor responds with promise not to accept proposals < n
  4. If acceptor already accepted proposal, includes that value

Phase 2: Accept

  1. If proposer receives majority promises, sends Accept(n, v)
  2. Value v is either from highest-numbered accepted proposal or proposer's value
  3. Acceptor accepts if n >= highest promised number
  4. If majority accepts, value is chosen
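
These two phases correspond to two small handlers on the acceptor. A minimal single-decree sketch; the Acceptor type and return shapes are illustrative, not taken from any particular library:

// Acceptor holds the durable state a Paxos acceptor must persist.
type Acceptor struct {
    promised    int    // highest proposal number promised (0 = none yet)
    acceptedNum int    // number of the proposal whose value was accepted
    acceptedVal string // the accepted value, if any
}

// HandlePrepare (phase 1b): promise to ignore proposals numbered below n and
// report any previously accepted proposal so the proposer can adopt its value.
func (a *Acceptor) HandlePrepare(n int) (ok bool, prevNum int, prevVal string) {
    if n > a.promised {
        a.promised = n
        return true, a.acceptedNum, a.acceptedVal
    }
    return false, 0, ""
}

// HandleAccept (phase 2b): accept if n is at least the highest promised number.
func (a *Acceptor) HandleAccept(n int, v string) bool {
    if n >= a.promised {
        a.promised, a.acceptedNum, a.acceptedVal = n, n, v
        return true
    }
    return false
}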

Example Paxos Execution

Proposer P1 wants to propose value "A"
Proposer P2 wants to propose value "B"

Phase 1:
P1 → Acceptors: Prepare(1)
A1: Promise(1) - haven't accepted anything
A2: Promise(1) - haven't accepted anything
A3: Promise(1) - haven't accepted anything

P2 → Acceptors: Prepare(2)
A1: Promise(2) - haven't accepted anything
A2: Promise(2) - haven't accepted anything
A3: Promise(2) - haven't accepted anything

Phase 2:
P1 → Acceptors: Accept(1, "A") - rejected (promised higher number 2)
P2 → Acceptors: Accept(2, "B") - accepted by majority
Value "B" is chosen

Multi-Paxos

Basic Paxos chooses a single value. Multi-Paxos extends it to choose a sequence of values:

  1. Leader Election: Choose distinguished proposer
  2. Skip Phase 1: Leader can skip prepare phase for subsequent proposals
  3. Log Ordering: Maintain ordered sequence of chosen values

Paxos Variants

1. Fast Paxos

  • Allows clients to send values directly to acceptors
  • Reduces latency from 4 to 3 message delays
  • Requires larger quorums (3/4 instead of 1/2)

2. Cheap Paxos

  • Uses fewer acceptors during normal operation
  • Reconfigures to use more acceptors during failures
  • Reduces cost while maintaining fault tolerance

3. Byzantine Paxos

  • Handles Byzantine (malicious) failures
  • Requires 3f+1 replicas to tolerate f Byzantine failures
  • Uses digital signatures for authentication

Paxos vs Raft Comparison

Aspect             | Paxos                           | Raft
-------------------|---------------------------------|----------------------------------------
Understandability  | Complex, hard to understand     | Simple, designed for clarity
Implementation     | Difficult, many edge cases      | Straightforward
Leader Election    | No built-in leader election     | Strong leader model
Log Structure      | Flexible, allows gaps           | Strongly consistent, no gaps
Performance        | Can be optimized for throughput | Good balance of performance/simplicity
Reconfiguration    | Complex                         | Simpler with joint consensus

Byzantine Fault Tolerance (BFT)

Problem Definition

Byzantine failures are the most general type of failure, where nodes can:

  • Crash or become unresponsive
  • Send incorrect or contradictory messages
  • Behave maliciously or be compromised
  • Collude with other Byzantine nodes

Byzantine Generals Problem

Scenario: Several generals of the Byzantine army have surrounded an enemy city

  • Generals must coordinate attack or retreat
  • Some generals may be traitors
  • Communication only through messengers
  • Need majority agreement despite traitors

Solution Requirements:

  • All loyal generals decide on same action
  • Small number of traitors cannot cause loyal generals to adopt bad plan

PBFT (Practical Byzantine Fault Tolerance)

System Model

  • Assumptions: Partial synchrony, authenticated messages
  • Fault Model: Up to f Byzantine failures out of 3f+1 total replicas
  • Safety: Never violates safety properties
  • Liveness: Eventually makes progress

PBFT Algorithm Phases

Phase 1: Pre-prepare

Primary → Backups: PRE-PREPARE(v, n, m)
v = view number
n = sequence number
m = client request

Phase 2: Prepare

Backup i → All: PREPARE(v, n, d, i)
d = digest of request m

Phase 3: Commit

Replica i → All: COMMIT(v, n, d, i)
After receiving 2f matching PREPARE messages

Execution

After receiving 2f+1 matching COMMIT messages:
- Execute request
- Send reply to client
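
The thresholds above all follow from the n = 3f+1 sizing. A small helper, purely illustrative, makes the arithmetic explicit:

// pbftThresholds derives the fault bound and quorum sizes for n replicas,
// assuming n = 3f+1.
func pbftThresholds(n int) (f, prepareQuorum, commitQuorum int) {
    f = (n - 1) / 3
    prepareQuorum = 2 * f  // matching PREPAREs (plus the PRE-PREPARE) to become prepared
    commitQuorum = 2*f + 1 // matching COMMITs before executing the request
    return
}

// Example: n = 4 gives f = 1, so a replica needs 2 matching PREPAREs and
// 3 matching COMMITs before it executes the client request.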

PBFT Example Execution

Client Request: "transfer $100 from A to B"

Phase 1 - Pre-prepare:
Primary → All Backups: PRE-PREPARE(view=1, seq=5, request)

Phase 2 - Prepare:
Backup1 → All: PREPARE(1, 5, hash(request), 1)
Backup2 → All: PREPARE(1, 5, hash(request), 2)
Backup3 → All: PREPARE(1, 5, hash(request), 3)

Phase 3 - Commit:
Each replica after receiving 2f PREPARE messages:
Replica1 → All: COMMIT(1, 5, hash(request), 1)
Replica2 → All: COMMIT(1, 5, hash(request), 2)
Replica3 → All: COMMIT(1, 5, hash(request), 3)

Execution:
Each replica after receiving 2f+1 COMMIT messages:
- Executes "transfer $100 from A to B"
- Sends reply to client

View Changes

When primary fails:

  1. Replicas detect primary failure via timeout
  2. Replicas broadcast VIEW-CHANGE messages
  3. New primary collects 2f+1 VIEW-CHANGE messages
  4. New primary broadcasts NEW-VIEW message
  5. System continues with new primary

Modern BFT Algorithms

1. HotStuff

  • Linear Communication: O(n) message complexity
  • Responsiveness: Commits at network speed during synchronous periods (optimistic responsiveness)
  • Simplicity: Easier to understand than PBFT
  • Used By: LibraBFT (Diem blockchain)

2. Tendermint

  • Immediate Finality: No forks, immediate transaction finality
  • Application Agnostic: Separates consensus from application logic
  • Used By: Cosmos blockchain ecosystem

3. SBFT (Scalable Byzantine Fault Tolerance)

  • High Throughput: Optimized for performance
  • Scalability: Better scaling than PBFT
  • Batching: Efficient request batching

Consensus in Practice

etcd (Uses Raft)

# etcd cluster configuration
name: node1
data-dir: /var/lib/etcd
listen-client-urls: http://localhost:2379
advertise-client-urls: http://localhost:2379
listen-peer-urls: http://localhost:2380
initial-advertise-peer-urls: http://localhost:2380
initial-cluster: node1=http://localhost:2380,node2=http://node2:2380,node3=http://node3:2380
initial-cluster-state: new

Key Features:

  • Strongly consistent key-value store
  • Used by Kubernetes for cluster state
  • Raft consensus for leader election and log replication
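
For completeness, a short client-side sketch assuming the official Go client (go.etcd.io/etcd/client/v3); every Put below is replicated through Raft and acknowledged only after a majority of the cluster has persisted it:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Connect to the cluster configured above.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    // The write returns only after the Raft leader has committed it.
    if _, err := cli.Put(ctx, "/config/feature-flag", "on"); err != nil {
        log.Fatal(err)
    }

    resp, err := cli.Get(ctx, "/config/feature-flag")
    if err != nil {
        log.Fatal(err)
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s = %s\n", kv.Key, kv.Value)
    }
}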

Apache Cassandra (Uses Paxos for Lightweight Transactions)

-- Lightweight transaction using Paxos
INSERT INTO users (id, name, email)
VALUES (123, 'John Doe', 'john@example.com')
IF NOT EXISTS;

-- Conditional update using Paxos
UPDATE users
SET email = 'newemail@example.com'
WHERE id = 123
IF email = 'john@example.com';

Blockchain Consensus

1. Proof of Work (Bitcoin)

  • Miners compete to solve cryptographic puzzle
  • Longest chain rule for consensus
  • Energy intensive but secure

2. Proof of Stake (Ethereum 2.0)

  • Validators chosen based on stake
  • More energy efficient
  • Economic penalties for misbehavior

3. Practical Byzantine Fault Tolerance (permissioned blockchains, e.g. early Hyperledger Fabric)

  • Permissioned network
  • Fast finality
  • Suitable for enterprise applications

Performance Characteristics

Latency Comparison

Algorithm    | Message Rounds | Latency (ms) | Throughput (ops/sec)
-------------|----------------|--------------|---------------------
Raft         | 2              | 1-5          | 10,000-100,000
Multi-Paxos  | 2              | 1-5          | 10,000-100,000
PBFT         | 3              | 5-20         | 1,000-10,000
HotStuff     | 3              | 3-15         | 5,000-50,000

Note: Performance varies significantly based on network conditions, hardware, and implementation quality.

Scalability Limits

Raft/Paxos

  • Practical Limit: 5-7 nodes for strong consistency
  • Bottleneck: Leader becomes bottleneck
  • Network: O(n²) messages for some operations

Byzantine Fault Tolerance

  • Practical Limit: 10-20 nodes
  • Bottleneck: O(n²) or O(n³) message complexity
  • Overhead: Cryptographic operations

Choosing the Right Consensus Algorithm

Decision Matrix

Use Raft When:

  • ✅ Need strong consistency
  • ✅ Simplicity is important
  • ✅ Small cluster size (3-7 nodes)
  • ✅ Crash failures only
  • ✅ Examples: etcd, HashiCorp Consul, CockroachDB

Use Paxos When:

  • ✅ Need proven correctness
  • ✅ Complex deployment scenarios
  • ✅ Can handle implementation complexity
  • ✅ Examples: Google Spanner, Apache Cassandra LWT

Use BFT When:

  • ✅ Need Byzantine fault tolerance
  • ✅ Untrusted or adversarial environment
  • ✅ Can tolerate higher latency/lower throughput
  • ✅ Examples: Blockchain, critical infrastructure

Use Eventual Consistency When:

  • ✅ High availability is critical
  • ✅ Can tolerate temporary inconsistency
  • ✅ Global scale deployment
  • ✅ Examples: DNS, CDNs, social media feeds

Implementation Considerations

1. Network Partitions

// Partition detection and handling
func (node *RaftNode) handlePartition() {
    if node.isPartitioned() {
        if node.hasQuorum() {
            // Continue as leader/follower
            node.continueOperation()
        } else {
            // Step down and become follower
            node.stepDown()
        }
    }
}

2. Failure Detection

// Heartbeat mechanism
func (node *RaftNode) sendHeartbeats() {
    for node.isLeader() {
        for _, peer := range node.peers {
            go node.sendAppendEntries(peer, true) // empty heartbeat
        }
        time.Sleep(node.heartbeatInterval)
    }
}

3. Log Compaction

// Snapshot creation
func (node *RaftNode) createSnapshot() {
    snapshot := Snapshot{
        LastIncludedIndex: node.lastApplied,
        LastIncludedTerm:  node.log[node.lastApplied].Term,
        Data:              node.stateMachine.serialize(),
    }

    node.persistSnapshot(snapshot)
    node.trimLog(snapshot.LastIncludedIndex)
}

Testing Consensus Implementations

1. Jepsen Testing

; Jepsen test for consensus system
(deftest raft-test
  (let [test (assoc tests/noop-test
                    :name    "raft-consensus"
                    :client  (raft-client)
                    :nemesis (nemesis/partition-random-halves)
                    :checker (checker/linearizable))]
    (is (:valid? (:results (jepsen/run! test))))))

2. Chaos Engineering

  • Network partitions
  • Node crashes
  • Clock skew
  • Message delays/reordering
  • Byzantine failures
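
A chaos-style test might look like the following sketch, which assumes a hypothetical in-process harness (NewTestCluster, Partition, Heal, Write, and ReadAll are illustrative, not a real library) and the standard testing package:

// TestLinearizableUnderPartition isolates a minority, verifies the majority
// side still accepts writes, then checks convergence after the partition heals.
func TestLinearizableUnderPartition(t *testing.T) {
    cluster := NewTestCluster(5)
    defer cluster.Shutdown()

    // Isolate nodes 0-1; nodes 2-4 retain a quorum and should keep serving writes.
    cluster.Partition([]int{0, 1}, []int{2, 3, 4})
    if err := cluster.Write("x", "1"); err != nil {
        t.Fatalf("majority partition rejected write: %v", err)
    }

    // Heal the partition and confirm all nodes agree on the committed value.
    cluster.Heal()
    for _, v := range cluster.ReadAll("x") {
        if v != "1" {
            t.Fatalf("nodes disagree after heal: got %q, want %q", v, "1")
        }
    }
}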

3. Property-Based Testing

# Property: Safety - all nodes agree on committed values
from hypothesis import given
from hypothesis.strategies import lists, sampled_from

@given(operations=lists(sampled_from(['read', 'write', 'partition'])))
def test_consensus_safety(operations):
    cluster = RaftCluster(nodes=5)

    for op in operations:
        if op == 'write':
            cluster.write(random_key(), random_value())
        elif op == 'read':
            values = cluster.read_all(random_key())
            assert all_equal(values), "Nodes disagree on committed value"
        elif op == 'partition':
            cluster.partition_random()

Future Directions

1. Flexible Paxos

  • Decouples leader election from replication
  • Allows different quorum sizes for different phases
  • Better performance in some scenarios
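
The core insight is that only phase-1 (leader election) quorums and phase-2 (replication) quorums need to intersect. A quick illustrative check:

// flexibleQuorumsValid checks the Flexible Paxos requirement: every phase-1
// quorum of size q1 must intersect every phase-2 quorum of size q2 in an
// n-node cluster, i.e. q1 + q2 > n. Classic majorities are the special case
// q1 = q2 = n/2 + 1.
func flexibleQuorumsValid(n, q1, q2 int) bool {
    return q1+q2 > n
}

// Example: with n = 6, choosing q1 = 5 permits q2 = 2, trading slower
// (rarer) leader elections for faster steady-state commits.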

2. EPaxos (Egalitarian Paxos)

  • No distinguished leader
  • Non-interfering commands can be committed independently, without a fixed global order
  • Better load distribution

3. Raft Extensions

  • Parallel Raft: Relaxes Raft's strict in-order log replication, allowing entries to be replicated and acknowledged out of order
  • Multi-Raft: Sharding with multiple Raft groups
  • Learner Nodes: Read-only replicas for scaling reads
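
As a concrete illustration of the Multi-Raft idea, keys can be routed to independent Raft groups by hash; the RaftGroup type and its Propose method are hypothetical (an import of "hash/fnv" is assumed):

// groupForKey picks the Raft group responsible for a key, so each shard runs
// its own leader election and log replication independently of the others.
func groupForKey(key string, groups []*RaftGroup) *RaftGroup {
    h := fnv.New32a()
    h.Write([]byte(key))
    return groups[int(h.Sum32())%len(groups)]
}

func proposeWrite(key, value string, groups []*RaftGroup) error {
    // Only the chosen group's leader needs to reach consensus for this key.
    return groupForKey(key, groups).Propose(key, value)
}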

4. Quantum-Safe Consensus

  • Preparing for quantum computing threats
  • Post-quantum cryptography integration
  • New security models

Conclusion

Consensus algorithms are fundamental building blocks of distributed systems. Understanding their trade-offs helps in:

  • Choosing appropriate algorithms for specific use cases
  • Implementing reliable distributed systems
  • Debugging consensus-related issues
  • Optimizing performance based on workload characteristics

The key is matching the algorithm to your specific requirements for consistency, availability, partition tolerance, and performance.

References

Foundational Papers

  • Raft: "In Search of an Understandable Consensus Algorithm" - Ongaro & Ousterhout
  • Paxos: "The Part-Time Parliament" - Leslie Lamport
  • PBFT: "Practical Byzantine Fault Tolerance" - Castro & Liskov
  • FLP: "Impossibility of Distributed Consensus with One Faulty Process" - Fischer, Lynch, Paterson

Implementations

  • Distributed Systems Concepts
  • etcd Configuration
  • ZooKeeper