Consensus Algorithms: Deep Dive
Consensus algorithms are fundamental to distributed systems, enabling multiple nodes to agree on a single value or state even in the presence of failures. This deep dive explores the most important consensus algorithms, their mechanisms, trade-offs, and real-world applications.
The Consensus Problem
Definition
The consensus problem requires that all non-faulty processes in a distributed system agree on a single value, even when some processes may fail or behave maliciously.
Properties of Consensus
Any consensus algorithm must satisfy these properties:
- Agreement: All non-faulty processes decide on the same value
- Validity: If all processes propose the same value, that value is decided
- Termination: All non-faulty processes eventually decide on some value
Why Consensus is Hard
- Network Partitions: Messages can be delayed, lost, or reordered
- Process Failures: Nodes can crash, become unresponsive, or behave maliciously
- FLP Impossibility: In a fully asynchronous network, no deterministic algorithm can guarantee consensus termination if even one process may crash
- Timing Assumptions: Real systems require timeouts and failure detectors
Raft Consensus Algorithm
Overview
Raft is designed to be understandable while providing the same guarantees as Paxos. It separates the key elements of consensus: leader election, log replication, and safety.
Key Concepts
1. Server States
Every server in a Raft cluster is in one of three states:
- Follower: Passive, responds to requests from leaders and candidates
- Candidate: Actively seeking votes to become leader
- Leader: Handles all client requests and manages log replication
2. Terms
- Term: A logical time period; term numbers increase monotonically
- Each term begins with an election; at most one leader is elected per term (a split vote ends the term with no leader)
- Each server persists its current term number
- Terms act as logical clocks to detect stale leaders and stale information, as sketched below
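As a minimal sketch (using the hypothetical RaftNode type from the implementation example later on this page, not any particular library), the term rule every RPC handler applies looks like this:

// handleTerm applies the standard Raft term rule: an RPC carrying a higher
// term forces this server back to follower and updates its term; an RPC
// carrying a lower term is stale and should be rejected by the caller.
func (rn *RaftNode) handleTerm(rpcTerm int) (stale bool) {
    if rpcTerm > rn.currentTerm {
        rn.currentTerm = rpcTerm
        rn.votedFor = ""
        rn.state = Follower
    }
    return rpcTerm < rn.currentTerm
}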
3. Log Structure
Index:   1   2   3   4   5   6   7   8   9  10  11
Term:    1   1   1   2   2   2   3   3   3   3   3
Entry:  [x] [y] [z] [a] [b] [c] [d] [e] [f] [g] [h]
Each log entry contains:
- Command: The state machine command
- Term: When entry was created
- Index: Position in the log
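In code, an entry can be modeled directly from these three fields. The sketch below is one possible Go shape (field names are illustrative) and matches how the RaftNode example later on this page stores its log:

// LogEntry is one slot in the replicated log.
type LogEntry struct {
    Index   int    // position in the log (1-based in the Raft paper)
    Term    int    // term in which a leader created the entry
    Command string // opaque state machine command, e.g. "set x=5"
}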
Raft Algorithm Components
1. Leader Election
Process:
- Server starts as follower
- If no heartbeat arrives from a leader before the election timeout expires, it becomes a candidate
- Increments term, votes for self
- Requests votes from other servers
- Becomes leader if receives majority votes
Election Rules:
- A candidate wins if it receives votes from a majority of servers
- If the vote splits and no candidate wins, the election times out and a new election starts in a higher term
- Servers vote for at most one candidate per term (first-come, first-served)
- A candidate receives a vote only if its log is at least as up-to-date as the voter's (a candidate-side sketch follows the example below)
Example Election Scenario:
Initial State: All followers
Server A timeout → becomes candidate (term 2)
Server A → Server B: RequestVote(term=2, candidateId=A)
Server A → Server C: RequestVote(term=2, candidateId=A)
Server B votes for A (first candidate in term 2)
Server C votes for A
Server A receives majority → becomes leader
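A sketch of the candidate side of this flow, under simplifying assumptions: rn.id and sendRequestVote are hypothetical helpers, RPCs are issued sequentially, and the LastLogIndex/LastLogTerm fields needed for the up-to-date check are omitted for brevity:

// startElection is run by a follower whose election timeout expired.
func (rn *RaftNode) startElection() {
    rn.state = Candidate
    rn.currentTerm++    // start a new term
    rn.votedFor = rn.id // vote for ourselves
    votes := 1

    for _, peer := range rn.peers {
        reply := rn.sendRequestVote(peer, RequestVoteArgs{
            Term:        rn.currentTerm,
            CandidateId: rn.id,
        })
        if reply.Term > rn.currentTerm {
            // A newer term exists somewhere: step down immediately.
            rn.currentTerm = reply.Term
            rn.state = Follower
            return
        }
        if reply.VoteGranted {
            votes++
        }
    }

    if votes > (len(rn.peers)+1)/2 {
        rn.state = Leader // received votes from a majority of the cluster
    }
    // Otherwise remain a candidate; a new election starts after another timeout.
}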
2. Log Replication
Process:
- Client sends command to leader
- Leader appends entry to local log
- Leader sends AppendEntries to followers
- Leader waits for majority acknowledgment
- Leader commits entry and responds to client
- Leader notifies followers of commitment
AppendEntries RPC:
{
  "term": 3,
  "leaderId": "server-1",
  "prevLogIndex": 5,
  "prevLogTerm": 2,
  "entries": [
    { "term": 3, "command": "set x=5" },
    { "term": 3, "command": "set y=10" }
  ],
  "leaderCommit": 4
}
Consistency Check:
- Follower checks if prevLogIndex and prevLogTerm match
- If mismatch, follower rejects AppendEntries
- Leader decrements nextIndex and retries
- Eventually finds point where logs match
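A simplified follower-side sketch of this check (same hypothetical RaftNode type as above; the reply's Success field and Go 1.21's built-in min are assumed, and conflict handling is reduced to truncate-and-append):

func (rn *RaftNode) AppendEntries(args AppendEntriesArgs) AppendEntriesReply {
    reply := AppendEntriesReply{Term: rn.currentTerm}

    // Reject stale leaders.
    if args.Term < rn.currentTerm {
        return reply
    }
    rn.resetElectionTimeout() // valid leader contact

    // Consistency check: the entry at prevLogIndex must exist and carry prevLogTerm.
    if args.PrevLogIndex > 0 &&
        (args.PrevLogIndex > len(rn.log) ||
            rn.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
        return reply // leader will decrement nextIndex and retry
    }

    // Append the new entries, overwriting any conflicting suffix,
    // then advance commitIndex as far as the leader allows.
    rn.log = append(rn.log[:args.PrevLogIndex], args.Entries...)
    if args.LeaderCommit > rn.commitIndex {
        rn.commitIndex = min(args.LeaderCommit, len(rn.log))
    }
    reply.Success = true
    return reply
}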
3. Safety Properties
- Election Safety: at most one leader can be elected in a given term
- Leader Append-Only: a leader never overwrites or deletes entries in its own log
- Log Matching: if two logs contain an entry with the same index and term, the logs are identical up through that entry
- Leader Completeness: if an entry is committed in a given term, it is present in the logs of the leaders of all higher terms
- State Machine Safety: if a server has applied an entry at a given index, no other server ever applies a different entry at that index
Raft Implementation Example
type RaftNode struct {
    // Persistent state
    currentTerm int
    votedFor    string
    log         []LogEntry

    // Volatile state
    commitIndex int
    lastApplied int

    // Leader state
    nextIndex  map[string]int
    matchIndex map[string]int

    state NodeState // Follower, Candidate, Leader
    peers []string

    electionTimeout  time.Duration
    heartbeatTimeout time.Duration
}
func (rn *RaftNode) RequestVote(args RequestVoteArgs) RequestVoteReply {
    reply := RequestVoteReply{Term: rn.currentTerm}

    // Reply false if the candidate's term is outdated
    if args.Term < rn.currentTerm {
        reply.VoteGranted = false
        return reply
    }

    // Step down and clear our vote if the candidate's term is newer
    if args.Term > rn.currentTerm {
        rn.currentTerm = args.Term
        rn.votedFor = ""
        rn.state = Follower
    }
    reply.Term = rn.currentTerm

    // Grant the vote only if we haven't already voted for someone else this term
    // and the candidate's log is at least as up-to-date as ours
    if (rn.votedFor == "" || rn.votedFor == args.CandidateId) &&
        rn.isLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
        rn.votedFor = args.CandidateId
        reply.VoteGranted = true
        rn.resetElectionTimeout()
    }
    return reply
}
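The isLogUpToDate check used above follows the Raft paper's rule: the log whose last entry has the higher term is more up-to-date; if the last terms are equal, the longer log is more up-to-date. A sketch:

// isLogUpToDate reports whether a candidate log described by
// (lastLogIndex, lastLogTerm) is at least as up-to-date as ours.
func (rn *RaftNode) isLogUpToDate(lastLogIndex, lastLogTerm int) bool {
    ourLastIndex := len(rn.log)
    ourLastTerm := 0
    if ourLastIndex > 0 {
        ourLastTerm = rn.log[ourLastIndex-1].Term
    }
    if lastLogTerm != ourLastTerm {
        return lastLogTerm > ourLastTerm
    }
    return lastLogIndex >= ourLastIndex
}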
Raft Optimizations
1. Log Compaction (Snapshotting)
- Periodically create snapshots of state machine
- Discard log entries before snapshot
- Reduces storage and lets slow or new followers catch up from a snapshot instead of replaying the entire log
2. Batch Operations
- Batch multiple client requests
- Reduces number of consensus rounds
- Improves throughput
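One way to batch on the leader, sketched with a Go channel feeding the proposal path (proposeCh, appendToLog, and replicateBatch are assumed names, not part of any library):

// batchLoop drains pending client commands and replicates them as a single
// AppendEntries round instead of paying one consensus round per command.
func (rn *RaftNode) batchLoop(proposeCh <-chan string, maxBatch int) {
    for cmd := range proposeCh {
        batch := []string{cmd}
        // Greedily pick up whatever else is already queued, up to maxBatch.
    drain:
        for len(batch) < maxBatch {
            select {
            case next := <-proposeCh:
                batch = append(batch, next)
            default:
                break drain
            }
        }
        entries := rn.appendToLog(batch) // one local append for the whole batch
        rn.replicateBatch(entries)       // one replication round for the whole batch
    }
}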
3. Pipeline Replication
- Send multiple AppendEntries without waiting
- Improves latency for log replication
- Requires careful ordering
Paxos Algorithm
Overview
Paxos is a family of protocols for solving consensus in unreliable networks. It's proven to be correct but notoriously difficult to understand and implement.
Basic Paxos
Roles
- Proposer: Proposes values
- Acceptor: Votes on proposed values
- Learner: Learns chosen values
Phases
Phase 1: Prepare
- Proposer chooses proposal number n
- Sends Prepare(n) to majority of acceptors
- Acceptor responds with a promise not to accept any proposal numbered less than n
- If the acceptor has already accepted a proposal, it includes that proposal's number and value in its response
Phase 2: Accept
- If proposer receives majority promises, sends Accept(n, v)
- Value v is either from highest-numbered accepted proposal or proposer's value
- Acceptor accepts if n >= highest promised number
- If majority accepts, value is chosen
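The acceptor's side of both phases fits in a few lines. A minimal single-node, in-memory sketch (types and method names are illustrative; a real deployment adds networking and durable storage of the promised/accepted state):

type Acceptor struct {
    promised    int    // highest proposal number promised so far
    acceptedNum int    // highest proposal number accepted (0 = none yet)
    acceptedVal string // value of that accepted proposal
}

// Prepare handles Phase 1: promise to ignore proposals numbered below n and
// report any proposal already accepted so the proposer can adopt its value.
func (a *Acceptor) Prepare(n int) (ok bool, accNum int, accVal string) {
    if n > a.promised {
        a.promised = n
        return true, a.acceptedNum, a.acceptedVal
    }
    return false, 0, ""
}

// Accept handles Phase 2: accept (n, v) unless a higher number was promised.
func (a *Acceptor) Accept(n int, v string) bool {
    if n >= a.promised {
        a.promised = n
        a.acceptedNum = n
        a.acceptedVal = v
        return true
    }
    return false
}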
Example Paxos Execution
Proposer P1 wants to propose value "A"
Proposer P2 wants to propose value "B"
Phase 1:
P1 → Acceptors: Prepare(1)
A1: Promise(1) - haven't accepted anything
A2: Promise(1) - haven't accepted anything
A3: Promise(1) - haven't accepted anything
P2 → Acceptors: Prepare(2)
A1: Promise(2) - haven't accepted anything
A2: Promise(2) - haven't accepted anything
A3: Promise(2) - haven't accepted anything
Phase 2:
P1 → Acceptors: Accept(1, "A") - rejected (promised higher number 2)
P2 → Acceptors: Accept(2, "B") - accepted by majority
Value "B" is chosen
Multi-Paxos
Basic Paxos chooses a single value. Multi-Paxos extends this to choose an ordered sequence of values:
- Leader Election: Choose distinguished proposer
- Skip Phase 1: Leader can skip prepare phase for subsequent proposals
- Log Ordering: Maintain ordered sequence of chosen values
Paxos Variants
1. Fast Paxos
- Allows clients to send values directly to acceptors
- Reduces latency from 4 to 3 message delays
- Requires larger quorums (3/4 instead of 1/2)
2. Cheap Paxos
- Uses fewer acceptors during normal operation
- Reconfigures to use more acceptors during failures
- Reduces cost while maintaining fault tolerance
3. Byzantine Paxos
- Handles Byzantine (malicious) failures
- Requires 3f+1 replicas to tolerate f Byzantine failures
- Uses digital signatures for authentication
Paxos vs Raft Comparison
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex, hard to understand | Simple, designed for clarity |
| Implementation | Difficult, many edge cases | Straightforward |
| Leader Election | No built-in leader election | Strong leader model |
| Log Structure | Flexible, allows gaps | Contiguous, no gaps |
| Performance | Can be optimized for throughput | Good balance of performance/simplicity |
| Reconfiguration | Complex | Simpler with joint consensus |
Byzantine Fault Tolerance (BFT)
Problem Definition
Byzantine failures are the most general type of failure, where nodes can:
- Crash or become unresponsive
- Send incorrect or contradictory messages
- Behave maliciously or be compromised
- Collude with other Byzantine nodes
Byzantine Generals Problem
Scenario: Byzantine army with generals surrounding enemy city
- Generals must coordinate attack or retreat
- Some generals may be traitors
- Communication only through messengers
- Need majority agreement despite traitors
Solution Requirements:
- All loyal generals decide on same action
- Small number of traitors cannot cause loyal generals to adopt bad plan
PBFT (Practical Byzantine Fault Tolerance)
System Model
- Assumptions: Partial synchrony, authenticated messages
- Fault Model: Up to f Byzantine failures out of 3f+1 total replicas
- Safety: Never violates safety properties
- Liveness: Eventually makes progress
PBFT Algorithm Phases
Phase 1: Pre-prepare
Primary → Backups: PRE-PREPARE(v, n, m)
v = view number
n = sequence number
m = client request
Phase 2: Prepare
Backup i → All: PREPARE(v, n, d, i)
d = digest of request m
Phase 3: Commit
Replica i → All: COMMIT(v, n, d, i)
After receiving 2f matching PREPARE messages
Execution
After receiving 2f+1 matching COMMIT messages:
- Execute request
- Send reply to client
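The quorum arithmetic behind these phases, made concrete in a short sketch (real PBFT counts messages per (view, sequence number, digest) tuple and includes the replica's own messages):

// pbftQuorums returns the thresholds used in a cluster of n = 3f+1 replicas.
// For n = 4: f = 1, prepareQuorum = 2, commitQuorum = 3.
func pbftQuorums(n int) (f, prepareQuorum, commitQuorum int) {
    f = (n - 1) / 3        // maximum number of Byzantine replicas tolerated
    prepareQuorum = 2 * f  // matching PREPAREs (plus the PRE-PREPARE) to become "prepared"
    commitQuorum = 2*f + 1 // matching COMMITs required before executing the request
    return f, prepareQuorum, commitQuorum
}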
PBFT Example Execution
Client Request: "transfer $100 from A to B"
Phase 1 - Pre-prepare:
Primary → All Backups: PRE-PREPARE(view=1, seq=5, request)
Phase 2 - Prepare:
Backup1 → All: PREPARE(1, 5, hash(request), 1)
Backup2 → All: PREPARE(1, 5, hash(request), 2)
Backup3 → All: PREPARE(1, 5, hash(request), 3)
Phase 3 - Commit:
Each replica after receiving 2f PREPARE messages:
Replica1 → All: COMMIT(1, 5, hash(request), 1)
Replica2 → All: COMMIT(1, 5, hash(request), 2)
Replica3 → All: COMMIT(1, 5, hash(request), 3)
Execution:
Each replica after receiving 2f+1 COMMIT messages:
- Executes "transfer $100 from A to B"
- Sends reply to client
View Changes
When primary fails:
- Replicas detect primary failure via timeout
- Replicas broadcast VIEW-CHANGE messages
- New primary collects 2f+1 VIEW-CHANGE messages
- New primary broadcasts NEW-VIEW message
- System continues with new primary
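A sketch of the trigger side of a view change, under assumptions: Replica, executed, and broadcastViewChange are illustrative names, and the new primary is chosen deterministically as replica (view+1) mod n as in PBFT:

type Replica struct {
    view     int
    executed chan struct{} // signaled whenever a client request is executed
}

// broadcastViewChange would send VIEW-CHANGE(v, last stable checkpoint,
// prepared proofs) to all replicas; the network send is omitted here.
func (r *Replica) broadcastViewChange(v int) { /* stub */ }

// monitorPrimary suspects the primary if no request completes within the
// timeout and votes to move from the current view to the next one.
func (r *Replica) monitorPrimary(requestTimeout time.Duration) {
    for {
        select {
        case <-r.executed: // progress observed; keep the current primary
        case <-time.After(requestTimeout):
            r.view++
            r.broadcastViewChange(r.view)
        }
    }
}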
Modern BFT Algorithms
1. HotStuff
- Linear Communication: O(n) message complexity
- Responsiveness: Fast during good periods
- Simplicity: Easier to understand than PBFT
- Used By: LibraBFT (Diem blockchain)
2. Tendermint
- Immediate Finality: No forks, immediate transaction finality
- Application Agnostic: Separates consensus from application logic
- Used By: Cosmos blockchain ecosystem
3. SBFT (Scalable Byzantine Fault Tolerance)
- High Throughput: Optimized for performance
- Scalability: Better scaling than PBFT
- Batching: Efficient request batching
Consensus in Practice
etcd (Uses Raft)
# etcd cluster configuration
name: node1
data-dir: /var/lib/etcd
listen-client-urls: http://localhost:2379
advertise-client-urls: http://localhost:2379
listen-peer-urls: http://localhost:2380
initial-advertise-peer-urls: http://localhost:2380
initial-cluster: node1=http://localhost:2380,node2=http://node2:2380,node3=http://node3:2380
initial-cluster-state: new
Key Features:
- Strongly consistent key-value store
- Used by Kubernetes for cluster state
- Raft consensus for leader election and log replication
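For illustration, a small client program against such a cluster, assuming the official Go client (go.etcd.io/etcd/client/v3); the endpoint and key are placeholders:

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Connect to the cluster; any member can serve requests, writes go through the Raft leader.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // Put is committed through Raft before it returns.
    if _, err := cli.Put(ctx, "/config/feature-x", "enabled"); err != nil {
        panic(err)
    }

    // Reads are linearizable by default.
    resp, err := cli.Get(ctx, "/config/feature-x")
    if err != nil {
        panic(err)
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s = %s\n", kv.Key, kv.Value)
    }
}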
Apache Cassandra (Uses Paxos for Lightweight Transactions)
-- Lightweight transaction using Paxos
INSERT INTO users (id, name, email)
VALUES (123, 'John Doe', 'john@example.com')
IF NOT EXISTS;
-- Conditional update using Paxos
UPDATE users
SET email = 'newemail@example.com'
WHERE id = 123
IF email = 'john@example.com';
Blockchain Consensus
1. Proof of Work (Bitcoin)
- Miners compete to solve cryptographic puzzle
- Longest chain rule for consensus
- Energy intensive but secure
2. Proof of Stake (Ethereum 2.0)
- Validators chosen based on stake
- More energy efficient
- Economic penalties for misbehavior
3. Practical Byzantine Fault Tolerance (permissioned blockchains, e.g. early Hyperledger Fabric releases)
- Permissioned network
- Fast finality
- Suitable for enterprise applications
Performance Characteristics
Latency Comparison
| Algorithm | Message Rounds | Latency (ms) | Throughput (ops/sec) |
|---|---|---|---|
| Raft | 2 | 1-5 | 10,000-100,000 |
| Multi-Paxos | 2 | 1-5 | 10,000-100,000 |
| PBFT | 3 | 5-20 | 1,000-10,000 |
| HotStuff | 3 | 3-15 | 5,000-50,000 |
Note: Performance varies significantly based on network conditions, hardware, and implementation quality.
Scalability Limits
Raft/Paxos
- Practical Limit: 5-7 nodes for strong consistency
- Bottleneck: Leader becomes bottleneck
- Network: O(n²) messages for some operations
Byzantine Fault Tolerance
- Practical Limit: 10-20 nodes
- Bottleneck: O(n²) or O(n³) message complexity
- Overhead: Cryptographic operations
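These limits follow from the replica counts each fault model requires; a small sketch (majority quorums for crash faults, intersecting 2f+1 quorums out of 3f+1 replicas for Byzantine faults):

// replicasNeeded returns the minimum cluster size to tolerate f failed nodes.
func replicasNeeded(f int, byzantine bool) int {
    if byzantine {
        // Two quorums of size 2f+1 out of 3f+1 intersect in at least f+1
        // replicas, so at least one honest replica is in the intersection.
        return 3*f + 1
    }
    // Crash faults: in a cluster of 2f+1, any two majorities (f+1 nodes)
    // intersect in at least one replica, which carries the committed state forward.
    return 2*f + 1
}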
Choosing the Right Consensus Algorithm
Decision Matrix
Use Raft When:
- ✅ Need strong consistency
- ✅ Simplicity is important
- ✅ Small cluster size (3-7 nodes)
- ✅ Crash failures only
- ✅ Examples: etcd, HashiCorp Consul, CockroachDB
Use Paxos When:
- ✅ Need proven correctness
- ✅ Complex deployment scenarios
- ✅ Can handle implementation complexity
- ✅ Examples: Google Spanner, Apache Cassandra LWT
Use BFT When:
- ✅ Need Byzantine fault tolerance
- ✅ Untrusted or adversarial environment
- ✅ Can tolerate higher latency/lower throughput
- ✅ Examples: Blockchain, critical infrastructure
Use Eventual Consistency When:
- ✅ High availability is critical
- ✅ Can tolerate temporary inconsistency
- ✅ Global scale deployment
- ✅ Examples: DNS, CDNs, social media feeds
Implementation Considerations
1. Network Partitions
// Partition detection and handling
func (node *RaftNode) handlePartition() {
    if node.isPartitioned() {
        if node.hasQuorum() {
            // Continue as leader/follower
            node.continueOperation()
        } else {
            // Step down and become follower
            node.stepDown()
        }
    }
}
2. Failure Detection
// Heartbeat mechanism
func (node *RaftNode) sendHeartbeats() {
    for node.isLeader() {
        for _, peer := range node.peers {
            go node.sendAppendEntries(peer, true) // empty heartbeat
        }
        time.Sleep(node.heartbeatInterval)
    }
}
3. Log Compaction
// Snapshot creation
// (assumes node.log is indexed so that node.log[i] holds the entry at Raft index i)
func (node *RaftNode) createSnapshot() {
    snapshot := Snapshot{
        LastIncludedIndex: node.lastApplied,
        LastIncludedTerm:  node.log[node.lastApplied].Term,
        Data:              node.stateMachine.serialize(),
    }
    node.persistSnapshot(snapshot)
    node.trimLog(snapshot.LastIncludedIndex) // discard entries covered by the snapshot
}
Testing Consensus Implementations
1. Jepsen Testing
; Jepsen test for a consensus system
(deftest raft-test
  (let [test (assoc tests/noop-test
                    :name    "raft-consensus"
                    :client  (raft-client)
                    :nemesis (nemesis/partition-random-halves)
                    :checker (checker/linearizable))]
    (is (:valid? (:results (jepsen/run! test))))))
2. Chaos Engineering
- Network partitions
- Node crashes
- Clock skew
- Message delays/reordering
- Byzantine failures
3. Property-Based Testing
# Property: Safety - all nodes agree on committed values
# (RaftCluster, random_key, random_value, and all_equal are assumed test helpers)
from hypothesis import given
from hypothesis.strategies import lists, sampled_from

@given(operations=lists(sampled_from(['read', 'write', 'partition'])))
def test_consensus_safety(operations):
    cluster = RaftCluster(nodes=5)
    for op in operations:
        if op == 'write':
            cluster.write(random_key(), random_value())
        elif op == 'read':
            values = cluster.read_all(random_key())
            assert all_equal(values), "Nodes disagree on committed value"
        elif op == 'partition':
            cluster.partition_random()
Future Directions
1. Flexible Paxos
- Decouples leader election from replication
- Allows different quorum sizes for different phases
- Better performance in some scenarios
2. EPaxos (Egalitarian Paxos)
- No distinguished leader
- Commands can be committed in any order
- Better load distribution
3. Raft Extensions
- Parallel Raft: Multiple independent Raft groups
- Multi-Raft: Sharding with multiple Raft groups
- Learner Nodes: Read-only replicas for scaling reads
4. Quantum-Safe Consensus
- Preparing for quantum computing threats
- Post-quantum cryptography integration
- New security models
Conclusion
Consensus algorithms are fundamental building blocks of distributed systems. Understanding their trade-offs helps in:
- Choosing appropriate algorithms for specific use cases
- Implementing reliable distributed systems
- Debugging consensus-related issues
- Optimizing performance based on workload characteristics
The key is matching the algorithm to your specific requirements for consistency, availability, partition tolerance, and performance.
References
Foundational Papers
- Raft: "In Search of an Understandable Consensus Algorithm" - Ongaro & Ousterhout
- Paxos: "The Part-Time Parliament" - Leslie Lamport
- PBFT: "Practical Byzantine Fault Tolerance" - Castro & Liskov
- FLP: "Impossibility of Distributed Consensus with One Faulty Process" - Fischer, Lynch, Paterson
Implementations
- etcd: github.com/etcd-io/etcd
- HashiCorp Raft: github.com/hashicorp/raft
- BFT-SMaRt: github.com/bft-smart/library
Related Wiki Pages
- Distributed Systems Concepts
- etcd Configuration
- ZooKeeper