Consistency & Consensus

Distributed systems fundamentals from DDIA Chapter 9

📚 Designing Data-Intensive Applications - Martin Kleppmann

The Core Problem

In a distributed system, nodes can fail, networks can partition, and messages can be delayed or lost. Despite this, we often need nodes to agree on something — which node is the leader, whether a transaction should commit, or what value to store.

Consensus is the problem of getting multiple nodes to agree on a single value, even when some nodes fail or messages are lost.

This is harder than it sounds. The FLP impossibility result proves that no deterministic consensus algorithm can guarantee progress in an asynchronous system where even one node can crash. Practical algorithms sidestep FLP by relying on timeouts or randomness: they can stall under pathological timing, but they never violate safety.

Consistency Models

Linearizability (Strong Consistency)

The strongest single-object consistency model. Once a write completes, all subsequent reads must see that value or a later one. The system behaves as if there's only one copy of the data.

Intuition: If operation A completes before operation B starts (in real time), then A must appear before B in the total order.
// Linearizable behavior
Client 1: write(x, 1) ────────────────►
Client 2:              read(x) → 1 ───►
Client 3:                     read(x) → 1 (must see 1, not old value)

Cost: Requires coordination. In the presence of network partitions, you must choose between availability and linearizability (CAP theorem).

Use cases: Leader election, distributed locks, unique constraints.
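
To make the lock and leader-election use cases concrete, here is a minimal single-node Go sketch of the compare-and-set contract a linearizable store must offer (the Register type and node names are illustrative, not from any real client library):
// Sketch: CAS semantics of a linearizable register
package main

import (
	"fmt"
	"sync"
)

// Register behaves as if there were a single copy of the data: every
// operation takes effect atomically at one point in time. A mutex
// gives this trivially on one node; the hard part for systems like
// etcd or ZooKeeper is preserving the same contract across replicas.
type Register struct {
	mu  sync.Mutex
	val string
}

// CompareAndSet writes update only if the register still holds expect.
func (r *Register) CompareAndSet(expect, update string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.val != expect {
		return false
	}
	r.val = update
	return true
}

func main() {
	r := &Register{}
	// Two nodes race to acquire the lock / become leader. Because the
	// register is linearizable, exactly one CAS sees "" and wins.
	fmt.Println(r.CompareAndSet("", "node-1")) // true
	fmt.Println(r.CompareAndSet("", "node-2")) // false: must see node-1's write
}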

Sequential Consistency

All operations appear in some sequential order that's consistent with the program order of each client. But this order doesn't have to match real-time order.

Key difference from linearizability: A read might return a "stale" value even after a write has completed, as long as the overall order is consistent.
// Sequentially consistent (but not linearizable)
Client 1: write(x, 1) at T=0
Client 2: read(x) → 0 at T=1  // OK! Can still see old value
Client 2: read(x) → 1 at T=2  // Once you see 1, can't go back to 0
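
A minimal Go sketch of one way this behavior arises: all writes go through a single shared log, but each replica applies it at its own pace (the replica type and log layout are illustrative):
// Sketch: one total order, applied lazily per replica
package main

import "fmt"

// A single shared log fixes one global order of writes; each replica
// applies entries at its own pace. A read from a lagging replica can
// return a stale value (so this is not linearizable), but every
// replica moves through the same sequence and never goes backwards.
type replica struct {
	applied int // log entries applied so far
	x       int
}

func (r *replica) catchUp(log []int, upTo int) {
	for ; r.applied < upTo; r.applied++ {
		r.x = log[r.applied]
	}
}

func main() {
	log := []int{1} // Client 1: write(x, 1) appended at T=0
	b := &replica{} // lagging replica serving Client 2

	fmt.Println(b.x) // T=1: read(x) -> 0, allowed
	b.catchUp(log, len(log))
	fmt.Println(b.x) // T=2: read(x) -> 1; 0 can never be seen again
}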

Causal Consistency

Operations that are causally related must be seen in the same order by all nodes. Concurrent operations (no causal relationship) can be seen in different orders.

Causality: If A "happened before" B (A→B), then everyone must see A before B. But if A and B are concurrent, different nodes can disagree on their order.
// Causal relationship
Alice: post("Hello")           // A
Bob:   read(Alice's post)      // B (B depends on A)
Bob:   reply("Hi Alice!")      // C (C depends on B, transitively on A)

Everyone must see: A → B → C
But concurrent posts from Carol can appear anywhere.

Advantage: Can be implemented without global coordination, better availability.
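
Vector clocks are one standard mechanism for tracking the happened-before relation; here is a minimal Go sketch, with node IDs chosen for illustration:
// Sketch: tracking causality with vector clocks
package main

import "fmt"

// VectorClock is one common way (not the only one) to capture
// happened-before without global coordination: each node increments
// its own entry on a local event and merges entries on receipt.
type VectorClock map[string]int

// HappensBefore reports whether every entry of a is <= the matching
// entry of b, with at least one strictly less (missing entries are 0).
func (a VectorClock) HappensBefore(b VectorClock) bool {
	strictlyLess := false
	for node, ta := range a {
		if ta > b[node] {
			return false
		}
		if ta < b[node] {
			strictlyLess = true
		}
	}
	for node, tb := range b {
		if _, ok := a[node]; !ok && tb > 0 {
			strictlyLess = true
		}
	}
	return strictlyLess
}

func Concurrent(a, b VectorClock) bool {
	return !a.HappensBefore(b) && !b.HappensBefore(a)
}

func main() {
	post := VectorClock{"alice": 1}            // A: Alice's post
	reply := VectorClock{"alice": 1, "bob": 1} // C: Bob replied after reading A
	carol := VectorClock{"carol": 1}           // an unrelated, concurrent post

	fmt.Println(post.HappensBefore(reply)) // true: all nodes order A before C
	fmt.Println(Concurrent(post, carol))   // true: nodes may order these freely
}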

Eventual Consistency

The weakest useful guarantee. If no new updates are made, eventually all replicas will converge to the same value. No guarantees about when or what you'll read in the meantime.

Warning: "Eventual" could mean seconds, minutes, or hours. The system provides no bound. You might read arbitrarily stale data.
// Eventually consistent
write(x, 1) to node A
read(x) from node B → might return old value
... time passes, replication happens ...
read(x) from node B → eventually returns 1

Use cases: DNS, caches, social media feeds (where staleness is acceptable).
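
A minimal Go sketch of one convergence rule, last-writer-wins (not the only option, and not what every system uses):
// Sketch: convergence via last-writer-wins timestamps
package main

import "fmt"

// LWWRegister shows one simple convergence rule: replicas exchange
// (value, timestamp) pairs and keep the highest timestamp. Once
// updates stop, every replica ends up with the same value. LWW
// silently discards concurrent writes, which is why some systems
// use version vectors or CRDTs instead.
type LWWRegister struct {
	val string
	ts  int64 // timestamp of the last write
}

// Merge is called when replication delivers another replica's state.
func (r *LWWRegister) Merge(other LWWRegister) {
	if other.ts > r.ts {
		*r = other
	}
}

func main() {
	a := LWWRegister{"x=1", 100} // write(x, 1) applied at node A
	b := LWWRegister{"x=0", 50}  // node B still has the old value

	fmt.Println(b.val) // read from B -> "x=0" (stale)
	b.Merge(a)         // ... time passes, replication happens ...
	fmt.Println(b.val) // read from B -> "x=1": converged
}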

Consensus Algorithms

What Consensus Must Achieve

A consensus algorithm must satisfy four properties:

Uniform agreement: No two nodes decide on different values.
Integrity: No node decides twice.
Validity: If a node decides value v, then v was proposed by some node.
Termination: Every node that does not crash eventually decides some value.

Interactive: Raft Leader Election

[Interactive demo: a five-node cluster (N1 leader, N2-N5 followers) initialized at term 1. Click nodes to simulate failures and watch the cluster elect a new leader.]
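
The heart of Raft's leader election is a randomized timeout. A minimal Go sketch of the follower-side trigger (channel names and the simulated leader are illustrative, not from any real implementation):
// Sketch: a follower's randomized election timeout
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized duration (here 150-300ms, the
// range suggested in the Raft paper). Randomization makes it unlikely
// that two followers time out at the same instant and split the vote
// indefinitely.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	heartbeats := make(chan struct{})
	term := 1

	// Simulate a leader that sends a few heartbeats and then crashes.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(50 * time.Millisecond)
			heartbeats <- struct{}{}
		}
	}()

	for {
		select {
		case <-heartbeats:
			// Leader is alive; looping re-arms the election timer.
		case <-time.After(electionTimeout()):
			// No heartbeat in time: become a candidate, bump the term,
			// and (in real Raft) request votes from a majority.
			term++
			fmt.Printf("timed out, starting election for term %d\n", term)
			return
		}
	}
}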

Raft vs Paxos vs Zab

Algorithm                  Used By                      Key Idea
Paxos                      Chubby, Spanner              Two-phase: prepare/accept. Notoriously hard to understand.
Raft                       etcd, Consul, CockroachDB    Leader-based. Designed for understandability. Log replication.
Zab                        ZooKeeper                    Similar to Raft but predates it. Atomic broadcast.
Viewstamped Replication    Academic                     Predecessor to Raft. View changes for leader election.

Two-Phase Commit (2PC)

A protocol for atomic commits across multiple nodes. Either all nodes commit, or all abort.

┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│ Coordinator │         │ Participant │         │ Participant │
└──────┬──────┘         └──────┬──────┘         └──────┬──────┘
       │                       │                       │
       │──── PREPARE ─────────►│                       │
       │──── PREPARE ──────────────────────────────────►
       │                       │                       │
       │◄──── YES ─────────────│                       │
       │◄──── YES ─────────────────────────────────────│
       │                       │                       │
       │──── COMMIT ──────────►│                       │
       │──── COMMIT ───────────────────────────────────►
       │                       │                       │
       ▼                       ▼                       ▼
        
The 2PC Problem: If the coordinator crashes after sending PREPARE but before sending COMMIT/ABORT, participants are stuck. They've promised to commit but don't know the decision. This is a blocking protocol.
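
A minimal Go sketch of the coordinator's decision rule (the Participant interface and bank type are illustrative stand-ins for real RPCs):
// Sketch: the coordinator's decision rule in 2PC
package main

import (
	"errors"
	"fmt"
)

// Participant stands in for the RPC surface each node exposes.
type Participant interface {
	Prepare() error // the vote: nil = YES, error = NO
	Commit()
	Abort()
}

// twoPhaseCommit commits only if every participant votes YES;
// otherwise it aborts everywhere. The blocking hazard sits between
// the two loops: if the coordinator crashes after collecting YES
// votes but before broadcasting the decision, participants hold
// their locks and wait.
func twoPhaseCommit(ps []Participant) error {
	for _, p := range ps {
		if err := p.Prepare(); err != nil {
			for _, q := range ps {
				q.Abort()
			}
			return fmt.Errorf("transaction aborted: %w", err)
		}
	}
	for _, p := range ps {
		p.Commit()
	}
	return nil
}

// bank is a toy participant for demonstration.
type bank struct {
	name string
	ok   bool
}

func (b *bank) Prepare() error {
	if !b.ok {
		return errors.New(b.name + " votes NO")
	}
	return nil
}
func (b *bank) Commit() { fmt.Println(b.name, "committed") }
func (b *bank) Abort()  { fmt.Println(b.name, "aborted") }

func main() {
	ps := []Participant{&bank{"bank1", true}, &bank{"bank2", true}}
	fmt.Println(twoPhaseCommit(ps)) // <nil>: both banks committed
}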

Three-Phase Commit (3PC)

Adds a "pre-commit" phase so participants can recover the decision without the coordinator, making the protocol non-blocking under crash failures. But it assumes bounded network delays and breaks under network partitions, so it's rarely used in practice.

Distributed Transactions

The Problem

A transaction spans multiple partitions or databases. We need all-or-nothing semantics across independent systems.

// Transfer $100 from Account A (Bank 1) to Account B (Bank 2)
BEGIN TRANSACTION
  Bank1.debit(A, 100)   // What if this succeeds...
  Bank2.credit(B, 100)  // ...but this fails?
COMMIT

Solutions

Approach  Pros                             Cons
2PC       Strong guarantees                Blocking; coordinator is a SPOF
Saga      Non-blocking, good availability  Compensating transactions needed; eventual consistency
Avoid     Simple, fast                     Design constraints; may not be possible

Saga Pattern: Break the transaction into steps with compensating actions. If step N fails, run compensations for steps N-1, N-2, ..., 1.
1. debit(A, 100)        compensate: credit(A, 100)
2. credit(B, 100)       compensate: debit(B, 100)

If step 2 fails → run compensation 1 → credit(A, 100)
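
A minimal Go sketch of a saga executor (the step type and the closures in main are illustrative, not a real framework):
// Sketch: a saga executor with compensating actions
package main

import (
	"errors"
	"fmt"
)

// step pairs a forward action with the compensation that undoes it.
type step struct {
	name       string
	action     func() error
	compensate func()
}

// runSaga executes steps in order; if step N fails, it runs the
// compensations for steps N-1 .. 1 in reverse. Unlike 2PC there is
// no blocking, but other transactions can observe the intermediate
// states (eventual consistency, no isolation).
func runSaga(steps []step) error {
	for i, s := range steps {
		if err := s.action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].compensate()
			}
			return fmt.Errorf("saga failed at %q: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	err := runSaga([]step{
		{
			name:       "debit A",
			action:     func() error { fmt.Println("debit(A, 100)"); return nil },
			compensate: func() { fmt.Println("compensate: credit(A, 100)") },
		},
		{
			name:       "credit B",
			action:     func() error { return errors.New("Bank2 unavailable") },
			compensate: func() { fmt.Println("compensate: debit(B, 100)") },
		},
	})
	fmt.Println(err) // step 2 failed, so compensation 1 ran: credit(A, 100)
}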

Key Takeaways