Distributed systems fundamentals from DDIA Chapter 9
📚 Designing Data-Intensive Applications - Martin Kleppmann

In a distributed system, nodes can fail, networks can partition, and messages can be delayed or lost. Despite this, we often need nodes to agree on something — which node is the leader, whether a transaction should commit, or what value to store.
This is harder than it sounds. The FLP impossibility result proves that no deterministic consensus algorithm can guarantee progress in an asynchronous system where even one node can crash.
Linearizability is the strongest single-object consistency model. Once a write completes, all subsequent reads must see that value or a later one: the system behaves as if there were only a single copy of the data.
// Linearizable behavior
Client 1: write(x, 1) ────────────────►
Client 2: read(x) → 1 ───►
Client 3: read(x) → 1 (must see 1, not old value)
Cost: Requires coordination. In the presence of network partitions, you must choose between availability and linearizability (CAP theorem).
Use cases: Leader election, distributed locks, unique constraints.
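A distributed lock, for example, reduces to an atomic compare-and-set on a linearizable register. The sketch below uses an in-memory stand-in for a real store such as etcd or ZooKeeper; the class and method names are illustrative, not any library's API.

```python
import threading

class LinearizableRegister:
    """In-memory stand-in for a linearizable store (e.g. etcd).
    A mutex makes each operation atomic, so every client observes
    a single, real-time-consistent copy of the value."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # Atomic: succeeds only if the current value matches `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value

# A lock is just CAS on the register: None means "unlocked".
lock = LinearizableRegister(value=None)
assert lock.compare_and_set(None, "client-1")      # client 1 acquires
assert not lock.compare_and_set(None, "client-2")  # client 2 must wait
assert lock.compare_and_set("client-1", None)      # client 1 releases
```

The CAS must be linearizable: if two clients could both see `None` and both "win", mutual exclusion is lost — which is exactly why locks and leader election sit in the linearizable column.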
Sequential consistency: all operations appear in some single sequential order that is consistent with each client's program order, but that order doesn't have to match real-time order.
// Sequentially consistent (but not linearizable)
Client 1: write(x, 1) at T=0
Client 2: read(x) → 0 at T=1 // OK! Can still see old value
Client 2: read(x) → 1 at T=2 // Once you see 1, can't go back to 0
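The "can see 0, but can't go back after seeing 1" rule from the timeline above can be captured in a toy checker. This is a hypothetical helper for the two-value example only, not a general sequential-consistency verifier.

```python
def respects_sequential_order(reads):
    """Check that one client's reads of x never go backwards in the
    write order 0 -> 1 (toy checker for the example above)."""
    seen_new = False
    for v in reads:
        if v == 1:
            seen_new = True      # the client has observed the new value
        elif seen_new:
            return False         # saw 1, then the old value again: violation
    return True

assert respects_sequential_order([0, 0, 1, 1])   # stale reads, then fresh: allowed
assert not respects_sequential_order([1, 0])     # going back to 0: forbidden
```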
Causal consistency: operations that are causally related must be seen in the same order by all nodes, while concurrent operations (no causal relationship) may be seen in different orders.
// Causal relationship
Alice: post("Hello") // A
Bob: read(Alice's post) // B (B depends on A)
Bob: reply("Hi Alice!") // C (C depends on B, transitively on A)
Everyone must see: A → B → C
But concurrent posts from Carol can appear anywhere.
Advantage: Can be implemented without global coordination, better availability.
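One standard way to track causality is vector clocks (DDIA discusses version vectors in Chapter 5). The sketch below shows the two comparisons that matter: "happened before" (must be ordered the same everywhere) and "concurrent" (any order is fine). The clock values for Alice, Bob, and Carol are illustrative.

```python
def vclock_leq(a, b):
    """True if the event with clock `a` happened before (or equals) `b`:
    every per-node counter in `a` is <= the counter in `b`."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes)

def concurrent(a, b):
    # Neither happened before the other: replicas may apply them in any order.
    return not vclock_leq(a, b) and not vclock_leq(b, a)

post_A  = {"alice": 1}             # Alice posts "Hello"
reply_C = {"alice": 1, "bob": 1}   # Bob replies after reading A
carol_D = {"carol": 1}             # Carol posts independently

assert vclock_leq(post_A, reply_C)   # A -> C: everyone must order A first
assert concurrent(post_A, carol_D)   # A and D are concurrent: any order is fine
```

Because causal order is partial (not total), replicas only coordinate on the edges that actually exist — which is why causal consistency survives partitions better than linearizability.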
Eventual consistency is the weakest useful guarantee: if no new updates are made, all replicas eventually converge to the same value. There are no guarantees about when that happens, or about what you'll read in the meantime.
// Eventually consistent
write(x, 1) to node A
read(x) from node B → might return old value
... time passes, replication happens ...
read(x) from node B → eventually returns 1
Use cases: DNS, caches, social media feeds (where staleness is acceptable).
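The read-old-then-converge behavior above can be sketched with last-write-wins (LWW) replicas, a common (though lossy — concurrent writes are silently dropped) way to guarantee convergence. The class name and explicit timestamps are illustrative.

```python
import time

class LWWReplica:
    """Last-write-wins replica: keeps the value with the highest
    timestamp, so any two replicas that exchange state converge."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts=None):
        ts = ts if ts is not None else time.time()
        if ts > self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # Anti-entropy: adopt the peer's write if it is newer.
        self.write(other.value, other.ts)

a, b = LWWReplica(), LWWReplica()
a.write(1, ts=10)          # write(x, 1) lands on node A only
assert b.value is None     # read from node B still returns the old value
b.merge(a)                 # ... replication happens ...
assert b.value == 1        # node B eventually returns 1
```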
| Algorithm | Used By | Key Idea |
|---|---|---|
| Paxos | Chubby, Spanner | Two-phase: prepare/accept. Notoriously hard to understand. |
| Raft | etcd, Consul, CockroachDB | Leader-based. Designed for understandability. Log replication. |
| Zab | ZooKeeper | Similar to Raft but predates it. Atomic broadcast. |
| Viewstamped Replication | Academic | Predecessor to Raft. View changes for leader election. |
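To make "leader-based" concrete, here is a minimal sketch of Raft's voting rule: a node grants at most one vote per term and never votes in a stale term. Real Raft additionally checks that the candidate's log is at least as up-to-date as the voter's, which is omitted here.

```python
class RaftNode:
    """Minimal sketch of Raft's RequestVote rule (log checks omitted)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate_id):
        if term < self.current_term:
            return False                  # reject a stale candidate
        if term > self.current_term:
            self.current_term = term      # step into the newer term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id # at most one vote per term
            return True
        return False

n = RaftNode("n1")
assert n.request_vote(term=1, candidate_id="n2")       # first vote granted
assert not n.request_vote(term=1, candidate_id="n3")   # already voted this term
assert n.request_vote(term=2, candidate_id="n3")       # new term, new vote
```

The one-vote-per-term rule is what guarantees at most one leader per term: two candidates can't both collect a majority from the same voters.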
Two-phase commit (2PC) is a protocol for atomic commit across multiple nodes: either all nodes commit, or all abort.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Coordinator │ │ Participant │ │ Participant │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
│──── PREPARE ─────────►│ │
│──── PREPARE ──────────────────────────────────►
│ │ │
│◄──── YES ─────────────│ │
│◄──── YES ─────────────────────────────────────│
│ │ │
│──── COMMIT ──────────►│ │
│──── COMMIT ───────────────────────────────────►
│ │ │
▼ ▼ ▼
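The coordinator's side of the diagram can be sketched as follows. The participant objects and their `prepare`/`commit`/`abort` methods are hypothetical; timeouts and coordinator crashes (the source of 2PC's blocking behavior) are not modeled.

```python
def two_phase_commit(participants):
    """Sketch of the 2PC coordinator (no timeouts or crash recovery)."""
    # Phase 1: ask every participant to prepare and vote.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous YES -> commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    # Any NO -> abort everywhere.
    for p in participants:
        p.abort()
    return "aborted"

class Participant:
    """Toy participant that votes as configured."""
    def __init__(self, vote_yes=True):
        self.vote_yes, self.state = vote_yes, "init"
    def prepare(self): return self.vote_yes
    def commit(self): self.state = "committed"
    def abort(self): self.state = "aborted"

assert two_phase_commit([Participant(), Participant()]) == "committed"
assert two_phase_commit([Participant(), Participant(vote_yes=False)]) == "aborted"
```

The danger is between the phases: a participant that voted YES must hold its locks until it hears COMMIT or ABORT — if the coordinator crashes there, the participant is stuck ("blocking").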
Three-phase commit (3PC) adds a "pre-commit" phase to make the protocol non-blocking. But it assumes bounded delays and accurate failure detection, and doesn't work correctly under network partitions, so it's rarely used in practice.
A transaction spans multiple partitions or databases. We need all-or-nothing semantics across independent systems.
// Transfer $100 from Account A (Bank 1) to Account B (Bank 2)
BEGIN TRANSACTION
Bank1.debit(A, 100) // What if this succeeds...
Bank2.credit(B, 100) // ...but this fails?
COMMIT
| Approach | Pros | Cons |
|---|---|---|
| 2PC | Strong guarantees | Blocking; coordinator is a single point of failure |
| Saga | Non-blocking, good availability | Compensating transactions needed, eventual consistency |
| Avoid | Simple, fast | Design constraints, may not be possible |
1. debit(A, 100) compensate: credit(A, 100)
2. credit(B, 100) compensate: debit(B, 100)
If step 2 fails → run compensation 1 → credit(A, 100)
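The saga above can be sketched as a list of (action, compensation) pairs: on failure, the compensations of the steps that already succeeded run in reverse order. The in-memory accounts and the simulated Bank 2 failure are hypothetical.

```python
def run_saga(steps):
    """Sketch of a saga executor: `steps` is a list of
    (action, compensation) pairs."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):
                comp()              # undo everything that succeeded
            return "rolled back"
    return "completed"

# Hypothetical in-memory accounts standing in for the two banks:
accounts = {"A": 500, "B": 0}
def debit_A():  accounts["A"] -= 100
def credit_A(): accounts["A"] += 100
def credit_B(): raise RuntimeError("Bank 2 is down")  # step 2 fails
def debit_B():  accounts["B"] -= 100

assert run_saga([(debit_A, credit_A), (credit_B, debit_B)]) == "rolled back"
assert accounts["A"] == 500   # compensation restored A's balance
```

Note the eventual-consistency caveat from the table: between the debit and its compensation, other transactions can observe A's temporarily reduced balance — sagas trade isolation for availability.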