Distributed systems fundamentals from DDIA Chapter 9
📚 Designing Data-Intensive Applications - Martin Kleppmann

In a distributed system, nodes can fail, networks can partition, and messages can be delayed or lost. Despite this, we often need nodes to agree on something — which node is the leader, whether a transaction should commit, or what value to store.
This is harder than it sounds. The FLP impossibility result proves that no deterministic consensus algorithm can guarantee progress in an asynchronous system where even one node can crash.
Linearizability is the strongest single-object consistency model. Once a write completes, all subsequent reads must see that value or a later one: the system behaves as if there were only a single copy of the data.
// Linearizable behavior
Client 1: write(x, 1) ────────────────►
Client 2: read(x) → 1 ───►
Client 3: read(x) → 1 (must see 1, not old value)
Cost: Requires coordination. In the presence of network partitions, you must choose between availability and linearizability (CAP theorem).
Use cases: Leader election, distributed locks, unique constraints.
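A distributed lock, for example, reduces to an atomic compare-and-set on a linearizable register. The sketch below uses an in-memory stand-in for a real store such as etcd or ZooKeeper; the class and method names are illustrative, not any library's API.

```python
import threading

class LinearizableRegister:
    """In-memory stand-in for a linearizable store (e.g. etcd).
    A mutex makes each operation atomic, so every client observes
    a single, real-time-consistent copy of the value."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # Atomic: succeeds only if the current value matches `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value

# A lock is just CAS on the register: None means "unlocked".
lock = LinearizableRegister(value=None)
assert lock.compare_and_set(None, "client-1")      # client 1 acquires
assert not lock.compare_and_set(None, "client-2")  # client 2 must wait
assert lock.compare_and_set("client-1", None)      # client 1 releases
```

The CAS must be linearizable: if two clients could both see `None` and both "win", mutual exclusion is lost — which is exactly why locks and leader election sit in the linearizable column.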
Sequential consistency: all operations appear in some single sequential order that is consistent with each client's program order, but that order doesn't have to match real-time order.
// Sequentially consistent (but not linearizable)
Client 1: write(x, 1) at T=0
Client 2: read(x) → 0 at T=1 // OK! Can still see old value
Client 2: read(x) → 1 at T=2 // Once you see 1, can't go back to 0
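The "can see 0, but can't go back after seeing 1" rule from the timeline above can be captured in a toy checker. This is a hypothetical helper for the two-value example only, not a general sequential-consistency verifier.

```python
def respects_sequential_order(reads):
    """Check that one client's reads of x never go backwards in the
    write order 0 -> 1 (toy checker for the example above)."""
    seen_new = False
    for v in reads:
        if v == 1:
            seen_new = True      # the client has observed the new value
        elif seen_new:
            return False         # saw 1, then the old value again: violation
    return True

assert respects_sequential_order([0, 0, 1, 1])   # stale reads, then fresh: allowed
assert not respects_sequential_order([1, 0])     # going back to 0: forbidden
```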
Causal consistency: operations that are causally related must be seen in the same order by all nodes, while concurrent operations (no causal relationship) may be seen in different orders.
// Causal relationship
Alice: post("Hello") // A
Bob: read(Alice's post) // B (B depends on A)
Bob: reply("Hi Alice!") // C (C depends on B, transitively on A)
Everyone must see: A → B → C
But concurrent posts from Carol can appear anywhere.
Advantage: Can be implemented without global coordination, better availability.
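One standard way to track causality is vector clocks (DDIA discusses version vectors in Chapter 5). The sketch below shows the two comparisons that matter: "happened before" (must be ordered the same everywhere) and "concurrent" (any order is fine). The clock values for Alice, Bob, and Carol are illustrative.

```python
def vclock_leq(a, b):
    """True if the event with clock `a` happened before (or equals) `b`:
    every per-node counter in `a` is <= the counter in `b`."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes)

def concurrent(a, b):
    # Neither happened before the other: replicas may apply them in any order.
    return not vclock_leq(a, b) and not vclock_leq(b, a)

post_A  = {"alice": 1}             # Alice posts "Hello"
reply_C = {"alice": 1, "bob": 1}   # Bob replies after reading A
carol_D = {"carol": 1}             # Carol posts independently

assert vclock_leq(post_A, reply_C)   # A -> C: everyone must order A first
assert concurrent(post_A, carol_D)   # A and D are concurrent: any order is fine
```

Because causal order is partial (not total), replicas only coordinate on the edges that actually exist — which is why causal consistency survives partitions better than linearizability.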
Eventual consistency is the weakest useful guarantee: if no new updates are made, all replicas eventually converge to the same value. There are no guarantees about when that happens, or about what you'll read in the meantime.
// Eventually consistent
write(x, 1) to node A
read(x) from node B → might return old value
... time passes, replication happens ...
read(x) from node B → eventually returns 1
Use cases: DNS, caches, social media feeds (where staleness is acceptable).
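The read-old-then-converge behavior above can be sketched with last-write-wins (LWW) replicas, a common (though lossy — concurrent writes are silently dropped) way to guarantee convergence. The class name and explicit timestamps are illustrative.

```python
import time

class LWWReplica:
    """Last-write-wins replica: keeps the value with the highest
    timestamp, so any two replicas that exchange state converge."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts=None):
        ts = ts if ts is not None else time.time()
        if ts > self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # Anti-entropy: adopt the peer's write if it is newer.
        self.write(other.value, other.ts)

a, b = LWWReplica(), LWWReplica()
a.write(1, ts=10)          # write(x, 1) lands on node A only
assert b.value is None     # read from node B still returns the old value
b.merge(a)                 # ... replication happens ...
assert b.value == 1        # node B eventually returns 1
```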
| Algorithm | Used By | Key Idea |
|---|---|---|
| Paxos | Chubby, Spanner | Two-phase: prepare/accept. Notoriously hard to understand. |
| Raft | etcd, Consul, CockroachDB | Leader-based. Designed for understandability. Log replication. |
| Zab | ZooKeeper | Similar to Raft but predates it. Atomic broadcast. |
| Viewstamped Replication | Academic | Predecessor to Raft. View changes for leader election. |
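To make "leader-based" concrete, here is a minimal sketch of Raft's voting rule: a node grants at most one vote per term and never votes in a stale term. Real Raft additionally checks that the candidate's log is at least as up-to-date as the voter's, which is omitted here.

```python
class RaftNode:
    """Minimal sketch of Raft's RequestVote rule (log checks omitted)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate_id):
        if term < self.current_term:
            return False                  # reject a stale candidate
        if term > self.current_term:
            self.current_term = term      # step into the newer term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id # at most one vote per term
            return True
        return False

n = RaftNode("n1")
assert n.request_vote(term=1, candidate_id="n2")       # first vote granted
assert not n.request_vote(term=1, candidate_id="n3")   # already voted this term
assert n.request_vote(term=2, candidate_id="n3")       # new term, new vote
```

The one-vote-per-term rule is what guarantees at most one leader per term: two candidates can't both collect a majority from the same voters.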
Two-phase commit (2PC) is a protocol for atomic commit across multiple nodes: either all nodes commit, or all abort.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Coordinator │ │ Participant │ │ Participant │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
│──── PREPARE ─────────►│ │
│──── PREPARE ──────────────────────────────────►
│ │ │
│◄──── YES ─────────────│ │
│◄──── YES ─────────────────────────────────────│
│ │ │
│──── COMMIT ──────────►│ │
│──── COMMIT ───────────────────────────────────►
│ │ │
▼ ▼ ▼
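The coordinator's side of the diagram can be sketched as follows. The participant objects and their `prepare`/`commit`/`abort` methods are hypothetical; timeouts and coordinator crashes (the source of 2PC's blocking behavior) are not modeled.

```python
def two_phase_commit(participants):
    """Sketch of the 2PC coordinator (no timeouts or crash recovery)."""
    # Phase 1: ask every participant to prepare and vote.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous YES -> commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    # Any NO -> abort everywhere.
    for p in participants:
        p.abort()
    return "aborted"

class Participant:
    """Toy participant that votes as configured."""
    def __init__(self, vote_yes=True):
        self.vote_yes, self.state = vote_yes, "init"
    def prepare(self): return self.vote_yes
    def commit(self): self.state = "committed"
    def abort(self): self.state = "aborted"

assert two_phase_commit([Participant(), Participant()]) == "committed"
assert two_phase_commit([Participant(), Participant(vote_yes=False)]) == "aborted"
```

The danger is between the phases: a participant that voted YES must hold its locks until it hears COMMIT or ABORT — if the coordinator crashes there, the participant is stuck ("blocking").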
Three-phase commit (3PC) adds a "pre-commit" phase to make the protocol non-blocking. But it assumes bounded delays and accurate failure detection, and doesn't work correctly under network partitions, so it's rarely used in practice.
A transaction spans multiple partitions or databases. We need all-or-nothing semantics across independent systems.
// Transfer $100 from Account A (Bank 1) to Account B (Bank 2)
BEGIN TRANSACTION
Bank1.debit(A, 100) // What if this succeeds...
Bank2.credit(B, 100) // ...but this fails?
COMMIT
| Approach | Pros | Cons |
|---|---|---|
| 2PC | Strong guarantees | Blocking; coordinator is a single point of failure |
| Saga | Non-blocking, good availability | Compensating transactions needed, eventual consistency |
| Avoid | Simple, fast | Design constraints, may not be possible |
1. debit(A, 100) compensate: credit(A, 100)
2. credit(B, 100) compensate: debit(B, 100)
If step 2 fails → run compensation 1 → credit(A, 100)
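The saga above can be sketched as a list of (action, compensation) pairs: on failure, the compensations of the steps that already succeeded run in reverse order. The in-memory accounts and the simulated Bank 2 failure are hypothetical.

```python
def run_saga(steps):
    """Sketch of a saga executor: `steps` is a list of
    (action, compensation) pairs."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):
                comp()              # undo everything that succeeded
            return "rolled back"
    return "completed"

# Hypothetical in-memory accounts standing in for the two banks:
accounts = {"A": 500, "B": 0}
def debit_A():  accounts["A"] -= 100
def credit_A(): accounts["A"] += 100
def credit_B(): raise RuntimeError("Bank 2 is down")  # step 2 fails
def debit_B():  accounts["B"] -= 100

assert run_saga([(debit_A, credit_A), (credit_B, debit_B)]) == "rolled back"
assert accounts["A"] == 500   # compensation restored A's balance
```

Note the eventual-consistency caveat from the table: between the debit and its compensation, other transactions can observe A's temporarily reduced balance — sagas trade isolation for availability.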