CAP Theorem - Understanding Distributed Systems
Master the CAP theorem and learn how to make trade-offs between Consistency, Availability, and Partition Tolerance in distributed systems.
CAP Theorem - Understanding Distributed Systems
What is CAP Theorem?
The CAP theorem, also known as Brewer's theorem, states that a distributed system can only guarantee two out of three properties simultaneously:
- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request receives a response (success or failure)
- Partition Tolerance (P): System continues to operate despite network partitions
The Three Properties Explained
Consistency
User writes to Node A: value = 100
User reads from Node B: value = 100 ✓
All nodes return the same, most recent data.
Example: Banking systems must be consistent. If you transfer $100, all ATMs should immediately reflect the new balance.
Availability
Request to Node A: Response ✓
Request to Node B: Response ✓
Request to Node C: Response ✓
Every request gets a response, even if it's not the latest data.
Example: Social media feeds prioritize availability. It's okay if you see slightly outdated posts, but the app must always respond.
Partition Tolerance
Network Split:
[Node A, Node B] | [Node C, Node D]
↑ ↑
Can't communicate
System continues to function despite the split.
Example: In a global system, network issues between data centers shouldn't bring down the entire system.
Why Can't We Have All Three?
The Fundamental Trade-off
When a network partition occurs, you must choose:
-
Sacrifice Consistency (AP System)
- Continue serving requests
- Accept that nodes may have different data
- Example: DNS, Cassandra
-
Sacrifice Availability (CP System)
- Stop serving requests until consistency is restored
- Ensure all nodes have the same data
- Example: Banking systems, MongoDB (with strong consistency)
CAP Combinations
CA (Consistency + Availability)
Not realistic in distributed systems!
Why? Because network partitions are inevitable in distributed systems. You can't ignore partition tolerance.
Use Case: Single-node databases (not distributed)
- Traditional RDBMS on a single server
- No network partitions to worry about
CP (Consistency + Partition Tolerance)
Choose consistency over availability during partitions
Scenario: Network partition occurs
Decision: Stop serving requests until consistency is restored
Example Flow:
1. Network partition detected
2. System enters read-only mode or rejects writes
3. Wait for partition to heal
4. Restore consistency
5. Resume normal operations
Real-World Examples:
- MongoDB (with strong consistency settings)
- HBase
- Redis (with synchronous replication)
- Zookeeper
Use Cases:
- Financial transactions
- Inventory management
- Booking systems
- Any system where stale data is unacceptable
AP (Availability + Partition Tolerance)
Choose availability over consistency during partitions
Scenario: Network partition occurs
Decision: Continue serving requests, accept eventual consistency
Example Flow:
1. Network partition detected
2. Both sides continue serving requests
3. Data diverges temporarily
4. Partition heals
5. Conflict resolution (last-write-wins, vector clocks, etc.)
6. Eventually consistent
Real-World Examples:
- Cassandra
- DynamoDB
- Riak
- CouchDB
Use Cases:
- Social media feeds
- Shopping carts
- User profiles
- Analytics data
- Any system where availability is critical
Real-World Examples
Example 1: E-commerce Shopping Cart (AP)
Scenario: User adds items to cart during network partition
User in US → Node A (US)
User in EU → Node B (EU)
Network partition between US and EU
Decision: AP System
- Both nodes continue serving requests
- User can add items to cart
- Carts may diverge temporarily
- Merge carts when partition heals
- Better to have duplicate items than unavailable cart
Example 2: Bank Account Balance (CP)
Scenario: User tries to withdraw money during partition
User at ATM A → Node A
User at ATM B → Node B
Network partition between nodes
Decision: CP System
- Reject withdrawal requests during partition
- Prevent overdraft
- Wait for partition to heal
- Better to be unavailable than inconsistent
Example 3: Social Media Posts (AP)
Scenario: User posts during network partition
User posts → Node A
Followers read from Node B (partition)
Decision: AP System
- Post is accepted immediately
- Followers may see old feed temporarily
- Eventually, all nodes sync
- Better to show slightly stale feed than no feed
Consistency Models
Strong Consistency (CP)
Write(x=1) → Node A
Read(x) → Node B = 1 ✓
Immediate consistency across all nodes
Eventual Consistency (AP)
Write(x=1) → Node A
Read(x) → Node B = 0 (stale)
... wait ...
Read(x) → Node B = 1 ✓
Eventually consistent, but may be stale temporarily
Causal Consistency
Write(x=1) → Node A
Write(y=2, depends on x) → Node A
Read(y) → Node B = 2
Read(x) → Node B = 1 ✓
Causally related operations are ordered
How to Choose?
Choose CP When:
- ✅ Data accuracy is critical
- ✅ Financial transactions
- ✅ Inventory management
- ✅ Booking systems
- ✅ Compliance requirements
- ❌ Can tolerate temporary unavailability
Choose AP When:
- ✅ Availability is critical
- ✅ User experience matters most
- ✅ Stale data is acceptable
- ✅ Social media, feeds
- ✅ Caching layers
- ❌ Can tolerate temporary inconsistency
Beyond CAP: PACELC
PACELC extends CAP theorem:
If Partition (P):
Choose Availability (A) or Consistency (C)
Else (E):
Choose Latency (L) or Consistency (C)
Key Insight: Even without partitions, you must trade off latency vs consistency.
Examples:
- PA/EL: Cassandra (Availability during partition, Low latency normally)
- PC/EC: MongoDB (Consistency during partition, Consistency normally)
- PA/EC: DynamoDB (Availability during partition, Consistency normally)
Practical Strategies
1. Quorum-Based Replication
N = Total replicas
W = Write quorum
R = Read quorum
Strong Consistency: W + R > N
Example: N=3, W=2, R=2 → 2+2 > 3 ✓
Eventual Consistency: W + R ≤ N
Example: N=3, W=1, R=1 → 1+1 ≤ 3 ✓
2. Multi-Region Strategy
Region A (Primary)
├── Synchronous replication → Region B (Secondary)
└── Asynchronous replication → Region C (DR)
Trade-off:
- Sync: Strong consistency, higher latency
- Async: Eventual consistency, lower latency
3. Hybrid Approach
Critical Data (CP):
├── User authentication
├── Payment processing
└── Inventory counts
Non-Critical Data (AP):
├── User profiles
├── Social feeds
└── Analytics
Interview Questions
Q1: Can you have a CA system in distributed systems?
Answer: No, because network partitions are inevitable in distributed systems. You must handle partition tolerance. CA systems only exist in single-node databases.
Q2: How does Cassandra achieve high availability?
Answer:
- Peer-to-peer architecture (no single point of failure)
- Tunable consistency (choose per query)
- Hinted handoff (store writes during node failure)
- Eventually consistent by default
- Gossip protocol for node discovery
Q3: Why do banks choose CP over AP?
Answer:
- Data accuracy is critical (no overdrafts)
- Regulatory compliance requirements
- Better to be temporarily unavailable than inconsistent
- Financial loss from inconsistency > loss from unavailability
Common Misconceptions
❌ Myth 1: "CAP means you can only pick 2"
Reality: You must always have P (partition tolerance) in distributed systems. The real choice is between C and A during partitions.
❌ Myth 2: "AP systems are always inconsistent"
Reality: AP systems are eventually consistent. They converge to consistency over time.
❌ Myth 3: "CP systems are always unavailable"
Reality: CP systems are only unavailable during network partitions, which are rare.
Conclusion
Key Takeaways:
- CAP theorem is about trade-offs during network partitions
- Partition tolerance is mandatory in distributed systems
- Choose CP for critical data, AP for user experience
- Most systems use a hybrid approach
- Consider PACELC for a more complete picture
Remember: There's no one-size-fits-all solution. Choose based on your specific requirements and use cases.