This tests:
- System design depth
- Understanding of distributed systems
- Trade-off navigation (CAP, consistency, latency)
- Real-world edge case handling
Let’s go step by step and design a Redis-like cache from first principles, without relying on cloud-managed services.
🚀 Goal: Build a Redis-like Distributed In-Memory Cache
🧾 1. Requirements Gathering (Clarify with interviewer)
🔹 Functional
- Support `GET`, `SET`, `DEL`, `TTL`
- Handle concurrent reads/writes
- Cache keys across multiple nodes
- Optional: support pub/sub, richer data structures (hash, list)
🔹 Non-Functional
- Low latency (<1 ms typical)
- High availability & fault tolerance
- Scalable horizontally
- Eventual or strong consistency
- Memory-optimized with TTL eviction
Back-of-the-envelope estimation is crucial in system design interviews — it demonstrates pragmatism and your ability to roughly size a system and make sound trade-offs.
Let’s break it down for your Redis-like Distributed In-Memory Cache System:
🧠 Scenario:
Let’s say you're designing this for an AI/ML pipeline system, like Google's CMCS ML. It caches:
- Intermediate model data
- Feature store results
- Token metadata
- Configuration data
📌 Estimation Goals:
We’ll estimate for:
| What | Example |
|---|---|
| 🔹 Number of keys | e.g., 100 million |
| 🔹 Size per key | e.g., average 1 KB |
| 🔹 Total memory footprint | GB / TB scale |
| 🔹 QPS (queries per second) | Read/write traffic |
| 🔹 Node count and distribution | |
| 🔹 Network bandwidth | |
| 🔹 TTL / eviction rates | |
⚙️ Step-by-Step Estimation
🔹 1. Number of Keys
Let’s say each ML workflow (pipeline) generates:
- 10k intermediate cacheable entries
- 1M workflows per day (across all users)

10k keys/workflow × 1M workflows/day = 10B keys/day

But not all of these stay in memory. We retain 10% (the hot data) in memory:
- 10B × 10% = 1B keys cached at peak
🔹 2. Average Key Size
Let’s assume:
- Key name: ~100 bytes
- Value: ~900 bytes
- TTL/metadata: ~20 bytes overhead

Total ≈ 1 KB per key
📦 3. Total Memory Requirement
1B keys × 1 KB = 1,000,000,000 KB ≈ 1 TB

So you’d need ~1 TB of RAM across the cluster.

Let’s budget for 30% overhead (replication, GC, fragmentation):
➡️ Effective: ~1.3 TB RAM
🧵 4. QPS (Queries Per Second)
Assume:
- Each key gets ~10 reads per day → 10B reads/day
- 1% of keys get 90% of the hits (Zipfian distribution)

10B reads/day ≈ 115,740 reads/sec
Writes: 1B/day ≈ 11,500 writes/sec

Target QPS:
- Read QPS: 100K–150K
- Write QPS: 10K–20K
🧑🤝🧑 5. Number of Nodes
If 1 machine supports:
- 64 GB usable memory
- 10K QPS (to be safe)
- 10 Gbps NIC

Then:
- RAM: 1.3 TB / 64 GB ≈ 20 nodes
- QPS: 150K / 10K = 15 nodes

Plan for ~25–30 nodes (for headroom and HA)
🔁 6. Replication Overhead
Assuming:
- 1 replica per shard for failover
- 2× memory and network cost

➡️ RAM required: ~2.6 TB
➡️ Bandwidth: double the write traffic (~20K writes/sec × 1 KB ≈ 20 MB/s replication stream)
📶 7. Network Bandwidth
Let’s estimate:
- 150K reads/sec × 1 KB = 150 MB/s
- 20K writes/sec × 1 KB = 20 MB/s
- Replication = 20 MB/s

📌 Spread across ~25 nodes, each node should handle:
- Read bandwidth: ~6 MB/s
- Write + replication: ~2 MB/s
- Easily handled by a 10 Gbps NIC
⏳ 8. Eviction Rate
Assuming TTL = 1 hour and 1B keys:
- Evictions per second ≈ 1B / (60×60) ≈ 277K keys/sec

The eviction algorithm must be efficient:
- Use an LRU clock algorithm or an async TTL scanner
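To sanity-check the arithmetic above, here is a small Go snippet that recomputes the estimates from the stated assumptions (every constant is an assumption from this section, not a measured value):

```go
// Back-of-the-envelope sanity check for the estimates above.
package main

import "fmt"

func main() {
	const (
		keysPerWorkflow   = 10_000    // cacheable entries per ML workflow
		workflowsPerDay   = 1_000_000 // workflows per day across all users
		hotFraction       = 0.10      // share of keys kept resident in memory
		bytesPerKey       = 1024      // ~100 B key + ~900 B value + ~20 B metadata
		readsPerKeyPerDay = 10
		overhead          = 1.3 // replication metadata, GC, fragmentation
		secondsPerDay     = 86_400
		ttlSeconds        = 3600 // 1-hour TTL
	)

	keysPerDay := float64(keysPerWorkflow * workflowsPerDay) // 10B generated/day
	hotKeys := keysPerDay * hotFraction                      // ~1B resident keys
	ramTB := hotKeys * bytesPerKey / 1e12                    // ~1 TB raw
	fmt.Printf("resident keys: %.0f, raw RAM: %.1f TB, with overhead: %.1f TB\n",
		hotKeys, ramTB, ramTB*overhead)

	readQPS := hotKeys * readsPerKeyPerDay / secondsPerDay // ~115K reads/sec
	writeQPS := hotKeys / secondsPerDay                    // ~11.5K writes/sec
	fmt.Printf("read QPS: %.0f, write QPS: %.0f\n", readQPS, writeQPS)

	evictionsPerSec := hotKeys / ttlSeconds // ~277K keys/sec expiring
	fmt.Printf("evictions/sec: %.0f\n", evictionsPerSec)
}
```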
✅ Final Summary
| Metric | Estimation |
|---|---|
| Total keys | 1 billion |
| Avg size per key | 1 KB |
| Total RAM (w/ overhead) | ~2.6 TB (with replication) |
| Nodes | 25–30 (for HA, QPS, memory headroom) |
| Read QPS | ~150K/sec |
| Write QPS | ~15–20K/sec |
| Eviction rate | ~250–300K keys/sec |
| Network per node | ~10 MB/s total (well within a 10 Gbps NIC) |
🎯 Bonus: What Google Might Ask
What would change if you needed to support multi-tenant isolation?
→ Talk about namespacing keys, quota control, per-tenant memory buckets.
What if a single user uploads a 1GB object?
→ Chunk large values or offload to Blob storage and cache pointer.
How would you reduce memory cost?
→ TTL tuning, compression (LZ4), lazy expiration.
🧱 2. High-Level Architecture
```
              +------------------------+
              |  Client Applications   |
              +------------------------+
                          |
                          v
                 +------------------+
                 |  Coordinator /   |
                 |  Cache Router    |   (optional)
                 +------------------+
                          |
          +---------------+----------------+
          |                                |
    +-----------+                    +-----------+
    |  Cache    |   Gossip/Heartbeat |  Cache    |
    |  Node A   |<------------------>|  Node B   |
    +-----------+      Protocol      +-----------+
          |                                |
    +-----------+                    +-----------+
    | Memory DB |                    | Memory DB |
    +-----------+                    +-----------+
```
🧠 3. Core Components
🔸 a. Data Storage (In-Memory)
- Use in-memory hash maps for the key-value store
- Store a TTL with each key (for expiry-based eviction)
- Optionally support data types like list, hash, etc.

```
store = {
  "foo": { value: "bar", expiry: 1681450500 },
  ...
}
```
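A minimal Go sketch of this single-node store, assuming string values and lazy (on-read) expiration; the `Store`/`entry` names are illustrative, not Redis's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value  string
	expiry time.Time // zero value means "no TTL"
}

type Store struct {
	mu   sync.RWMutex
	data map[string]entry
}

func NewStore() *Store {
	return &Store{data: make(map[string]entry)}
}

func (s *Store) Set(key, value string, ttl time.Duration) {
	e := entry{value: value}
	if ttl > 0 {
		e.expiry = time.Now().Add(ttl)
	}
	s.mu.Lock()
	s.data[key] = e
	s.mu.Unlock()
}

func (s *Store) Get(key string) (string, bool) {
	s.mu.RLock()
	e, ok := s.data[key]
	s.mu.RUnlock()
	if !ok {
		return "", false
	}
	// Lazy expiration: treat a stale entry as a miss and delete it.
	if !e.expiry.IsZero() && time.Now().After(e.expiry) {
		s.mu.Lock()
		delete(s.data, key)
		s.mu.Unlock()
		return "", false
	}
	return e.value, true
}

func main() {
	s := NewStore()
	s.Set("foo", "bar", 2*time.Second)
	v, ok := s.Get("foo")
	fmt.Println(v, ok) // "bar true" while the TTL has not expired
}
```

A real node would pair this lazy expiration with a background TTL scanner so idle keys don't linger in memory.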
🔸 b. Shard & Partition
- Use consistent hashing to assign keys to nodes
- Each key `K` → `hash(K) % N`, where N is the number of virtual nodes (slots on the ring), not physical machines

This avoids rehashing all keys when nodes are added or removed (see the sketch below).
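Here is a minimal consistent-hash ring with virtual nodes in Go, assuming CRC32 as the hash function and 128 vNodes per physical node (both are illustrative choices):

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

type Ring struct {
	vnodes int
	hashes []uint32          // sorted vnode positions on the ring
	owner  map[uint32]string // vnode position -> physical node
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{vnodes: vnodes, owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := crc32.ChecksumIEEE([]byte(n + "#" + strconv.Itoa(i)))
			r.hashes = append(r.hashes, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Node returns the physical node owning a key: the first vnode clockwise
// from the key's hash position on the ring.
func (r *Ring) Node(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	ring := NewRing([]string{"cache-a", "cache-b", "cache-c"}, 128)
	for _, k := range []string{"user:token:123", "video:views", "feature:42"} {
		fmt.Println(k, "->", ring.Node(k))
	}
}
```

Adding or removing a node only remaps the key ranges owned by its vNodes, which is what keeps rebalancing cheap.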
🔸 c. Cache Router / Coordinator
- Clients can compute the hash themselves OR go through a proxy router that forwards to the correct cache node
- Think Twemproxy or Envoy as an L7 proxy
🔸 d. Replication
- Master-replica model
- Writes go to the master → replicated to replicas (async or sync)
- Replicas take over on master failure

```
Node A (Master)
└── Replica A1
```
🔸 e. Eviction Strategy
- Support TTL for automatic key expiry
- Support LRU / LFU / random eviction when memory is full (see the LRU sketch below)
- Track access counters for eviction ranking
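A minimal LRU eviction sketch in Go for the "memory full" case, assuming capacity is counted in entries rather than bytes (a real cache would track byte sizes and combine this with TTL expiry):

```go
package main

import (
	"container/list"
	"fmt"
)

type lruEntry struct {
	key, value string
}

type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *LRU) Get(key string) (string, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el) // record the access for eviction ranking
		return el.Value.(*lruEntry).value, true
	}
	return "", false
}

func (c *LRU) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&lruEntry{key, value})
	if c.order.Len() > c.capacity {
		// Evict the least recently used entry from the back of the list.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
}

func main() {
	cache := NewLRU(2)
	cache.Set("a", "1")
	cache.Set("b", "2")
	cache.Get("a")      // "a" is now most recently used
	cache.Set("c", "3") // evicts "b"
	_, ok := cache.Get("b")
	fmt.Println("b present after eviction:", ok) // false
}
```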
🔸 f. Concurrency
- Use read-write locks or CAS for thread-safe operations
- Alternatively, keep each node single-threaded (like Redis) with an event loop
🔁 4. Replication and Fault Tolerance
Gossip/Heartbeat
- Nodes send periodic heartbeats to detect failures
- Use Raft or gossip protocols for membership

Failover
- If a master dies → an elected replica takes over (see the sketch below)
- Redis uses Redis Sentinel to coordinate failover
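A minimal heartbeat-based failure detector sketch in Go, assuming a fixed timeout; a production cluster would use gossip (e.g., SWIM-style) or Raft-based membership rather than this simplified version:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Membership struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

// Heartbeat records that a node's periodic ping arrived.
func (m *Membership) Heartbeat(node string) {
	m.mu.Lock()
	m.lastSeen[node] = time.Now()
	m.mu.Unlock()
}

// Suspected returns nodes whose heartbeats are overdue; the coordinator
// would promote their replicas to master.
func (m *Membership) Suspected() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var down []string
	for node, t := range m.lastSeen {
		if time.Since(t) > m.timeout {
			down = append(down, node)
		}
	}
	return down
}

func main() {
	m := &Membership{lastSeen: make(map[string]time.Time), timeout: 100 * time.Millisecond}
	m.Heartbeat("node-a")
	m.Heartbeat("node-b")
	time.Sleep(150 * time.Millisecond)
	m.Heartbeat("node-b") // node-b is still alive; node-a has gone quiet
	fmt.Println("suspected down:", m.Suspected())
}
```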
🧪 5. Optional Features
| Feature | Description |
|---|---|
| Pub/Sub | Add pub/sub channels per key prefix |
| Persistence | Periodically write snapshots to disk (RDB) or append-only logs (AOF) |
| Backpressure | Queue or reject new writes when memory is full |
| Client Library | SDKs that abstract the hash ring + failover |
🔁 6. CAP Trade-off
| CAP Property | Redis Default |
|---|---|
| Consistency | Eventual (async replication) |
| Availability | High (failover supported) |
| Partition Tolerance | Yes |
📊 Metrics to Expose
- Hit/miss rate
- Memory usage
- Eviction count
- Node replication lag
- Key distribution balance
🧠 Questions Google May Ask You During This Design
- How would you handle hot keys (one key accessed by many nodes)?
- How do you rebalance keys after node failure?
- Would you support atomic operations (INCR, DECR)?
- How do you prevent cache stampede (multiple clients recomputing a missing key)?
- How do you handle large objects that exceed a single node's memory?
🛠️ Tech Stack (if you had to build it)

| Component | Tech |
|---|---|
| Cache server | C++ / Rust / Go (like Redis) |
| Hashing & comm | Consistent hashing + TCP sockets / gRPC |
| Replication | Raft or custom pub/sub protocol |
| Persistence | RocksDB / flat file / append-only logs |
| Observability | Prometheus + Grafana |
Perfect — let’s walk through each of those Staff-level system design questions, and I’ll give you solid answers with reasoning, trade-offs, and implementation suggestions. You can use these to explain your decisions clearly in the interview.
🔥 1. How would you handle hot keys (one key accessed by many nodes)?
Problem:
- One key (like `"user:token:123"`) gets disproportionate access → overloads one node.

✅ Solutions:
- Key Replication (Read Replicas):
  - Replicate the hot key across multiple nodes.
  - Client-side read load balancing (randomly pick a replica).
  - Writes go to the master → async to replicas.
- Key Sharding (Virtual Keys) (see the sketch after this list):
  - Split the key logically, e.g., `"video:views"` → `video:views:1`, `...:2`
  - Clients randomly select a shard for read/write → reduces contention.
  - Aggregate during reads (costly but effective).
- Request Deduplication & Caching at the Edge:
  - Use an edge cache (like a CDN or client-side cache) for super-hot keys.
- Rate Limiting / Backpressure:
  - Throttle requests to that key, or queue them on overload.
Interview Tip:
Emphasize dynamic detection of hot keys (via metrics), and adaptive replication or redirection.
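Here is a small Go sketch of the key-sharding (virtual keys) idea for a hot counter, assuming 8 shards and an in-process map standing in for the distributed cache:

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

const shards = 8

// shardedKey spreads writes for one hot logical key across several physical
// keys so no single node absorbs all the traffic.
func shardedKey(base string) string {
	return base + ":" + strconv.Itoa(rand.Intn(shards))
}

// readAll aggregates the sharded counters at read time
// (costlier reads, cheaper and better-distributed writes).
func readAll(store map[string]int, base string) int {
	total := 0
	for i := 0; i < shards; i++ {
		total += store[base+":"+strconv.Itoa(i)]
	}
	return total
}

func main() {
	store := map[string]int{} // stand-in for the distributed cache
	for i := 0; i < 1000; i++ {
		store[shardedKey("video:views")]++ // each increment hits a random shard
	}
	fmt.Println("total views:", readAll(store, "video:views"))
}
```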
💡 2. How do you rebalance keys after node failure?
Problem:
- Node failure → key-space imbalance.
- Some nodes overloaded, others underused.

✅ Solutions:
- Consistent Hashing + Virtual Nodes:
  - Redistribute the virtual nodes (vNodes) of the failed node to the others.
  - Only keys for those vNodes get rebalanced — minimal movement.
- Auto-Failover & Reassignment:
  - Use heartbeats to detect failure.
  - Other nodes take over the lost slots or ranges.
- Key Migration Tools:
  - Background rebalance workers move keys to even out the load.
  - Ensure write consistency during the move via locking/versioning.
- Client-Side Awareness:
  - Clients get an updated ring view and re-route requests accordingly.
Interview Tip:
Talk about graceful degradation during rebalancing and minimizing downtime.
⚙️ 3. Would you support atomic operations (INCR, DECR)?
✅ Yes — atomic operations are essential in a caching layer (e.g., counters, rate limits, tokens).
Implementation:
- Single-Threaded Execution Model:
  - Like Redis: handle each command sequentially on a single-threaded event loop → natural atomicity.
- Compare-And-Swap (CAS) (see the sketch after this section):
  - For multi-threaded or multi-process setups.
  - Use version numbers or timestamps to detect stale updates.
- Locks (Optimistic/Pessimistic):
  - Apply locks on keys for read-modify-write operations.
  - Use with caution to avoid performance degradation.
- CRDTs (Advanced Option):
  - Conflict-free data types (e.g., GCounter, PNCounter) for distributed atomicity.
Interview Tip:
Highlight that simplicity, speed, and correctness are the priority. Lean toward single-threaded per-key operation for atomicity.
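A minimal compare-and-swap sketch in Go, assuming a per-key version number and an in-process map as a stand-in for the cache node; `CASStore` and its methods are illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

type versioned struct {
	value   int64
	version uint64
}

type CASStore struct {
	mu   sync.Mutex
	data map[string]versioned
}

// Get returns the value and the version the caller must present on update.
func (s *CASStore) Get(key string) (int64, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v := s.data[key]
	return v.value, v.version
}

// CompareAndSet applies the write only if the key has not changed since the
// caller read it; otherwise the caller retries with a fresh version.
func (s *CASStore) CompareAndSet(key string, newValue int64, expectedVersion uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	v := s.data[key]
	if v.version != expectedVersion {
		return false // stale read detected
	}
	s.data[key] = versioned{value: newValue, version: v.version + 1}
	return true
}

// Incr implements an INCR-style counter as a CAS retry loop.
func (s *CASStore) Incr(key string) int64 {
	for {
		val, ver := s.Get(key)
		if s.CompareAndSet(key, val+1, ver) {
			return val + 1
		}
	}
}

func main() {
	s := &CASStore{data: make(map[string]versioned)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); s.Incr("counter") }()
	}
	wg.Wait()
	fmt.Println(s.data["counter"].value) // 100
}
```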
🧊 4. How do you prevent cache stampede (multiple clients recomputing a missing key)?
Problem:
- A TTL expires → 1000 clients query the same missing key → the backend gets hammered (a self-inflicted DDoS).

✅ Solutions:
- Lock / SingleFlight (see the sketch after this section):
  - The first client computes and sets the value.
  - Others wait for the value to be written (or reuse it from an intermediate store).
  - Go has `singleflight` (in `golang.org/x/sync`); Redis can simulate it with Lua-based locks.
- Stale-While-Revalidate (SWR):
  - Serve the expired value temporarily.
  - Refresh the cache asynchronously in the background.
- Request Coalescing at the API Gateway:
  - The gateway buffers duplicate requests until the cache is ready.
- Early Refresh Strategy:
  - Monitor popular keys.
  - Proactively refresh them before TTL expiry.
Interview Tip:
Describe this as a read-heavy resilience pattern. Emphasize proactive + reactive strategies.
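A hand-rolled single-flight sketch in Go to show the idea (the production-grade version lives in `golang.org/x/sync/singleflight`): concurrent misses for the same key share one recomputation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type call struct {
	wg  sync.WaitGroup
	val string
}

type Group struct {
	mu       sync.Mutex
	inFlight map[string]*call
}

// Do runs fn at most once per key at a time; callers that arrive while a
// computation is in flight wait for its result instead of recomputing it.
func (g *Group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if c, ok := g.inFlight[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // another caller is already recomputing this key
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	g.inFlight[key] = c
	g.mu.Unlock()

	c.val = fn() // only the first caller pays for the recomputation
	c.wg.Done()

	g.mu.Lock()
	delete(g.inFlight, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	g := &Group{inFlight: make(map[string]*call)}
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := g.Do("user:profile:42", func() string {
				time.Sleep(50 * time.Millisecond) // simulate an expensive backend read
				fmt.Println("recomputing")
				return "profile-data"
			})
			fmt.Println("got:", v)
		}()
	}
	wg.Wait()
}
```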
📦 5. How do you handle large objects that exceed single node memory?
Problem:
- A single large value (e.g., a 1 GB serialized ML model) doesn't fit within one node's memory budget.

✅ Solutions:
- Key Chunking (Manual Sharding) (see the sketch after this section):
  - Split the large value into multiple keys (`file:1`, `file:2`, `file:3`).
  - Store each chunk on a different node.
  - Reassemble during reads.
- Redirect to an Object Store:
  - If the object exceeds a threshold (X MB) → store it in a blob/file system (Azure Blob / GCS).
  - Cache only a pointer/reference to it.
- Use a Tiered Cache:
  - Store large objects in a slower but scalable tier (e.g., disk-based).
  - Keep the fast tier for hot small keys; use the slow tier for bulkier data.
- Compression:
  - Apply lightweight compression (LZ4, Snappy) before storing.
Interview Tip:
Discuss threshold-based offloading and trade-off between latency vs. capacity.
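A small Go sketch of the chunking idea, assuming a 1 MB chunk size, an in-process map standing in for the cache, and a simple "manifest" key recording the chunk count (all names are illustrative):

```go
package main

import "fmt"

const chunkSize = 1 << 20 // 1 MB per chunk

// writeChunks splits a large value into chunk keys plus a small manifest key
// recording how many chunks to reassemble on read; in a real cluster the
// chunk keys would hash to different nodes on the ring.
func writeChunks(store map[string][]byte, key string, value []byte) {
	n := 0
	for off := 0; off < len(value); off += chunkSize {
		end := off + chunkSize
		if end > len(value) {
			end = len(value)
		}
		store[fmt.Sprintf("%s:%d", key, n)] = value[off:end]
		n++
	}
	store[key+":manifest"] = []byte(fmt.Sprintf("%d", n))
}

// readChunks reassembles the value from its chunks.
func readChunks(store map[string][]byte, key string) []byte {
	var n int
	fmt.Sscanf(string(store[key+":manifest"]), "%d", &n)
	var out []byte
	for i := 0; i < n; i++ {
		out = append(out, store[fmt.Sprintf("%s:%d", key, i)]...)
	}
	return out
}

func main() {
	store := map[string][]byte{} // stand-in for the distributed cache
	large := make([]byte, 3*chunkSize+123)
	writeChunks(store, "model:weights", large)
	fmt.Println("chunks stored:", len(store)-1,
		"reassembled bytes:", len(readChunks(store, "model:weights")))
}
```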