System design interview questions

CAP Theorem System Design

CAP is a teaching tool that interviewers use to see whether you can discuss trade-offs without slogans. The useful version of the conversation is not "pick two letters" but "when the network lies, which guarantees does this product need, what do we tell the user, and how do we heal afterward?" Strong candidates name concrete models (read-your-writes, monotonic reads, causal consistency, or linearizability for a lock service) and connect them to a datastore they have actually run.

What to say about partitions

Make it grounded

Walk through a scenario: leader election fails, two nodes both believe they are the leader and keep accepting writes, and reconciliation rules must decide whose writes survive. If you have lived through split-brain-ish symptoms—even at smaller scale—describe detection (metrics, alerts) and remediation (disable writes, promote a replica, replay logs). Interviewers remember calm specifics.
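To make "detection" concrete, here is a minimal Python sketch—all names invented for illustration—of the simplest split-brain signal: more than one node claiming leadership at once, which is exactly the metric an alert can watch.

```python
# Toy detection sketch: each node reports whether it believes it is the
# leader; an alert fires when more than one claims the role at once.

def leaders(claims):
    """claims: mapping of node name -> believes it is the leader."""
    return [node for node, is_leader in claims.items() if is_leader]

def split_brain_alert(claims):
    # Healthy cluster: exactly one claimant. Split brain: two or more.
    return len(leaders(claims)) > 1

healthy = {"a": True, "b": False, "c": False}
partitioned = {"a": True, "b": True, "c": False}  # two nodes think they lead

print(split_brain_alert(healthy))      # False
print(split_brain_alert(partitioned))  # True
```

In production the "claims" come from heartbeats or a leadership gauge per node; the point is that the alert condition itself is one line.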

Loop back to the bigger picture

CAP sits next to rate limiting and load balancing in most senior loops. Review rate limiters to practice explaining user-visible degradation: sometimes the kindest thing under partition is a fast error and a clear retry policy—not silent wrong answers.
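As a hedged sketch of that "fast error and a clear retry policy" idea—the Unavailable type and retry_after_s field are invented for illustration—a service can fail fast with an explicit retry hint instead of hanging or returning silently wrong data:

```python
class Unavailable(Exception):
    """Fast, honest failure carrying an explicit retry policy."""
    def __init__(self, retry_after_s):
        super().__init__(f"temporarily unavailable, retry in {retry_after_s}s")
        self.retry_after_s = retry_after_s

def read_balance(replica_healthy: bool) -> int:
    if not replica_healthy:
        # Under partition, prefer a quick error with a retry hint over
        # a silently stale (possibly wrong) answer.
        raise Unavailable(retry_after_s=2)
    return 42  # placeholder balance for the sketch

try:
    read_balance(replica_healthy=False)
except Unavailable as err:
    print("degraded:", err)
```

The HTTP analogue is a 503 with a Retry-After header: the client learns immediately that the answer is unavailable, not wrong.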

Cosmos DB session token in plain English

Read-your-writes with a session token (conceptual .NET / REST)

// After the write succeeds, capture the session token from the response headers
var writeResp = await container.CreateItemAsync(order, pk);
var session = writeResp.Headers.Session;

// A later read from the same user/device passes the token back, so the
// replica serving it must have applied at least that write (read-your-writes)
var req = new ItemRequestOptions { SessionToken = session };
var read = await container.ReadItemAsync<Order>(id, pk, req);

Azure Cosmos DB makes consistency tangible: without the session token, a mobile user might post a comment, refresh, and see nothing—the classic "but I just saved!" CAP story. You then explain the product choice: sticky reads for UX vs global linearizability for inventory counts.
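To see what the token buys you, here is a toy single-primary, lagging-replica model in Python—purely illustrative, not the Cosmos API—where the session token is simply the log position of your last write:

```python
# Toy read-your-writes model: the session token is the primary's log
# position (LSN) after your write; a replica honoring the token must
# catch up to at least that position before answering.

class Primary:
    def __init__(self):
        self.log = []            # append-only write log

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)     # session token = log position

class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.applied = 0
        self.data = {}

    def catch_up(self, upto):
        while self.applied < upto:
            key, value = self.primary.log[self.applied]
            self.data[key] = value
            self.applied += 1

    def read(self, key, session_token=0):
        # Without a token the replica may serve stale data; with one it
        # first applies the log up to the caller's last write.
        self.catch_up(session_token)
        return self.data.get(key)

primary = Primary()
replica = Replica(primary)
token = primary.write("comment", "hello")
print(replica.read("comment"))                       # None: replica lagging
print(replica.read("comment", session_token=token))  # "hello": read-your-writes
```

The first read is the "refresh to empty" bug; the second is the sticky, session-consistent read.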

Questions with sample answers

These are interview-ready outlines—sound human by swapping in your own metrics, team names, and war stories. The examples are generic on purpose so you can map them to what you actually shipped.

  1. Primary prompt

    Compare linearizability vs sequential consistency for a distributed lock—who needs which?

    Linearizability: every operation appears to take effect instantaneously at some point between invocation and response—essential for locks and fencing tokens, where "I won the lock" must be globally agreed in real time. Sequential consistency: all nodes agree on some single order of operations, not necessarily the real-time one—sometimes enough for read-mostly coordination with looser latency needs.

  2. Primary prompt

    During a partition, our app shows stale inventory. Is that acceptable—how do you decide with product?

    Trade UX vs overselling: show "may be outdated" banner, disable checkout, or reserve pessimistically. Document RPO/RTO; finance/legal often pick "no oversell" even if listings look stale.

  3. Primary prompt

    Explain read-your-writes with a session stickiness story vs a centralized sequencer.

    Sticky routing to a replica: the user's reads follow their writes to the same replica that applied them—simple, but it breaks if that replica fails. Sequencer / global log: every read must reflect a given sequence number—stronger, but higher latency. The Cosmos session token is a concrete example of sticky read-your-writes.

  4. Primary prompt

    How does your favorite database document its consistency during failures—what did you verify in practice?

    Read vendor docs on failover (Postgres sync rep, Cosmos consistency levels, Dynamo eventual + optional strong reads). Verify with chaos: kill primary during write, observe client errors and recovery; measure read-after-write behavior your app assumed.
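Returning to question 1, fencing tokens are the standard way to show why a lock wants linearizable, globally ordered "I won" proofs: a minimal Python sketch, with all class and method names invented for illustration.

```python
# Fencing-token sketch: the lock service hands out monotonically
# increasing tokens, and the protected resource rejects any write
# carrying a token older than the newest it has seen.

import itertools

class LockService:
    _tokens = itertools.count(1)

    def acquire(self):
        return next(self._tokens)   # globally ordered "I won" proof

class Resource:
    def __init__(self):
        self.newest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.newest_token:
            raise PermissionError("stale lock holder fenced off")
        self.newest_token = token
        self.value = value

lock, resource = LockService(), Resource()
t1 = lock.acquire()          # holder 1 wins, then stalls (e.g. a long GC pause)
t2 = lock.acquire()          # lock expires; holder 2 wins
resource.write(t2, "fresh")
try:
    resource.write(t1, "stale")   # holder 1 resumes with an old token
except PermissionError as err:
    print(err)
```

Without the real-time ordering guarantee, holder 1's late write would silently clobber holder 2's—the exact failure a distributed lock exists to prevent.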

Follow-ups interviewers often ask

Expect nested "why?" questions—brief answers here; expand with your production defaults.

  1. Follow-up

    Why is latency often the hidden fourth variable in real designs?

    Strong consistency across regions costs round trips. CAP talks about partitions, but latency is the daily pain: users feel slow writes long before a partition ever happens.

  2. Follow-up

    What manual reconciliation jobs exist in systems you have run?

    Payment vs ledger mismatch, duplicate webhook delivery, inventory drift—scheduled jobs with alerts, dead-letter queues, human review for large diffs.

  3. Follow-up

    How do you test split-brain mitigations without staging a real partition?

    Toxiproxy/network policy blackholes, chaos monkey blocking AZ, integration tests with injected clock skew; game days with runbook validation.

  4. Follow-up

    When would you choose a CP system for metadata but AP for content delivery?

    Metadata (who owns file) must not fork—CP. Static assets can serve stale from edge—AP with TTL; users rarely notice slightly old image.

  5. Follow-up

    How do you communicate trade-offs to non-engineering stakeholders without jargon?

    Use outcomes: "faster everywhere vs never wrong on stock count"; dollars and support tickets; diagrams with one sentence each; offer two options with risks.
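Follow-up 2's reconciliation jobs reduce to a sketch like the following: diff two systems of record, auto-fix small discrepancies, and route large ones to human review. The amounts and threshold below are invented for illustration.

```python
# Reconciliation sketch: compare payment records against the ledger,
# queue small diffs for automatic repair and large diffs for review.

payments = {"ord-1": 100, "ord-2": 250, "ord-3": 75}
ledger   = {"ord-1": 100, "ord-2": 200}   # drifted entry + missing entry

def reconcile(a, b, review_threshold=60):
    auto_fix, needs_review = [], []
    for order in a.keys() | b.keys():
        diff = a.get(order, 0) - b.get(order, 0)
        if diff == 0:
            continue
        # Large discrepancies go to a human; small ones a job can repair.
        (needs_review if abs(diff) > review_threshold else auto_fix).append((order, diff))
    return auto_fix, needs_review

auto_fix, needs_review = reconcile(payments, ledger)
print(auto_fix)       # small diffs, safe to repair automatically
print(needs_review)   # large diffs: page a human
```

In practice the inputs are nightly exports or event streams, and the "needs review" list lands in a dead-letter queue with an alert attached.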