System design interview questions

Rate Limiter System Design

Rate limiting is where product policy meets distributed systems. You are not only choosing an algorithm—you are deciding who gets protected when the system is stressed: honest users, noisy neighbors, scrapers, or your own buggy client release. Interviewers want to see you separate measurement (how many requests in a window) from enforcement (HTTP 429, queueing, shedding) and from observability (metrics that prove the limiter is not the thing melting first).

Algorithms you should be able to compare

Fixed window counter (cheapest, but bursty at window boundaries), sliding window log (exact, memory-hungry), sliding window counter (approximate smoothing at modest cost), token bucket (permits configurable bursts up to capacity), and leaky bucket (smooths output to a constant drain rate). Pick based on what the product tolerates, not on algorithmic elegance.

Distributed reality checks

Clock skew, partial failures, and hot keys on a shared counter are not edge cases—they are Tuesday. Mention how you would shard limits per user or tenant, how you synchronize across regions if you must, and what happens when Redis hiccups: fail open vs fail closed is a product decision; show you know both have casualties.

STAR without forcing a whiteboard into a novel

Thirty seconds on a real incident beats five minutes of buzzwords: a launch where mobile retries amplified traffic, a partner integration you throttled, a botnet you identified via fingerprinting. Tie actions to metrics—429 rate, error budget burn, support tickets—and close with what you would automate next. Continue with load balancing to connect edge policy with how traffic enters your fleet.

Token bucket you can explain to finance

Tiny token bucket (Python) — same math as many gateways

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # refill rate, tokens per second
        self.burst = burst            # bucket capacity = max burst size
        self.tokens = burst           # start full so clients get an initial burst
        self.last = time.monotonic()  # monotonic clock is immune to NTP jumps

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill lazily: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1          # spend one token per accepted request
            return True
        return False                  # out of tokens: caller returns 429

In Azure you usually buy this via API Management or edge rules, but interviewers love when you can sketch the refill curve. Real incident: partner retries after 429 accidentally synchronize; you add jitter and backoff—policy plus code, not magic.

Questions with sample answers

These are interview-ready outlines—sound human by swapping in your own metrics, team names, and war stories. The examples are generic on purpose so you can map them to what you actually shipped.

  1. Primary prompt

    Design per-API-key limits plus a global cap per region so one customer cannot starve others.

    Two counters (or token buckets): limit:key:{id} and limit:region:{r}:global. Check both before accepting; decrement both atomically (a Lua script in Redis, so no request slips between the two checks). Return 429 with Retry-After when either trips.

    Example: key allows 1k/min but region pool 100k/min—noisy neighbor hits key limit first; flash crowd hits region first.

  2. Primary prompt

    Compare token bucket vs sliding window counter for mobile clients that batch offline requests.

    Token bucket: absorbs offline burst when app syncs—friendly UX if product accepts short spikes. Sliding window: stricter fairness, fewer surprises at window boundaries; may need more Redis memory or approximate algorithms (e.g. fixed window + small correction).

  3. Primary prompt

    How do you enforce limits across multiple edge POPs without perfect clocks?

    Eventually consistent counters with CRDT-style merge, a central Redis cluster with sub-ms latency, or tolerating a slight overcount (a product call). Logical clocks help with ordering, not rate measurement; the practical answer is a centralized store or gossiped per-POP budgets with slack built in.

  4. Primary prompt

    What metrics and dashboards prove your rate limiter is not the top source of 5xx errors?

    Track 429 rate vs 5xx, limiter latency p99, Redis errors, compare to origin health. Alert when 5xx correlates with Redis timeouts, not 429 spikes. Dashboard: accepted vs rejected per tenant.
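
The dual-counter check from question 1 can be sketched in memory. This is a minimal sketch with a hypothetical `DualLimiter` class and fixed windows; in production the same logic runs as one Redis Lua script so both checks and both decrements happen atomically.

```python
import time
from typing import Optional

class DualLimiter:
    """In-memory stand-in for the two-counter check (per-key + regional pool)."""

    def __init__(self, key_limit: int, region_limit: int, window_sec: float = 60.0):
        self.key_limit = key_limit        # per-API-key budget per window
        self.region_limit = region_limit  # shared regional pool per window
        self.window = window_sec
        self.counts = {}                  # counter name -> (window_start, count)

    def _current(self, name: str, now: float):
        start, count = self.counts.get(name, (now, 0))
        if now - start >= self.window:    # fixed window expired: reset
            return now, 0
        return start, count

    def allow(self, api_key: str, region: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        key_name = f"limit:key:{api_key}"
        region_name = f"limit:region:{region}:global"
        # Peek both counters first so a rejection mutates neither.
        for name, limit in ((key_name, self.key_limit), (region_name, self.region_limit)):
            _, count = self._current(name, now)
            if count >= limit:
                return False              # caller responds 429 + Retry-After
        # Both have room: consume from each.
        for name in (key_name, region_name):
            start, count = self._current(name, now)
            self.counts[name] = (start, count + 1)
        return True
```

The peek-then-consume order matters: a request rejected by the regional pool must not have already burned part of the key's budget.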
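
For question 2, the sliding window counter can be sketched with the common approximation of the current fixed window plus a weighted share of the previous one; this is one approximation among several, not the only implementation.

```python
import time

class SlidingWindowCounter:
    """Current fixed window plus a weighted slice of the previous one,
    which smooths the boundary spike a plain fixed window allows."""

    def __init__(self, limit: int, window_sec: float = 60.0):
        self.limit = limit
        self.window = window_sec
        self.curr_start = None
        self.curr = 0   # count in the current fixed window
        self.prev = 0   # count in the previous fixed window

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.curr_start is None:
            self.curr_start = now
        # Roll windows forward as time passes (handles long idle gaps too).
        while now - self.curr_start >= self.window:
            self.curr_start += self.window
            self.prev, self.curr = self.curr, 0
        # Weight the previous window by how much of it still overlaps
        # the trailing window that ends at `now`.
        overlap = 1.0 - (now - self.curr_start) / self.window
        estimate = self.curr + self.prev * overlap
        if estimate >= self.limit:
            return False
        self.curr += 1
        return True
```

Two integers per key instead of a full request log is why this variant is the usual Redis-friendly compromise.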
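
For question 3, CRDT-style merge can be as simple as a grow-only counter: each POP increments only its own slot, and merge takes the elementwise max, so gossiped counts converge without synchronized clocks. A sketch (hypothetical `GCounter`; over-admission between gossip rounds is the accepted slack):

```python
class GCounter:
    """Grow-only counter: per-POP slots, merge = elementwise max."""

    def __init__(self, pop_id: str):
        self.pop_id = pop_id
        self.slots = {pop_id: 0}

    def increment(self, n: int = 1):
        # Each POP only ever writes its own slot, so merges never conflict.
        self.slots[self.pop_id] = self.slots.get(self.pop_id, 0) + n

    def merge(self, other: "GCounter"):
        # Commutative, associative, idempotent: gossip in any order converges.
        for pop, count in other.slots.items():
            self.slots[pop] = max(self.slots.get(pop, 0), count)

    def value(self) -> int:
        return sum(self.slots.values())
```

The limiter compares `value()` against the budget; between gossip rounds each POP may admit slightly more than the global cap, which is exactly the overcount the product has to sign off on.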

Follow-ups interviewers often ask

Expect nested "why?" questions—brief answers here; expand with your production defaults.

  1. Follow-up

    Fail open vs fail closed during Redis downtime—which do you pick for payments vs analytics?

    Payments: fail closed (reject) to prevent unbounded spend, or degrade to a strict local token bucket if budgets were pre-provisioned per node. Analytics: fail open with sampling so the product keeps moving; log the exposure window.

  2. Follow-up

    How do you prevent synchronized retries from creating a thundering herd after a 429 storm?

    Exponential backoff + full jitter; Retry-After from server; client randomization; circuit breaker on client; cap max concurrency.

  3. Follow-up

    What is your story for burst traffic from a legitimate marketing campaign?

    Pre-negotiated quota increase, separate campaign API key with higher bucket, queue non-critical work, scale origin—communicate with marketing before launch.

  4. Follow-up

    How do you test fairness when tenants have wildly different traffic shapes?

    Load tests with synthetic tenants; verify p99 latency per tenant under contention; inject Redis latency (chaos testing); assert a small tenant is not starved by a whale, adding weighted fair queueing if needed.

  5. Follow-up

    Where do you store counters and why not in the application memory of a single node?

    Multi-node fleets need shared state (Redis, Memcached, Dynamo); otherwise each node has only a partial view and the same user hits different limits depending on routing. Sticky sessions reduce the inconsistency but don't solve global fairness.
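
Follow-up 2's fix fits in one function. A sketch, assuming illustrative `base` and `cap` values; tune both to your SLOs.

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  retry_after: Optional[float] = None) -> float:
    """Exponential backoff with full jitter. If the server sent Retry-After,
    honor it as a floor so clients don't hammer before the limiter resets."""
    exp = min(cap, base * (2 ** attempt))
    delay = random.uniform(0, exp)   # full jitter desynchronizes clients
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay
```

Full jitter (uniform over the whole window, not just a small perturbation) is what breaks the synchronized retry waves after a 429 storm.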
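
A toy harness for follow-up 4's fairness check, assuming per-tenant fixed-window counters; swap in your real limiter and trace.

```python
from collections import defaultdict

def simulate(requests, per_tenant_limit):
    """Replay an interleaved request trace against per-tenant counters
    and report accepted requests per tenant."""
    counts = defaultdict(int)
    accepted = defaultdict(int)
    for tenant in requests:
        if counts[tenant] < per_tenant_limit:
            counts[tenant] += 1
            accepted[tenant] += 1
    return accepted

# Interleave a whale (10,000 requests) with a small tenant (5 requests).
trace = []
for i in range(10_000):
    trace.append("whale")
    if i % 2_000 == 0:
        trace.append("small")

accepted = simulate(trace, per_tenant_limit=100)
assert accepted["small"] == 5     # small tenant fully served
assert accepted["whale"] == 100   # whale capped, not starving others
```

The assertion to defend in the interview is the first one: under contention, the small tenant's acceptance rate should be independent of the whale's traffic shape.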