Rate Limiter System Design
Rate limiting is where product policy meets distributed systems. You are not only choosing an algorithm—you are deciding who gets protected when the system is stressed: honest users, noisy neighbors, scrapers, or your own buggy client release. Interviewers want to see you separate measurement (how many requests in a window) from enforcement (HTTP 429, queueing, shedding) and from observability (metrics that prove the limiter is not the thing melting first).
Algorithms you should be able to compare
- Token bucket: smooth bursts, intuitive for "credits refill at r per second." Good story for APIs that tolerate short spikes.
- Fixed window: simple counters, but boundary effects can double traffic at the seam—say that out loud before the interviewer does.
- Sliding window log or counter approximation: better fairness, more memory or coordination cost; discuss Redis sorted sets, approximate structures, or hybrid approaches.
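The sliding window counter approximation in the last bullet can be sketched as follows. This is the common interpolation scheme—weight the previous fixed window by how much of it still overlaps the sliding window—not any specific gateway's implementation:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: blend the previous fixed window's
    count, weighted by its remaining overlap, with the current count."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window = window_sec
        self.prev_count = 0
        self.curr_count = 0
        self.curr_start = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll windows forward; if more than one full window has
            # passed, the previous window is empty.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start = now - (elapsed % self.window)
            elapsed = now - self.curr_start
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - elapsed / self.window
        estimate = self.prev_count * overlap + self.curr_count
        if estimate < self.limit:
            self.curr_count += 1
            return True
        return False
```

Two integers per key instead of a full request log—that is the memory argument against the sorted-set approach, at the cost of assuming requests were evenly spread across the previous window.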
Distributed reality checks
Clock skew, partial failures, and hot keys on a shared counter are not edge cases—they are Tuesday. Mention how you would shard limits per user or tenant, how you synchronize across regions if you must, and what happens when Redis hiccups: fail open vs fail closed is a product decision; show you know both have casualties.
STAR without forcing a whiteboard into a novel
Thirty seconds on a real incident beats five minutes of buzzwords: a launch where mobile retries amplified traffic, a partner integration you throttled, a botnet you identified via fingerprinting. Tie actions to metrics—429 rate, error budget burn, support tickets—and close with what you would automate next. Continue with load balancing to connect edge policy with how traffic enters your fleet.
Token bucket you can explain to finance
Tiny token bucket (Python) — same math as many gateways
```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # bucket capacity (max burst size)
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In Azure you usually buy this via API Management or edge rules, but interviewers love it when you can sketch the refill curve. Real incident: partner retries after a 429 accidentally synchronize; you add jitter and backoff. Policy plus code, not magic.
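The jitter-and-backoff fix from that incident can be sketched in a few lines. This is the standard full-jitter scheme—sleep a uniform random amount up to an exponential cap—not any particular client library's API:

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)].
    Randomizing over the full range de-synchronizes clients that all saw the same 429."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

A well-behaved client also treats a server-supplied `Retry-After` as a lower bound on the delay.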
Questions with sample answers
These are interview-ready outlines—sound human by swapping in your own metrics, team names, and war stories. The examples are generic on purpose so you can map them to what you actually shipped.
Primary prompt
Design per-API-key limits plus a global cap per region so one customer cannot starve others.
Two counters (or token buckets): `limit:key:{id}` and `limit:region:{r}:global`. Check both before accepting; decrement atomically (Lua script in Redis). Return 429 with `Retry-After` when either trips. Example: a key allows 1k/min but the region pool is 100k/min; a noisy neighbor hits the key limit first, a flash crowd hits the region limit first.
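The dual-check logic can be demoed in-process with a sketch like this; in production the check-and-decrement pair would live in a Redis Lua script so both counters move atomically. The class and field names are illustrative, not from any particular gateway:

```python
class DualLimiter:
    """Accept a request only when both the per-key budget and the shared
    regional pool have capacity, decrementing both together."""

    def __init__(self, per_key_limit: int, region_limit: int):
        self.per_key_limit = per_key_limit
        self.region_remaining = region_limit
        self.key_remaining = {}  # api key -> remaining budget this window

    def allow(self, api_key: str) -> bool:
        key_left = self.key_remaining.setdefault(api_key, self.per_key_limit)
        if key_left < 1 or self.region_remaining < 1:
            return False  # caller returns 429 with Retry-After
        self.key_remaining[api_key] = key_left - 1
        self.region_remaining -= 1
        return True
```

A noisy neighbor exhausts its own key budget first; a flash crowd spread across many keys drains the regional pool first.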
Primary prompt
Compare token bucket vs sliding window counter for mobile clients that batch offline requests.
Token bucket: absorbs offline burst when app syncs—friendly UX if product accepts short spikes. Sliding window: stricter fairness, fewer surprises at window boundaries; may need more Redis memory or approximate algorithms (e.g. fixed window + small correction).
Primary prompt
How do you enforce limits across multiple edge POPs without perfect clocks?
Eventually consistent counters with CRDT-style merges, a central Redis cluster with sub-millisecond latency, or tolerating a slight overcount (a product call). Logical clocks help more with ordering than with rate; the practical answer is a centralized store, or gossiped budgets with slack built in.
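A minimal sketch of the CRDT-style option: each POP counts locally and peers merge by taking the per-POP max (a grow-only G-counter), so every POP converges on the same global total without synchronized clocks. The POP names are illustrative:

```python
class GCounter:
    """Grow-only counter: one slot per POP, merged by per-slot max.
    Merge is commutative and idempotent, so gossip order doesn't matter."""

    def __init__(self, pop_id: str):
        self.pop_id = pop_id
        self.counts = {pop_id: 0}

    def increment(self, n: int = 1) -> None:
        self.counts[self.pop_id] = self.counts.get(self.pop_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for pop, count in other.counts.items():
            self.counts[pop] = max(self.counts.get(pop, 0), count)

    def total(self) -> int:
        return sum(self.counts.values())
```

Each POP enforces against `total()` plus some slack, accepting a brief overcount between gossip rounds—exactly the product trade-off named above.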
Primary prompt
What metrics and dashboards prove your rate limiter is not the top source of 5xx errors?
Track 429 rate vs 5xx, limiter latency p99, Redis errors, compare to origin health. Alert when 5xx correlates with Redis timeouts, not 429 spikes. Dashboard: accepted vs rejected per tenant.
Follow-ups interviewers often ask
Expect nested "why?" questions—brief answers here; expand with your production defaults.
Follow-up
Fail open vs fail closed during Redis downtime—which do you pick for payments vs analytics?
Payments: fail closed (reject) to prevent unbounded spend—or degrade to strict local token if pre-provisioned. Analytics: fail open with sampling so product keeps moving; log exposure.
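The payments-vs-analytics split can be captured in a small wrapper that picks a fallback policy per workload when the shared store errors. The `RedisDown` exception and the callable interface are illustrative assumptions, not a real client API:

```python
class RedisDown(Exception):
    """Stands in for a Redis timeout or connection error."""

class FallbackLimiter:
    def __init__(self, check_remote, fail_open: bool, local_budget: int = 0):
        self.check_remote = check_remote  # callable -> bool; may raise RedisDown
        self.fail_open = fail_open
        self.local_budget = local_budget  # strict pre-provisioned local tokens

    def allow(self) -> bool:
        try:
            return self.check_remote()
        except RedisDown:
            if self.fail_open:
                return True  # analytics: keep product moving, log the exposure
            if self.local_budget > 0:
                self.local_budget -= 1  # payments: degrade to a strict local budget
                return True
            return False  # payments with no local budget left: fail closed
```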
Follow-up
How do you prevent synchronized retries from creating a thundering herd after a 429 storm?
Exponential backoff with full jitter; honor `Retry-After` from the server; client-side randomization; a circuit breaker on the client; cap max concurrency.
Follow-up
What is your story for burst traffic from a legitimate marketing campaign?
Pre-negotiated quota increase, separate campaign API key with higher bucket, queue non-critical work, scale origin—communicate with marketing before launch.
Follow-up
How do you test fairness when tenants have wildly different traffic shapes?
Load tests with synthetic tenants; verify p99 latency per tenant under contention; chaos Redis delay; assert small tenant not starved by whale using weighted fair queueing if needed.
Follow-up
Where do you store counters and why not in the application memory of a single node?
Multi-node fleets need shared state—Redis/Memcached/Dynamo—otherwise each node has a partial view and users hit different limits depending on which node serves them. Sticky sessions reduce the inconsistency but don't solve global fairness.