Scenario: Design a Scalable URL Shortener

Question: 

Imagine you are tasked with designing a system similar to Bitly that converts long URLs into short ones. The system should handle billions of URLs and millions of requests per second. Please explain how you would design this system.

Requirements:

Functional requirements: URL shortening, redirection, and URL expiry.

Non-functional requirements: high availability and fault tolerance (an AP system in CAP terms).

Step 1: High-Level Design

At a high level, the system will have the following components:

  1. Frontend Service: Handles user requests for shortening, redirection, and URL expiry.

  2. Backend Service: Processes requests, generates short URLs, manages expiration policies, and stores mappings.

  3. Database: Stores the short-to-long URL mappings.

  4. Cache: Speeds up redirection for frequently accessed URLs.

  5. Load Balancer: Distributes incoming traffic evenly across backend servers to handle high availability and fault tolerance.

Step 2: Capacity Planning

Now, let's expand on capacity planning for this system:

  1. Storage:

    • Assume 1 billion URLs in the system.

    • Average size for a record (short URL + long URL + metadata like expiry date) = 150 bytes.

    • Total storage required: 1 billion x 150 bytes = ~150 GB.

    • With 3x replication for fault tolerance, total storage: ~450 GB.

  2. Traffic:

    • Peak traffic: assume 10,000 redirection requests/sec for this worked example; the same arithmetic scales linearly for higher targets.

    • Each server can handle 1,000 requests/sec, so you'll need 10 servers at peak load.

    • Cache hit ratio: Assume 80% of requests hit the cache (Redis).

    • Only 20% of requests (2,000/sec) hit the database.

  3. Cache Size:

    • Frequently accessed URLs (~10% of all URLs): 100 million URLs.

    • Average size of a cached record: 150 bytes.

    • Total cache size: ~15 GB (comfortably within a single Redis node's memory).

  4. Bandwidth:

    • Each redirection involves ~500 bytes of data transfer (request + response).

    • For 10,000 requests/sec: 500 bytes x 10,000 = ~5 MB/sec bandwidth requirement.
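
If you want to sanity-check these back-of-the-envelope numbers, the arithmetic is easy to reproduce in code. Below is a minimal C# sketch (top-level statements, .NET 6 or later) that recomputes the figures above; every constant is an assumption taken from this section, not a measurement.

// Back-of-the-envelope capacity check; all constants are the assumptions above.
const long totalUrls = 1_000_000_000;      // 1 billion URLs
const int bytesPerRecord = 150;            // short URL + long URL + metadata
const int replicationFactor = 3;

double storageGb = (double)totalUrls * bytesPerRecord / 1_000_000_000;   // ~150 GB
double replicatedGb = storageGb * replicationFactor;                     // ~450 GB

const int peakRps = 10_000;                // peak redirection requests/sec
const int rpsPerServer = 1_000;
int serversNeeded = peakRps / rpsPerServer;                              // 10 servers

const double cacheHitRatio = 0.80;
double dbRps = peakRps * (1 - cacheHitRatio);                            // 2,000 req/sec

long cachedUrls = totalUrls / 10;          // the "hot" 10% of URLs
double cacheGb = (double)cachedUrls * bytesPerRecord / 1_000_000_000;    // ~15 GB

const int bytesPerRedirect = 500;          // request + response
double bandwidthMbps = (double)peakRps * bytesPerRedirect / 1_000_000;   // ~5 MB/sec

Console.WriteLine($"Storage: {storageGb:N0} GB (x{replicationFactor} replicas = {replicatedGb:N0} GB)");
Console.WriteLine($"Servers at peak: {serversNeeded}; database load: {dbRps:N0} req/sec");
Console.WriteLine($"Cache: {cacheGb:N0} GB; bandwidth: {bandwidthMbps:N0} MB/sec");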

Step 3: Detailed Design

  1. Frontend:

    • Simple UI/API for creating short URLs and redirecting to original URLs.

    • API design:

      • POST /shorten: Accepts a long URL and returns a short URL.

      • GET /redirect/<short-url>: Redirects to the original URL.

  2. Backend:

    • URL Shortening:

      • Generate unique short URLs using Base62 encoding or a random hash (one approach is sketched after this list).

      • Ensure collision resistance by checking the database for duplicates.

    • URL Redirection:

      • Look up the long URL in the cache first; if it is not found, fetch it from the database (see the sketch after this list).

    • Expiry Management:

      • Use a background job to periodically clean expired URLs from the database.

  3. Database:

    • Use a NoSQL database like Cassandra or DynamoDB for scalability.

    • Key-Value schema:

      • Key: Short URL.

      • Value: Original URL + metadata (creation time, expiry time).

    • Partitioning: Shard data based on the hash of the short URL.

  4. Cache:

    • Use Redis for caching frequently accessed URLs.

    • Implement TTL (Time-to-Live) to automatically remove expired cache entries.

  5. Load Balancer:

    • Use a load balancer (e.g., Nginx or AWS ELB) to distribute traffic across backend servers.
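
To make the frontend API and the backend shortening/redirect flow concrete, here is a minimal C# sketch using ASP.NET Core minimal APIs (.NET 6 or later, Microsoft.NET.Sdk.Web project). The in-memory dictionaries are stand-ins for Redis and the database, and a sequential counter plus Base62 encoding is one simple way to get collision-free codes; treat this as an illustration of the flow, not a production implementation.

using System.Collections.Concurrent;
using System.Text;
using System.Threading;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// In-memory stand-ins for the real cache (Redis) and database.
var cache = new ConcurrentDictionary<string, string>();
var database = new ConcurrentDictionary<string, string>();
long nextId = 0;

// Base62-encode a unique numeric ID; unique IDs avoid collisions by construction.
const string alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
string Encode(long id)
{
    var sb = new StringBuilder();
    do { sb.Insert(0, alphabet[(int)(id % 62)]); id /= 62; } while (id > 0);
    return sb.ToString();
}

// POST /shorten: accepts a long URL and returns its short form.
app.MapPost("/shorten", (UrlRequest req) =>
{
    string code = Encode(Interlocked.Increment(ref nextId));
    database[code] = req.LongUrl;                 // persist the mapping
    return Results.Ok(new { shortUrl = $"/redirect/{code}" });
});

// GET /redirect/<short-url>: cache-first lookup, database on a miss, 404 otherwise.
app.MapGet("/redirect/{code}", (string code) =>
{
    if (cache.TryGetValue(code, out var cached)) return Results.Redirect(cached);
    if (!database.TryGetValue(code, out var longUrl)) return Results.NotFound();
    cache[code] = longUrl;                        // populate the cache for the next hit
    return Results.Redirect(longUrl);
});

app.Run();

record UrlRequest(string LongUrl);

Because each code here comes from a unique counter value, the database duplicate check is only needed if you switch to random hashes instead.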

Step 4: Handling Edge Cases

  • Hash Collisions:

    • Handle collisions by appending random characters to the short URL.

  • Expired URLs:

    • Redirect users to an error page if the URL has expired.

  • Invalid URLs:

    • Validate input URLs before storing them (see the sketch after this list).

  • High Traffic Spikes:

    • Scale horizontally by adding more backend and cache servers.
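
For the input-validation point above, .NET's built-in Uri.TryCreate covers the basic checks. A minimal sketch; restricting schemes to http/https is an assumption you may want to adjust:

using System;

public static class UrlValidator
{
    // Accept only absolute http/https URLs; reject anything else before storing.
    public static bool IsValid(string input) =>
        Uri.TryCreate(input, UriKind.Absolute, out var uri)
        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
}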

Step 5: CAP Theorem and Non-Functional Requirements

  • Strong consistency (C) is relaxed because we prioritize availability (A) and partition tolerance (P).

  • During a network partition, a newly created short URL may not replicate to every region immediately, but existing redirections continue to work.

Final Diagram

Here’s a simple architecture for your system:

Client -> Load Balancer -> Backend Service -> Cache (Redis) -> Database
                                 |
                                 +-> URL Expiry Job (cleans expired URLs in the database)

This design ensures scalability, fault tolerance, and high availability. Feel free to dive deeper into any component or ask about the trade-offs in the design!


Trade-Off Example: NoSQL vs. Relational Database

Context:

  • In the design, we opted for a NoSQL database (e.g., Cassandra or DynamoDB) instead of a relational database like PostgreSQL or MySQL.

Why We Chose NoSQL:

  • Scalability: NoSQL databases scale horizontally; they can store billions of records and absorb massive traffic by distributing data across many servers.

  • Write Performance: although redirects make the workload read-heavy overall, the system must still sustain a high rate of inserts for new short-to-long URL mappings, and NoSQL databases are optimized for high-throughput writes.

Trade-Off:

  1. Consistency vs. Scalability (CAP Theorem):

    • By using a NoSQL database, we prioritize availability (A) and partition tolerance (P) but sacrifice strong consistency (C). This means:

      • Short URL redirections may not immediately reflect updates if replicas are still syncing, but the system stays highly available.

      • For example, in the rare case of a database partition, a newly shortened URL might temporarily fail to resolve for a subset of users.

  2. Flexible Queries:

    • NoSQL databases are optimized for key-value lookups (e.g., finding the long URL from a short one).

    • However, if the system later needs advanced queries (e.g., analytics: "show all URLs created in the last 7 days"), a relational database might be better suited.

    • This trade-off means we prioritize simplicity and performance for the current use case while limiting flexibility for future feature expansion.

Trade-Off Example: Cache vs. Database

Context:

  • In the design, we opted for a Redis cache to store frequently accessed URLs and reduce latency.

Why We Chose Redis Cache:

  • Speed: Redis operates in-memory, enabling near-instantaneous lookups compared to database queries.

  • Load Reduction: Redirect requests are served from the cache, offloading pressure from the database.

  • TTL: Redis supports Time-to-Live (TTL), allowing expired URLs to drop out of the cache automatically without database intervention (see the sketch at the end of this section).

Trade-Off:

  1. Cache Hit vs. Miss:

    • Hit: When the short URL is found in the cache, the lookup is fast.

    • Miss: If the URL is not in the cache, the system falls back to querying the database, which is slower.

    • Example: If the cache hit ratio drops to 50% due to infrequently accessed URLs, latency increases, and the database may face higher load.

  2. Memory Usage vs. Scalability:

    • Redis stores all data in memory, which is expensive compared to disk storage.

    • Example: If we want to cache 1 billion URLs (about 150 GB), the cost of high-memory servers for Redis becomes a concern.

    • Trade-off: We limit caching to the most frequently accessed URLs (~10% of all URLs).

  3. Consistency vs. Performance:

    • If updates are made directly to the database (e.g., URL expiry or analytics tracking), the cache may hold stale data temporarily until refreshed.

    • Trade-off: Sacrifice real-time consistency to prioritize performance for redirection requests.
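
To ground the TTL and staleness points above, here is a small sketch using the StackExchange.Redis client (assumes the StackExchange.Redis NuGet package and a Redis instance at localhost:6379; the key and URL are made up for illustration):

using System;
using StackExchange.Redis;

class RedisTtlDemo
{
    static void Main()
    {
        // Assumption: a local Redis instance; replace with your real endpoint.
        ConnectionMultiplexer redis = ConnectionMultiplexer.Connect("localhost:6379");
        IDatabase db = redis.GetDatabase();

        // Cache the mapping with a TTL so the entry expires automatically.
        db.StringSet("short:abc123", "https://example.com/some/very/long/path",
                     expiry: TimeSpan.FromHours(24));

        // Redirect path: a cache hit never touches the database.
        RedisValue longUrl = db.StringGet("short:abc123");
        Console.WriteLine(longUrl.HasValue ? (string)longUrl : "cache miss");

        // If the record changes in the database (e.g., early expiry), delete the
        // cache entry so readers are not stuck with stale data until the TTL fires.
        db.KeyDelete("short:abc123");
    }
}

This is the same trade-off as item 3: deleting on write keeps readers closer to fresh data, at the cost of an extra cache miss.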

Trade-Off Example: Failure Recovery Mechanisms

Context:

To ensure high availability and fault tolerance, the system should recover gracefully when components fail (e.g., a server crash or cache failure). We incorporated replication and fallback strategies in the design.

Mechanisms for Recovery:

  1. Database Replication:

    • Multiple copies (replicas) of the database ensure availability even if one server fails.

    • Trade-Off:

      • Benefit: High availability and low risk of data loss.

      • Cost: Increased storage needs and replication overhead. If data needs to replicate across multiple nodes, write latency may increase.

      • Example: Updating a short URL mapping might take milliseconds longer due to replica sync delays.

  2. Cache Fallback to Database:

    • If the Redis cache goes down, the system queries the database directly.

    • Trade-Off:

      • Benefit: Ensures continuity of service for redirection requests.

      • Cost: Database will experience increased load during cache outages, resulting in higher latency and potential bottlenecks under peak traffic.

      • Example: During a cache failure, redirection latency might increase from 1ms to 10ms.

  3. Load Balancers with Failover:

    • Load balancers redirect traffic from failed servers to healthy servers.

    • Trade-Off:

      • Benefit: Users don’t notice server outages as requests are rerouted.

      • Cost: Adding failover capabilities increases infrastructure complexity and cost.

      • Example: Keeping additional standby servers might increase operational costs by roughly 20%.

  4. Backups for Disaster Recovery:

    • Regular backups of database and metadata ensure recovery in case of catastrophic failures (e.g., data corruption).

    • Trade-Off:

      • Benefit: Prevents permanent data loss and ensures the system is recoverable.

      • Cost: Backup systems require extra storage and may not include real-time data due to backup frequency.

      • Example: If backups occur daily, URLs created just before failure might be lost.

  5. Retry Logic and Circuit Breakers:

    • Implement retries for transient failures and circuit breakers to avoid overwhelming downstream services (a minimal sketch follows this list).

    • Trade-Off:

      • Benefit: Improves reliability for users during intermittent failures.

      • Cost: Retries add latency and may temporarily strain the system.

      • Example: If the database is slow, retry logic might delay redirections by a few milliseconds.
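
A hand-rolled sketch of both ideas in C#; the retry count, backoff schedule, and open-circuit duration are illustrative assumptions, and a production system would more likely reach for a library such as Polly:

using System;
using System.Threading;

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _failures;
    private DateTime _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public T Execute<T>(Func<T> action, int maxRetries = 3)
    {
        // While the circuit is open, fail fast instead of hammering a sick dependency.
        if (_failures >= _failureThreshold && DateTime.UtcNow - _openedAt < _openDuration)
            throw new InvalidOperationException("Circuit open: failing fast.");

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                T result = action();
                _failures = 0;            // a success closes the circuit again
                return result;
            }
            catch when (attempt <= maxRetries)
            {
                // Exponential backoff before retrying: 100 ms, 200 ms, 400 ms, ...
                Thread.Sleep(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt - 1)));
            }
            catch
            {
                // Retries exhausted: count the failure and possibly open the circuit.
                if (++_failures >= _failureThreshold) _openedAt = DateTime.UtcNow;
                throw;
            }
        }
    }
}

Usage might look like: var breaker = new CircuitBreaker(5, TimeSpan.FromSeconds(30)); then wrap database calls as breaker.Execute(() => LookupUrl(code)), where LookupUrl stands in for whatever data-access method you have (hypothetical here).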
