System Design I — TinyURL

TinyUrl is a URL shortening service. In this service, user can 1) get a short alias of a long URL. 2) retrieve the original Url with the shorten alias.

One example of this service is https://tinyurl.com/.

As a system design question, how do we approach it?

Requirements

What is the customers? Anyone with urls.

What is the service?

  1. convert a URL to a short one.
  2. retrieve the original url given the shortened alias.
  3. the shorten url expires after 3 years.

Scalability: How many daily visits? The tinyurl has 40M daily visit. As an interview question, assume the daily visit is 40M.

API Design

Use Apache thrift

service TinyURL {
string makeTinyUrl(string url);
string retrieveOriginalUrl(string tinyUrl);
}

Database Design

Should we use SQL or NoSQL database? NoSQL. Because we NoSQL provide the interface to store and load key-value pairs and NoSQL scales better than SQL. We can use Cassandra database

CREATE TABLE tinyurl_to_originalurl(
tiny_url varchar,
original_url varchar(2048),
creation_timestamp int,
PRIMARY KEY (tiny_url)
);

Algorithm

How do we shorten urls?

Option 1: hash-based method

  1. Hash (MD5 or SHA256) of the url?
  2. encode the hash with base62 [A-Za-z0–9]
  3. take the first k character of the hash as ${hash_str}
  4. The tiny url is https://tinyurl.com/${hash_str}

The k is determined by the number of unique urls. Given the requirement, it is 40M * 365 * 3 = 43.8 Billion.

k=5 letters can represent 62⁵=916.1M strings

k=6 letters can represent 62⁶=56.8 Billion strings

Thus we set k = 6.

However, hash collision is possible.

Option 2: counter-based method

To avoid collision, we can use an auto-increase ID encoded by base62 as the shorten URL. In a distributed environment, we leverage ZooKeeper to implement the auto-increase. We divide the int space into segments as 0–999999, 1000000–1999999, … and stored it in Zookeeper. The host claims a unused range from Zookeeper and marks the range used, when it is added to the tinyurl service or runs out of the current range. However, this option is not secure as hackers can access the shorten urls not generated by or shared with them.

Option 3: key generation service

A better solution is delegating the collision avoidance work to Key Generation Service.

System Architecture

DB

Now we can estimate the storage space. The longest URL is 2KB, and our shorten url is 26B, so one row is at most 2048 + 26 + 4 = 2078B. We can have at most 43.8 Billion, so the storage needed is 91TB. We can partition the database into n shards (hosts), and store the key in the hash(key) MOD n host.

Server and Cache

We can also estimate the QPS, write QPS 40M / day = 462. For read, assume we will retrieve a url 10 times by average, the read QPS is 4620. We can use distributed cache e.g. memcached, to save resource, assume the cache hit rate is 80%, the read QPS requiring DB access 4620 * 0.2 = 924 QPS. Assume the max QPS of a host is 200. We need 7 hosts.

System architecture

Reference

https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR

https://en.wikipedia.org/wiki/TinyURL

https://tinyurl.com/

https://medium.com/@narengowda/url-shortener-system-design-3db520939a1c

https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm

Software Engineer@Facebook

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store