Note

It is recommended to read Part-7 first.

Introduction

We have seen vertical and horizontal scaling of our web servers. What about our database server? Let’s say we have a database server with 24TB of RAM (memory). This is huge, but with millions of customers visiting your site every day, it might eventually not be enough.

Horizontal Scaling – Sharding

Sharding is a technique for storing data in several smaller databases (called shards) instead of one giant database, in such a way that the schema of the data is the same across all the shards. Each shard acts as an independent database server. This is also called horizontal scaling of the database. It is very similar to the scale-out (horizontal scaling) technique for web servers. See here.

The key terms to learn about sharding are –

  • Sharding Key – also known as the ‘partition key’, it is simply a value that uniquely identifies a shard. It is generally a string or an integer.
  • Sharding Algorithm – When a request to insert, update or retrieve data is raised, how does the web server know which shard to access? This is determined by a sharding algorithm, which takes the request and computes the corresponding shard key to identify the shard.

Example

Let’s take an example – We want to store customer data. The following is the schema of the customer table.

Let’s say we have 5 shards. Each shard will have the same schema, and let’s say our sharding algorithm is customer_id % 5 (modulo 5). This means our shard key will be calculated as –

shard_key = customer_id % 5

  • Insert Operation – Let’s say a new customer comes in. Depending on the logic for generating a new customer_id (let’s say auto-increment by 1), the new customer_id is 126. This record will therefore be stored in the shard with shard_key = 1 (126 % 5 = 1).
  • Retrieve Operation – If we query for Bob (with customer_id = 125), we can find out which shard Bob’s data is stored in using the sharding algorithm (125 % 5 = 0). It is the shard with shard_key = 0.
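
The insert and retrieve operations above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each shard is modelled as an in-memory dictionary rather than a real database server, and the function names (`shard_key`, `insert_customer`, `get_customer`) and the customer name "Alice" are made up for this example.

```python
NUM_SHARDS = 5  # the five shards from the example

# Each shard is modelled as a dict: customer_id -> record.
# In a real system each entry would be a separate database server.
shards = [{} for _ in range(NUM_SHARDS)]


def shard_key(customer_id: int) -> int:
    """Sharding algorithm from the example: customer_id modulo 5."""
    return customer_id % NUM_SHARDS


def insert_customer(customer_id: int, name: str) -> int:
    """Store the record in the shard chosen by the algorithm; return its key."""
    key = shard_key(customer_id)
    shards[key][customer_id] = {"customer_id": customer_id, "name": name}
    return key


def get_customer(customer_id: int):
    """Look only in the one shard that can hold this customer_id."""
    return shards[shard_key(customer_id)].get(customer_id)


print(insert_customer(125, "Bob"))    # 0
print(insert_customer(126, "Alice"))  # 1 ("Alice" is a hypothetical customer)
print(get_customer(125))
```

Note that a lookup touches exactly one shard: the web server never has to search the other four.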

Benefits of Sharding

  • Scalability – The most obvious benefit of sharding is scalability. We can keep adding more shards to store more data.
  • Prevention of SPOF – With sharding we avoid a single point of failure. If one huge database server crashes, the entire system is impacted. If one shard is down, only the data present in that shard becomes inaccessible. We can address this with replicas: we can create replica shards for redundancy.
  • Cost effective – Adding one shard is less expensive than buying a bigger database server with 2X or 4X the capacity.
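
To make the SPOF point concrete, here is a rough sketch of what happens when one shard goes down, with and without a replica. It assumes a deliberately simplified setup that is not described in this article: one replica per shard, written at the same time as the primary, and a `down` set that simulates a failed primary. All names here are hypothetical.

```python
NUM_SHARDS = 5

# Hypothetical layout: every shard has a primary and one replica copy.
primaries = [{} for _ in range(NUM_SHARDS)]
replicas = [{} for _ in range(NUM_SHARDS)]
down = set()  # shard keys whose primary is currently unavailable


def write(customer_id, record):
    key = customer_id % NUM_SHARDS
    primaries[key][customer_id] = record
    replicas[key][customer_id] = record  # copied immediately, for simplicity


def read(customer_id, use_replica=True):
    key = customer_id % NUM_SHARDS
    if key not in down:
        return primaries[key].get(customer_id)
    if use_replica:
        return replicas[key].get(customer_id)  # fall back to the replica
    raise ConnectionError(f"shard {key} is down")


write(125, "Bob")    # lands in shard 0
write(126, "Alice")  # lands in shard 1
down.add(0)          # simulate shard 0 crashing

print(read(126))     # Alice: shard 1 is unaffected
print(read(125))     # Bob: served from the replica of shard 0
try:
    read(125, use_replica=False)
except ConnectionError as exc:
    print(exc)       # without a replica, only shard 0's data is unreachable
```

Real replication involves lag, failover and consistency trade-offs that this sketch leaves out.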

Sharding is the best solution! Not really.

Sharding solves many database problems, but it brings new problems along with it. Some of them are –

  • Re-sharding – The above example is simple, but handling data in real life is complex. If we add one more shard to our existing design, how will the sharding algorithm work? How will it take the new shard into account? In the above example, the modulo 5 algorithm only works for 5 shards; what if we add a 6th shard? If we change the sharding algorithm, existing records will no longer be found where the algorithm expects them, because data and shard keys are tightly coupled. Finding a sharding algorithm that can handle re-sharding well is not an easy task.
  • Hotspot Key Problem – Let’s say you are storing data for all movies of all time in shards, and the top 10 most popular movies (Harry Potter, Pirates of the Caribbean, etc.) end up in a single shard. Most requests will now arrive at this shard, exhausting it. This is also called the Celebrity Problem. The sharding algorithm should distribute the data uniformly across the shards and must account for such hotspot keys. The Consistent Hashing technique can help here by spreading keys more evenly across shards.
  • Joins and Normalisation – Let’s say you have two tables – customers and orders. If you want to show all orders placed ‘today’ by customers from the country ‘India’, along with the combined data, you would normally use ‘joins’. But if the orders and customers data are in different shards, joining tables across shards is a problem. We need additional operations and logic to handle such cases.
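
The re-sharding problem is easy to see with a little arithmetic. The sketch below counts how many customer_ids would map to a different shard if we switched from modulo 5 to modulo 6. The function name `moved_ids` and the sample range of ids 1 to 30 are assumptions chosen for illustration.

```python
def moved_ids(customer_ids, old_shards, new_shards):
    """Return the ids whose shard changes when the modulo base changes."""
    return [
        cid for cid in customer_ids
        if cid % old_shards != cid % new_shards
    ]


ids = range(1, 31)  # customer_id 1..30, a hypothetical sample
moved = moved_ids(ids, old_shards=5, new_shards=6)

print(len(moved), "of", len(ids), "ids change shard")  # 25 of 30
print(125 % 5, 125 % 6)  # Bob moves from shard 0 to shard 5
```

In this sample, 25 of the 30 records would have to be physically moved to another shard just to add one more, which is why a plain modulo algorithm handles re-sharding poorly.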

Finally, with sharding, our design now looks like this –

Thanks for reading!
