Before we discuss sharding, let's talk about data partitioning:
Data partitioning is a technique to break up a database into many smaller parts. It is the process of splitting up a database or a table across multiple machines to improve the manageability, performance, and availability of a database.
There are many different ways one could use to decide how to break up an application database into multiple smaller DBs. Below are two of the most popular methods used by various large-scale applications:
Horizontal Partitioning (or Sharding)
In this strategy, we split the table data horizontally based on the range of values defined by the partition key. It is also referred to as database sharding.
Vertical Partitioning
In vertical partitioning, we partition the data vertically based on columns. We divide tables into relatively smaller tables with few elements, and each part is present in a separate partition.
In this tutorial, we will specifically focus on sharding.
Sharding is a database architecture pattern related to horizontal partitioning, which is the practice of separating one table's rows into multiple different tables, known as partitions or shards. Each partition has the same schema and columns, but also a subset of the shared data. Likewise, the data held in each is unique and independent of the data held in other partitions.
sharding
The justification for data sharding is that, after a certain point, it is cheaper and more feasible to scale horizontally by adding more machines than to scale it vertically by adding powerful servers. Sharding can be implemented at both application or the database level.
There are a large number of criteria available for data partitioning. Some most commonly used criteria are:
This strategy divides the rows into different partitions based on a hashing algorithm rather than grouping database rows based on continuous indexes.
The disadvantage of this method is that dynamically adding/removing database servers becomes expensive.
In list-based partitioning, each partition is defined and selected based on the list of values on a column rather than a set of contiguous ranges of values.
Range partitioning maps data to various partitions based on ranges of values of the partitioning key. In other words, we partition the table in such a way that each partition contains rows within a given range defined by the partition key.
Ranges should be contiguous but not overlapping, where each range specifies a non-inclusive lower and upper bound for a partition. Any partitioning key values equal to or higher than the upper bound of the range are added to the next partition.
As the name suggests, composite partitioning partitions the data based on two or more partitioning techniques. Here we first partition the data using one technique, and then each partition is further subdivided into sub-partitions using the same or some other method.
But why do we need sharding? Here are some advantages:
Here are some reasons why sharding might be the right choice: