Tenets of Cassandra
- Distributed storage system to store structured data
- Horizontally Scalable – Spread across multiple servers
- No Single Point of Failure
- Highly Available
- Highly Scalable
- Not a relational DB.
- Architecture designed to provide high write throughput without sacrificing read efficiency.
Facts
- Cassandra was developed at Facebook by Avinash Lakshman and Prashant Malik.
- It was designed to fulfil the storage needs of the Inbox Search problem.
- Avinash Lakshman also co-authored Dynamo (not DynamoDB). Dynamo became the precursor to DynamoDB.
Data Model
- Cassandra is NOT a relational database.
- It has multiple components that form the data model –
- Column
- Row
- Column Family
- Super Column
- Keyspace
- Let’s understand each of them with a diagram –
Understanding the diagram inside-out
- Column: The smallest piece of data, stored as a key-value pair. The column key is the name of the field we want to store and the column value is the actual value stored for that field, e.g. { firstName → John }.
- Row: A collection of columns. Multiple columns together form a row, and each row is identified by a row key, e.g. [123] → [ { firstName → John }, { lastName → Doe } ]. In this example 123 is the row key, which fetches the columns firstName and lastName.
- Column Family: All rows together form a Column Family.
- Keyspace: Multiple Column Families form a keyspace.
The analogy with a relational DB looks like this –
| Cassandra | Relational DB |
| --- | --- |
| Column Key | Field |
| Column Value | Value |
| Column | Cell |
| Row | Row |
| Column Family | Table |
| Keyspace | Database |
Cassandra is schema-less, unlike a relational DB. Let’s see a real-life example of what a data model looks like in Cassandra.
The easiest analogy is storing the data in a multi-dimensional map. The above data can be seen as –
- Map<RowKey, SortedMap<ColumnKey, ColumnValue>> users
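To make the multi-dimensional-map view concrete, here is a minimal sketch in Python. The row key, column names, and values come from the example above; the nested-dict structure is only an illustration, not how Cassandra actually stores data on disk.

```python
# Sketch of the "map of sorted maps" view of the Cassandra data model.
# Outer key -> row key
# Inner map -> column key -> column value, kept sorted by column key
users = {
    "123": dict(sorted({
        "firstName": "John",
        "lastName": "Doe",
    }.items())),
}

# Fetching one column of one row:
print(users["123"]["firstName"])  # -> John
```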
If multiple columns are almost always fetched together and mostly written together, they can be grouped into a super column. For example, a user's name is usually needed as a whole (firstName + lastName), so these two columns can be grouped to form a super column.
- Be careful with super columns: accessing any sub-column retrieves all the columns under the super column, which consumes extra memory.
- The row key in a table is a string with no size restrictions, although typically 16 to 36 bytes long.
- Applications can specify the sort order of columns within a column family or a super column. The system allows columns to be sorted either by name or by time.
- Time sorting of columns is exploited by applications like Inbox Search, where the results are always displayed in time-sorted order (see the sketch below).
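As a rough illustration of time-sorted columns (the data is purely hypothetical, and this is not Cassandra's actual storage format):

```python
# Sketch: a row whose column keys are write timestamps, so the row can be
# read back newest-first, the way an Inbox-Search-style application wants.
inbox_row = {
    1700000300: "message id 42",
    1700000100: "message id 17",
    1700000200: "message id 29",
}

# Display results in time-sorted order, newest first.
for ts in sorted(inbox_row, reverse=True):
    print(ts, inbox_row[ts])
```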
Cassandra storage hierarchy
Cassandra is a distributed data storage system, so it stores data across multiple servers. These are physical servers and are called nodes. Each node typically has a disk storage capacity of 3 TB to 5 TB.
Node
The physical database server where Cassandra stores and retrieves data.
Racks
Nodes are located in a data center, and inside a data center they are logically grouped into racks. Note that a rack is a logical grouping, not an actual physical distinction between nodes.
Data Centers
Nodes are physically housed in data centers. These data centers span multiple geographical regions around the world.
Cluster
So far we know that data is stored on nodes, but nodes also need to communicate with each other, since the data is partitioned and replication has to take place between nodes. Multiple nodes therefore form a group that interacts to serve reads and writes. This group is called a cluster. Even though a Cassandra cluster is a logical grouping, the nodes are also physically connected, sometimes across data centers.
To summarise, this is how the storage hierarchy looks.
- Notice how all the nodes are in a data center.
- Each node is a part of a rack but it is shown as a dotted line meaning it is a logical grouping.
- There are two data centers shown – one in the US and another in Australia.
- See how a cluster can be formed across data centers.
Partitioning
Cassandra is built for scale, so it must be able to partition data dynamically over the set of nodes in a cluster.
Arrangement of Nodes
Data is stored on nodes, and Cassandra uses consistent hashing to assign every row or data item to the corresponding node(s). The consistent hash function takes the key of the data item and produces a value (also called a token). Across all keys, the hash function’s output covers a range of tokens. This token range has a minimum and a maximum value, and the range is logically wrapped around to form a fixed circular ring. See below –
Now, each node in the system is randomly assigned a value which specifies the position of that node on the ring. See below –
In the above image, the hash function’s range is between 0 and 500. Nodes 1, 2, and 3 are assigned random positions on the ring:
- Node 1 → 432
- Node 2 → 268
- Node 3 → 129
Assignment of data in the Node
As we know, each row has a row key, which is a string, and this key is used to determine the row's position on the ring and hence the node that stores it. Since this key is used to partition the data across the cluster, it is also called the partition key. The partition key is passed to the consistent hash function and a token is returned. This token corresponds to a position on the ring. We take this position and walk clockwise on the ring, and whichever node we meet first is the node the data is stored in.
See example below –
We are storing rankings of all cities around the world. City is the partition key.
A new entry for Paris needs to be stored. The following steps determine which node will store the entry:
- The partition key is picked to determine the node in which the row will be stored.
- The partition key is passed into the hash function to get the token.
- For Paris, the hash function returned 246 as the token.
- This token value corresponds to a position on the ring.
- We take this value and walk clockwise on the ring; whichever node we encounter first is the node where the row is stored.
- We find Node 2, which will store the row with the partition key Paris.
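Here is a minimal consistent-hashing sketch of the steps above, in Python. The 0-500 token range, the node positions, and the token 246 for 'Paris' come from the example; the hash function below is only a stand-in (Cassandra's real partitioner is different), so the Paris token is passed in directly to reproduce the walk-clockwise step.

```python
import bisect
import hashlib

RING_SIZE = 500  # token range 0..500, as in the example above

# Node positions on the ring, taken from the example.
ring = {129: "Node 3", 268: "Node 2", 432: "Node 1"}
tokens = sorted(ring)

def token_for(partition_key: str) -> int:
    # Stand-in hash that maps a key onto the fixed token range.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % (RING_SIZE + 1)

def node_for_token(token: int) -> str:
    # Walk clockwise: the first node position >= token, wrapping around the ring.
    idx = bisect.bisect_left(tokens, token) % len(tokens)
    return ring[tokens[idx]]

# Reproducing the worked example: token 246 lands on Node 2.
print(node_for_token(246))                 # -> Node 2
print(node_for_token(token_for("Paris")))  # depends on the stand-in hash
```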
Problems with consistent hashing
Cassandra uses consistent hashing to partition the data across the nodes of a cluster. Consistent hashing copes well with the addition and removal of nodes: when a node is added to or removed from the ring, only the neighbouring/adjacent nodes are impacted, so the number of requests that have to be rerouted is kept to a minimum.
However, there are 2 main problems with consistent hashing –
- Random placement of nodes on the ring leads to unequal distribution of load across the nodes.
- The basic algorithm does not take the performance of individual nodes into consideration, which would otherwise help in making an informed decision about where to store the data.
There are two ways to resolve these issues –
- Assign each node multiple positions on the ring (illustrated in the sketch after this list).
- Analyse the load distribution on the ring and move the lightly loaded nodes to alleviate (reduce the load of) the heavily loaded nodes.
Cassandra uses the second option.
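The first option (giving each physical node several positions on the ring, often called virtual nodes) is the easier of the two to show in a few lines, so here is a rough sketch of it; note that, as stated above, Cassandra went with the second option. The number of virtual nodes, the token space, and the city keys below are all made up for illustration.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2**32    # larger token space than the 0-500 example, to avoid position collisions
VNODES_PER_NODE = 8  # more positions per node -> smoother load spread

def token_for(label: str) -> int:
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % RING_SIZE

# Give each physical node several positions ("virtual nodes") on the ring.
ring = {}
for node in ("Node 1", "Node 2", "Node 3"):
    for i in range(VNODES_PER_NODE):
        ring[token_for(f"{node}#{i}")] = node
tokens = sorted(ring)

def node_for(key: str) -> str:
    idx = bisect.bisect_left(tokens, token_for(key)) % len(tokens)
    return ring[tokens[idx]]

# Rough check of how evenly 10,000 keys spread across the physical nodes.
print(Counter(node_for(f"city-{i}") for i in range(10_000)))
```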
Key points to note –
- Each node has a position on the ring, but it is responsible for serving a whole range of token values (the tokens between the previous node’s position and its own).
- Row key and the partition key are the same thing. The key in the table is called the partition key because this key is used to determine the partition (node) in which the row is stored.
- Token is just a logical value used to determine the node in which the data is stored; it is NOT used to store the data inside the node.
System Architecture
- Nodes are P2P (peer-to-peer) which means there is no master or follower and each node can do everything that other nodes can do.
- Nodes are logically grouped over a circular ring and each node is assigned a token range. Data is partitioned and stored on these nodes and consistent hashing is used to route requests to the correct node.
- Every read or write request carries the partition key, which is used to determine the node to which the request is routed.
- A hash function converts the partition key into a token, and using that token we determine the node to which the request should be routed.
- Each node is assigned a range of tokens. For a given token, we move clockwise on the circular ring, and the next node in that direction is the node to which the request is routed. The node that handles the request is called the coordinator node.
- As per the above example, Node 2 is the coordinator node.
Replication
We have seen how Cassandra scales by simply adding nodes: more nodes means more storage. With consistent hashing, we have also seen how partitioning happens in Cassandra. Another of the tenets is no single point of failure, and this is achieved by replication. If one of the nodes goes down, all the data stored on that node would otherwise be lost; Cassandra uses replication to avoid this.
Every data item (row) is not stored on only one node; it is replicated to multiple nodes. The number of nodes that hold a copy of each piece of data is called the replication factor. If the replication factor (N) is 2, two copies of the data exist in the cluster, i.e. two nodes hold the data item. The coordinator node stores each data item that falls within its range locally and, in addition, replicates it to N-1 other nodes in the ring. See the example below –
In the above example, Node 2 is the coordinator node. When the request to save the record arrives, the coordinator first stores the data itself and then replicates the row to the two nodes next to it on the ring (Node 4 and Node 1). In total, 3 copies of the ‘Paris’ record are stored.
If Node 2 goes down, Node 4 and Node 1 still have the data.
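Extending the earlier ring sketch, replica selection under the simplest successor-based policy (the ‘Rack Unaware’ one described below) can be sketched as the coordinator plus its N-1 clockwise successors. Node 4’s position on the ring is assumed here purely so the output matches the example.

```python
import bisect

# Ring positions from the partitioning example, plus an assumed position for Node 4.
ring = {129: "Node 3", 268: "Node 2", 350: "Node 4", 432: "Node 1"}
tokens = sorted(ring)
REPLICATION_FACTOR = 3

def replicas_for_token(token: int) -> list:
    # The coordinator is the first node clockwise from the token;
    # the remaining N-1 replicas are its clockwise successors.
    start = bisect.bisect_left(tokens, token) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]]
            for i in range(REPLICATION_FACTOR)]

# Token 246 ('Paris' in the example): coordinator Node 2, then Node 4 and Node 1.
print(replicas_for_token(246))  # -> ['Node 2', 'Node 4', 'Node 1']
```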
Replication Strategy
Cassandra uses multiple replication policies to replicate the data –
- Rack Unaware
- Rack Aware (within a data center)
- Data Center Aware
The example above uses the ‘Rack Unaware’ policy, in which the coordinator node simply chooses the next successor nodes on the ring as replicas. For the ‘Rack Aware’ and ‘Data Center Aware’ policies, the algorithm is more complex.
Cassandra elects a leader amongst its nodes using ‘Zookeeper’. A node joining the cluster contacts the leader, which tells it the ranges of replicas it is responsible for, and the leader tries to ensure that no node is responsible for more than N-1 ranges in the ring (where N is the replication factor). The metadata about the ranges a node is responsible for is cached locally at each node and also stored in Zookeeper, so if a node goes down and comes back up, it knows which ranges it was maintaining. Each node is thus aware of the ranges every other node is responsible for.
Consistency in Cassandra (Quorum)
Cassandra guarantees eventual consistency (also configurable) and provides tunable consistency levels for reads and writes. One such consistency level is Quorum, which ensures that a majority of replicas respond to a request.
When a write occurs for a row in a node, the replicas are supposed to be updated based on the replication factor (RF). If a write to some replicas fails, Cassandra ensures eventual consistency through mechanisms like hinted handoff or repair. During a read, the request is sent to a subset of replica nodes based on the specified consistency level (e.g., QUORUM).
The replica nodes may return different responses, and Cassandra resolves this by selecting the most recent value based on the timestamp of each replica. For example, if RF = 4 and the quorum = 3, it means the data is stored in 4 nodes, and a quorum of 3 nodes must respond to satisfy the read or write request. If the quorum is not met for a request with consistency level = QUORUM, Cassandra returns an error to the user. For writes, if enough replicas respond successfully to satisfy the specified consistency level (e.g., quorum), the write is considered successful.
Other tunable consistency levels in Cassandra are ALL and ONE.
- ALL: Read and write requests are considered successful only when all replicas for the given data send successful responses. This ensures strong consistency, as it guarantees that all replicas have the same data before responding to the client. However, this level comes with higher latency and lower availability, as even a single replica failure can result in a request failure.
- ONE: Read and write requests are considered successful when any one replica sends a successful response. This ensures weak consistency, as other replicas may not yet have the latest data. This level prioritises low latency and high availability but sacrifices consistency.
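A minimal sketch (not Cassandra’s implementation) of how these consistency levels translate into the number of replica acknowledgements required, and of resolving divergent replica responses by picking the most recent timestamp:

```python
from dataclasses import dataclass

@dataclass
class ReplicaResponse:
    value: str
    timestamp: int  # write timestamp, used to pick the most recent value

def required_acks(consistency: str, rf: int) -> int:
    # How many replica responses each level needs (rf = replication factor).
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[consistency]

def read(responses: list, consistency: str, rf: int) -> str:
    if len(responses) < required_acks(consistency, rf):
        raise RuntimeError("not enough replicas responded for this consistency level")
    # Replicas may disagree; the value with the latest timestamp wins.
    return max(responses, key=lambda r: r.timestamp).value

rf = 4
print(required_acks("QUORUM", rf))  # -> 3, matching the RF = 4 example above
print(read([ReplicaResponse("v1", 10), ReplicaResponse("v2", 12), ReplicaResponse("v2", 12)],
           "QUORUM", rf))           # -> v2
```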
Reads and Writes in Cassandra
For writes, the system routes the requests to the replicas and waits for a quorum of replicas to acknowledge the completion of the writes.
For reads, based on the consistency level selected by the client, the system either routes the request to the closest replica or routes it to all replicas and waits for a quorum of responses.
See you in the next chapter.
References
https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
https://cse.iitd.ac.in/~srsarangi/distbook/docs/cassandra.pdf
https://cassandra.apache.org/doc/latest/cassandra/developing/data-modeling/intro.html
Read more on
Consistent Hashing – https://techshshila.in/all-you-need-to-know-about-consistent-hashing-with-code-implementation/