The Trouble With Cassandra: Why It's a Poor Choice for a Metadata Database for Object Stores
Cassandra is a popular, tried-and-true NoSQL database that supports key-value, wide-column tables. Like any powerful tool, Cassandra has its ideal use cases - in particular, it excels at write-heavy workloads while having limitations with read-heavy workloads. Cassandra's eventual consistency model and its lack of transactions and of multi-table features such as joins and subqueries can also limit its usefulness.
However, using Cassandra as a metadata database for an object storage system introduces significant complexity, resulting in data integrity and performance issues at scale - particularly if you want to use your object store as primary storage. Object storage's metadata needs are far simpler than, and different from, what Cassandra is built for.
Because the implications of employing Cassandra as an object storage metadata database were not properly understood, many object storage vendors made it a foundational part of their architecture - unfortunately, that choice keeps them from ever moving past simple archival workloads into the modern workloads that define the future of object storage (AI/ML, analytics, web/mobile applications).
Let’s explore why in a little more detail.
- Cassandra was never designed to manage file or object storage metadata, and it is predictably weak in this regard. It is not ACID compliant. It does not have the guarantees needed to prevent partially successful writes, duplicates, contradictory records and the like. Cassandra does not support joins or foreign keys, and consequently does not offer consistency in the ACID sense. Further, there is no capacity to roll back transactions in the event of a failure.
While Cassandra supports atomicity and isolation at the row level, it trades transactional isolation and atomicity for high availability and fast write performance.
- Cassandra is categorized as an AP system under the CAP theorem, meaning it trades consistency for availability and partition tolerance. When employing Cassandra as a metadata database for an object store, you can be either fast or consistent, but not both at the same time.
Cassandra's tunable consistency is a compromise, not a feature. Any setting weaker than QUORUM or ALL means you are at risk of reading stale data. That setting has to be applied to both read and write operations, and it has to be kept in step with the object data operations performed outside the database (a minimal sketch of QUORUM reads and writes follows this list).
In the object storage world, the implication is that you are good for archival use cases (write once, read very infrequently) or you choose a different architecture.
- Similar to the consistency problem, the durability guarantee is also a tradeoff between performance and correctness. The storage engine's commit log defaults to syncing periodically, every 10 seconds. This means you can lose up to 10 seconds' worth of the latest updates in the event of a power failure. The only reasonable way to make Cassandra durable is to use the synchronous batch-mode committer, which comes with a performance penalty.
- Cassandra's high-availability guarantee is not suited to erasure-coded object stores. With a replication factor of 3 and a quorum of 2, Cassandra can only tolerate a single node or drive failure within a replication group. Increasing the replication factor and quorum to 5 or higher only makes metadata performance go from bad to worse. Unlike replication, erasure coding can tolerate multiple server and drive failures in a distributed system. Even if you have configured the erasure code setting to 6 parity (any 6 nodes may fail) in a 16-node setup, you are still limited by the weak link, i.e. Cassandra's replication factor (the failure arithmetic is spelled out in a short sketch after this list). The ops team is often unaware of these high-availability surprises until it is too late.
- Object storage systems organize data in a tree-structured, hierarchical namespace. Since Cassandra does not support a hierarchical key namespace, you have to build a tree data model on top of it for each directory prefix, and also maintain a flat list for direct lookups without a directory walk (a sketch of such a dual data model also follows this list). Atomically updating multiple tables with the batched commit log and full read/write quorum is slow and prone to corruption.
- While objects themselves are immutable, the object storage system is mutable. When you add, remove or overwrite objects and their metadata, apply policies, collect metrics, grant session tokens and rotate credentials, the metadata is constantly mutating. Cassandra is not designed to handle this level of metadata mutation, and certainly not for primary storage workloads. Long-term archival use cases, where objects are large (GBs in size) and infrequently accessed, will work; other use cases will not.
The reason is that Cassandra's log-structured storage engine quickly appends new writes to the end of the log file, but defers deletes and overwrites by writing a tombstone marker. Vacuuming these tombstones is an expensive operation, because the actual delete is applied by copying the SSTables to new ones, sieving out the stale entries in the process. This operation has to be performed on all the nodes simultaneously. If you delay vacuuming, excessive tombstones will result in increased read latencies, memory GC pauses and failed queries. Some object storage vendors use an additional Redis database to offload Cassandra's pressure. Using two databases to manage an object store's metadata is hardly elegant and introduces additional points of failure.
The biggest gotcha? You won't see these problems until you are deep into production and it is too late.
- Small objects (KB to MB in size) will fill up the metadata drives dedicated to Cassandra much sooner than the data drives. Small-object workloads also exacerbate Cassandra's limitations, because they are sensitive to latency and consistency issues. Some vendors store small objects entirely inside Cassandra to address this problem. At that point, you are merely looking at an S3 proxy on top of a Cassandra database.
This too is a bad practice.
If you store large objects in your object store under erasure coding while keeping small objects in Cassandra under replication, you have introduced a non-trivial SLA problem: the data is protected by two different guarantees. Given that drives die all the time, the probability of serving a stale or corrupted object goes up considerably.
As noted above, your metadata database is now the weak link. Availability, consistency and durability guarantees are only as good as the weakest link. If the weakest link employs replication (three copies), you can only withstand a one-node or one-drive failure before losing data.
A counter-argument might be to keep five copies. The result is a massive performance hit, and you can still only withstand a two-node or two-drive failure.
By using replication for small objects and erasure coding for large objects, you also undermine the efficiency gains associated with EC. If you only erasure code large objects (likely a small percentage of your overall object pool), you don't gain much but you increase your exposure considerably.
- Employing Cassandra as your metadata database for an object store also introduces a troublesome Java dependency, which in turn can result in bloatware and memory management issues. Cassandra taxes the JVM's memory management with constant, large-scale metadata allocation and mutation, resulting in memory exhaustion and garbage collection pauses.
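To make the consistency point above concrete, here is a minimal sketch using the DataStax Python driver. The cluster addresses, keyspace, table and column names are all hypothetical placeholders; the takeaway is that every read and every write of metadata has to carry QUORUM (or stronger), and that is exactly the setting that costs the most latency.

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Placeholder contact points and a hypothetical keyspace/table.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("objectstore")

# Write object metadata at QUORUM: with RF=3, 2 of 3 replicas must acknowledge.
write = SimpleStatement(
    "INSERT INTO object_meta (bucket, key, etag, size) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("photos", "2021/07/cat.jpg", "abc123", 1048576))

# Read it back. ConsistencyLevel.ONE would be faster, but it may return a stale
# or missing row; QUORUM on both the write and the read is what guarantees the
# read observes the acknowledged write.
read = SimpleStatement(
    "SELECT etag, size FROM object_meta WHERE bucket = %s AND key = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("photos", "2021/07/cat.jpg")).one()
```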
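The high-availability arithmetic referenced above is simple enough to spell out. This is illustrative only, using the replication factors and the 16-drive, 6-parity erasure code layout from the example above:

```python
def cassandra_failures_tolerated(replication_factor: int) -> int:
    """Node losses a replication group can absorb while QUORUM still succeeds."""
    quorum = replication_factor // 2 + 1      # QUORUM = floor(RF / 2) + 1
    return replication_factor - quorum

for rf in (3, 5):
    print(f"RF={rf}: QUORUM={rf // 2 + 1}, "
          f"tolerates {cassandra_failures_tolerated(rf)} failure(s)")
# RF=3: QUORUM=2, tolerates 1 failure(s)
# RF=5: QUORUM=3, tolerates 2 failure(s)

# The erasure-coded data path from the example above (16 drives, 6 parity)
# keeps serving through any 6 simultaneous failures -- but the system as a
# whole is still capped by the metadata path's replication factor.
ec_parity = 6
print(f"EC 10+{ec_parity}: tolerates {ec_parity} failure(s)")
```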
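Finally, here is a hedged sketch of the dual data model the namespace point describes: one table for direct key lookups and one for per-prefix listings, kept in step with a logged batch. All keyspace, table and column names are hypothetical. What matters is that every PUT has to touch multiple partitions together, which pushes you onto one of Cassandra's slowest write paths (logged batches at QUORUM) without giving you isolation or rollback.

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement, BatchType, SimpleStatement

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("objectstore")  # hypothetical keyspace

# Flat table: direct GET/HEAD lookups by full object key.
session.execute("""
    CREATE TABLE IF NOT EXISTS object_meta (
        bucket text, key text, etag text, size bigint,
        PRIMARY KEY ((bucket), key))""")

# Tree table: one row per entry under each directory prefix, for LIST calls.
session.execute("""
    CREATE TABLE IF NOT EXISTS dir_entries (
        bucket text, prefix text, name text, is_dir boolean,
        PRIMARY KEY ((bucket, prefix), name))""")

# Every PUT must update the flat table plus every parent prefix together.
# A LOGGED batch spanning different partitions is eventually replayed, but it
# is not isolated and cannot be rolled back if something goes wrong midway.
batch = BatchStatement(batch_type=BatchType.LOGGED,
                       consistency_level=ConsistencyLevel.QUORUM)
batch.add(SimpleStatement(
    "INSERT INTO object_meta (bucket, key, etag, size) VALUES (%s, %s, %s, %s)"),
    ("photos", "2021/07/cat.jpg", "abc123", 1048576))
batch.add(SimpleStatement(
    "INSERT INTO dir_entries (bucket, prefix, name, is_dir) VALUES (%s, %s, %s, %s)"),
    ("photos", "2021/07/", "cat.jpg", False))
batch.add(SimpleStatement(
    "INSERT INTO dir_entries (bucket, prefix, name, is_dir) VALUES (%s, %s, %s, %s)"),
    ("photos", "2021/", "07/", True))
session.execute(batch)
```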
The obvious takeaway is that it is a lot more complicated to operate a Cassandra cluster than a properly designed object storage system. Cassandra is built for different purposes, and object storage metadata is not one of them. The areas where Cassandra struggles are the areas that are core to a performant, scalable and resilient object store.
The last point is worth emphasizing: object storage is a natural fit for blob data, and that is why erasure coding is so effective and efficient. Cassandra is designed for replication. When you use that model for metadata, it breaks the object store's erasure coding advantage (or at the very least makes it brittle and prone to breakage).
Bottom line. Write your metadata atomically with your object. Never separate them.
We welcome your comments. Feel free to engage us on Twitter, on our Slack channel or by dropping us a note at hello@min.io.