
Define the requirements for an on-premise Cassandra cluster

Event Timeline

vlorentz triaged this task as Normal priority. Mar 5 2021, 12:35 PM
vlorentz created this task.

Summary of a discussion on 2021-01-05, on using "HDD+fully loaded in RAM" vs "SSD":

The expected size of the database on disk, with compression and without replication, is 5TB.

Very roughly, this means that if we want it to fit in RAM, the RAM usage would be around 10TB, so 30TB post-replication. A computation from @olasd puts this at 528k€ worth of 128GB sticks, or 250k€ worth of 64GB sticks. This means 8-15 or 15-30 servers respectively, as a server can hold 16 or 32 sticks.
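As a back-of-envelope check, the sketch below reproduces these figures. The per-stick prices (~2200€ for 128GB, ~520€ for 64GB) are illustrative assumptions chosen to be roughly consistent with the totals above, not actual quotes.

```python
import math

# Back-of-envelope cost of the "everything in RAM" option.
# Per-stick prices are assumptions, not quotes.
ram_needed_gb = 30 * 1024          # ~10TB in RAM, x3 replication

for stick_gb, price_eur in [(128, 2200), (64, 520)]:
    sticks = math.ceil(ram_needed_gb / stick_gb)
    cost_keur = sticks * price_eur / 1000
    # a server chassis holds 16 or 32 DIMM slots
    servers_min = math.ceil(sticks / 32)
    servers_max = math.ceil(sticks / 16)
    print(f"{stick_gb}GB sticks: {sticks} sticks, ~{cost_keur:.0f}k€, "
          f"{servers_min}-{servers_max} servers")
```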

And that does not even account for the extra RAM needed to allow for growth and migrations. A priori that's too expensive, so that option is out for now.

This means we need SSDs to store the data, as the read workload is almost entirely random.
So, at least 15TB of SSD post-replication; doubling that to allow for some growth plus the extra space needed while migrating data gives a minimum of 30TB of SSD.

In terms of RAM, we currently have a 1/20 RAM-to-disk ratio for the PostgreSQL storage. If we want to keep the same ratio, that's 1.5TB of RAM for the cache. We also need at least 32+8GB per server for Cassandra itself, which is negligible. So that's 1.5TB of RAM total, which is more reasonable; assuming 64GB sticks (because they are cheaper), that's 24 sticks, so we only need two servers to hold that much RAM.

But that's not enough for reasonable replication (1 main and 2 copies), so we need at least 3 servers at any time.

So, we need 30TB of SSD and 1.5TB of RAM spread across 3 servers, which means 10TB of SSD and 0.5TB of RAM per server.
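A minimal sizing sketch tying these numbers together, using only the figures from the discussion (5TB on disk, replication factor 3, a 2x headroom for growth and migrations, the 1/20 RAM-to-disk ratio):

```python
# Minimal sizing sketch; all inputs come from the discussion above,
# nothing here is measured.
db_size_tb = 5                      # compressed, unreplicated, on disk
replication_factor = 3              # 1 primary + 2 copies
headroom = 2                        # growth + temporary space during migrations
servers = 3                         # minimum for 3-way replication

ssd_total_tb = db_size_tb * replication_factor * headroom   # 30TB
ram_cache_tb = ssd_total_tb / 20                            # 1.5TB, 1/20 ratio
ram_sticks_64gb = int(ram_cache_tb * 1024 / 64)             # 24 sticks

print(f"per server: {ssd_total_tb / servers:.0f}TB SSD, "
      f"{ram_cache_tb / servers:.1f}TB RAM "
      f"({ram_sticks_64gb // servers}x 64GB sticks)")
# -> per server: 10TB SSD, 0.5TB RAM (8x 64GB sticks)
```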

Now, we also need a hot spare, so at the very least 4 servers with these specs.

Also, it was implicit in my previous comment, but replication would be done entirely at the Cassandra level, so no RAID. The Cassandra documentation consistently discourages RAID (other than RAID0), as a Cassandra server has no issue using multiple data directories, each mounted from a different disk.

So in summary, the minimal requirements, allowing for replication + migrations + a little growth + hot spare:

  1. 4 servers
  2. 0.5TB of RAM per server, and it should have ECC
  3. 10TB of SSD per server, JBOD. They should probably be NVMe (IIRC, NVMe SSDs are the same price as SAS SSDs)
  4. gigabit router/switch between the servers
  5. we won't need to add more hardware inside these servers after they are productionized; instead, we will add more servers with similar specs as we grow

Benefits from increasing each of these specs:

  1. Spreading the same specs across more servers means it's less expensive to add one more, but it also needs more rack space
  2. More RAM -> more cache + more room for the GC -> faster
  3. Bigger disks -> more room for growth. More disks -> more room for growth + possibly faster as it spreads the load
  4. I don't know whether a gigabit router/NIC would be a bottleneck; see the rough estimate below.
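On that last point, the main case where the link speed clearly matters is streaming a full node's worth of data, e.g. when bootstrapping a replacement node or rebuilding after a disk loss. A rough estimate, where the 10TB per-node figure and the ~70% usable link efficiency are assumptions, not measurements:

```python
# Rough estimate of the time to stream one node's full data set over the
# network (bootstrap/rebuild scenario). Efficiency is an assumed value.
node_data_tb = 10
link_gbit_s = 1
efficiency = 0.7                            # protocol/compaction overhead

throughput_mb_s = link_gbit_s * 1000 / 8 * efficiency    # ~87 MB/s
hours = node_data_tb * 1e6 / throughput_mb_s / 3600
print(f"~{hours:.0f} hours to stream one node at {link_gbit_s} Gbit/s")
# -> ~32 hours; a 10 Gbit/s link would bring this down to a few hours.
```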

Did you consider PMem (and other configurations for Intel Optane memory) in your discussion? It offers a very interesting price/performance ratio.
There are machines on Grid5000 available to test this technology if needed.

@rdicosmo I have not, good idea. While they are probably too expensive to use as the main storage instead of SSDs (either via a regular FS or by using a PMem-aware Cassandra fork), we could use them in addition to the above requirements.

For example, just for the FS journal, which is something we already do for the current objstorage, IIRC.

Cassandra also has its own journal (commitlog_directory). The documentation even says HDDs are fine for this directory, but PMem would probably improve the write latency, I guess? (And we sure do a lot of small commits.)

So in short, it's not clear to me what the gains are (and I don't have time to check them); but we could add a "soft requirement" that the servers should have a couple of Optane slots, so that we have room to upgrade in a couple of years if needed.

Let's organise a call next week to explore the options, including the new testing opportunities that emerged recently.