
Estimate for Kafka cluster specifications
Closed, Migrated, Edits Locked

Description

The heart of the journal infrastructure will be an Apache Kafka cluster.

We want 2 replicas for each (partition of each) topic, so we can have a node ("broker") shut down for maintenance without impacting production. More replication is not necessary at least for the first iteration of the production cluster.

We don't have an exact picture of the disk storage requirements yet: we know that we will serialize the data for all nodes in the metadata graph in their canonical form (a hash mapping) using a fairly compact representation (msgpack). This means that the per-row space overhead will be lower than with PostgreSQL, but that the duplication will be higher, notably for directory metadata which will duplicate directory entries.
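
For illustration, here's a minimal sketch of the kind of per-object payload this implies, assuming msgpack encoding of the canonical hash mapping (field names and hash values are placeholders, not the exact swh.journal schema); each directory embeds its full entry list, which is where the duplication comes from:

  import msgpack

  # Illustrative directory object in its canonical "hash mapping" form.
  # Field names and hash values are placeholders, not the real schema.
  directory = {
      "id": b"\x00" * 20,  # 20-byte placeholder hash
      "entries": [
          {
              "name": b"README.md",
              "type": "file",
              "target": b"\x01" * 20,  # placeholder hash of the pointed-to object
              "perms": 0o100644,
          },
      ],
  }

  # Compact binary representation: this would be the value of one Kafka message.
  payload = msgpack.packb(directory, use_bin_type=True)
  print(len(payload), "bytes")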

In sharp contrast to PostgreSQL, Kafka is really designed for sequential reads: the only per-key structure it maintains is the small index used by log compaction to evict superseded duplicate entries from the journal (this compaction mechanism is how updates and deletes are implemented).
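
For concreteness, a hedged sketch of how such a topic could be declared, with replication factor 2 (as discussed above) and key-based compaction; this uses confluent-kafka's admin API, and the broker address and partition count are made up:

  from confluent_kafka.admin import AdminClient, NewTopic

  # Placeholder broker address.
  admin = AdminClient({"bootstrap.servers": "broker1:9092"})

  topic = NewTopic(
      "swh.journal.objects.content",
      num_partitions=64,                     # illustrative partition count
      replication_factor=2,                  # survive one broker being down
      config={"cleanup.policy": "compact"},  # keep only the latest value per key
  )

  # create_topics() returns futures; error handling is omitted in this sketch.
  admin.create_topics([topic])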

I think we should allocate 5-6TB of (raw, unreplicated) disk space, and go from there.

Kafka is now tuned to work well in a JBOD setup: the patches referenced on https://www.slideshare.net/DongLin1/kafka-jbod-linke have all been merged and released by now. There's no concrete need for an underlying RAID0, as Kafka can reliably handle writing to several disks itself (without taking down the entire broker when a single disk goes away).
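
Concretely, the JBOD setup boils down to listing one log directory per physical disk in the broker configuration (paths below are placeholders):

  # server.properties: one log directory per disk, no RAID0 underneath;
  # Kafka spreads partitions across them and tolerates losing a single disk.
  log.dirs=/srv/kafka/data-1,/srv/kafka/data-2,/srv/kafka/data-3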

Memory requirements for the Kafka broker itself are quite low (the current prototype deployment runs fine on a VM with 2 GB of RAM). The main memory requirement comes from the buffer cache, which helps avoid hitting disk for reads from all clients.
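
As a rough sketch, that means capping the broker JVM heap at a couple of gigabytes and leaving the rest of the node's RAM to the kernel page cache; the exact value is an assumption:

  # Environment picked up by kafka-server-start.sh: small fixed heap for the
  # broker, the remaining RAM is left to the OS buffer/page cache.
  export KAFKA_HEAP_OPTS="-Xms2g -Xmx2g"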

We'll have two sets of clients: clients that re-read the full log from the beginning each time, and clients that keep up to date with the journal and only read new messages. There's no point in catering to the former type of client when sizing the buffer cache, as they'll be reading far behind the head of the log and won't be served from cache anyway.
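
A hedged sketch of the two client styles with confluent-kafka (broker address, group ids and the topic choice are made up for the example): the "one-shot" reader uses a fresh group and starts from the earliest offset, while the "synchronous" client keeps a stable group id and resumes from its committed offsets, effectively only seeing new messages:

  from confluent_kafka import Consumer

  BASE = {"bootstrap.servers": "broker1:9092"}  # placeholder broker

  # One-shot client: fresh group id, reads the whole log from the beginning.
  full_reader = Consumer({**BASE,
                          "group.id": "one-shot-export",
                          "auto.offset.reset": "earliest"})

  # Synchronous client: stable group id, resumes from committed offsets and
  # only processes messages appended since its last run.
  tail_reader = Consumer({**BASE,
                          "group.id": "swh-indexer",
                          "auto.offset.reset": "latest"})

  for consumer in (full_reader, tail_reader):
      consumer.subscribe(["swh.journal.objects.revision"])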

We expect to have up to 10 "synchronous" clients in this initial deployment, representing internal clients for the swh infra (archiver, indexer) and external mirrors of the archive. The number of "one-shot" clients will depend on the needs of the day but should remain fairly low (it involves reading the full metadata of the archive, after all). By Kafka standards, this is a very tiny deployment.

To sum up the resource requirements:

  • 6 TB of raw hard disk space; it can be split across devices and nodes, and should ideally be fairly contiguous to get good sequential access performance, but that's not massively critical.
  • 4-6 GB of RAM per node, split between the broker itself (1-2 GB) and the buffer cache (3-4 GB)

Event Timeline

olasd triaged this task as Normal priority. Apr 10 2018, 2:01 PM
olasd created this task.
ftigeot added a subscriber: ftigeot.

Adding a relation to T792 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.

We have 3x 1U servers which will also be used for an Elasticsearch cluster.
Sharing hardware with Elasticsearch is generally a bad idea, especially for storage.
I propose the following setup:

  • One separate Kafka instance per server
  • One dedicated 2TB Kafka HDD per server
  • 2GB of JVM memory per Kafka instance

The "buffer cache" is managed by the operating system and, as far as I know, there isn't a way to dedicate some of it to a particular application.
This will be one more shared resource.

After running the backfiller and the journal writer for a while, the following topic sizes (before replication) have been estimated for the current contents of the archive:

Topic                              Size
swh.journal.objects.content        636.6 GB
swh.journal.objects.directory      18.6 TB
swh.journal.objects.revision       942.5 GB
swh.journal.objects.release        6.7 GB
swh.journal.objects.snapshot       80.6 GB
swh.journal.objects.origin         16.1 GB
swh.journal.objects.origin_visit   150.0 GB
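
Summing these up as a quick back-of-the-envelope check (Python, decimal units):

  sizes_gb = {
      "content": 636.6, "directory": 18_600, "revision": 942.5,
      "release": 6.7, "snapshot": 80.6, "origin": 16.1, "origin_visit": 150.0,
  }
  total_tb = sum(sizes_gb.values()) / 1000
  print(f"{total_tb:.1f} TB unreplicated, {2 * total_tb:.1f} TB with 2 replicas")
  # -> roughly 20.4 TB unreplicated, 40.9 TB with 2 replicas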

This means that the current infra is (ridiculously) undersized.

Follow-up in T1829.