The heart of the journal infrastructure will be an Apache Kafka cluster.
We want 2 replicas for each (partition of each) topic, so that a node ("broker") can be shut down for maintenance without impacting production. More replication is not necessary, at least for the first iteration of the production cluster.
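As a deployment sketch, this replication factor is set at topic creation time (the topic name and partition count below are hypothetical placeholders, not decided values):

```
# Create a journal topic with 2 replicas per partition, so one broker
# can be taken down for maintenance without losing availability.
kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --topic swh.journal.example \
    --partitions 16 \
    --replication-factor 2
```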
We don't have an exact picture of the disk storage requirements yet. We know that we will serialize the data for all nodes in the metadata graph in their canonical form (a hash mapping), using a fairly compact representation (msgpack). This means that the per-row space overhead will be lower than with PostgreSQL, but that duplication will be higher, notably for directory metadata, which will duplicate directory entries.
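A minimal sketch of why directory metadata duplicates entries: each directory node's serialization embeds its full entry list, so an entry shared by several directories is stored once per directory. The node shapes below are simplified stand-ins for the real data model, and json is used in place of msgpack only so the example needs nothing beyond the stdlib (the production format is msgpack, which is more compact).

```python
import json

def serialize(node: dict) -> bytes:
    """Serialize a node in a canonical form: sorted keys, no extra whitespace."""
    return json.dumps(node, sort_keys=True, separators=(",", ":")).encode()

# A content (file) node, keyed by its hash.
content = {"sha1": "34973274ccef6ab4dfaaf86599792fa9c3fe4689", "length": 42}

# Two directories that both contain the same entry: the entry metadata
# (name, permissions, target hash) is serialized into each directory's record.
entry = {"name": "README", "perms": 0o100644, "target": content["sha1"]}
dir_a = {"id": "a" * 40, "entries": [entry]}
dir_b = {"id": "b" * 40, "entries": [entry]}

# The shared entry is stored twice, once per parent directory.
duplicated_bytes = len(serialize(dir_a)) + len(serialize(dir_b))
```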
In sharp contrast to PostgreSQL, Kafka is really intended for sequential reads; its only index-like structure is a small per-key index used to evict duplicate entries from the journal (this eviction mechanism, log compaction, is how updates and deletes are implemented).
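For reference, this behavior is enabled per topic with the following settings (the retention value shown is Kafka's default; only the policy itself departs from defaults):

```
# Keep only the latest record for each key instead of deleting by age.
cleanup.policy=compact
# How long delete markers ("tombstones") are retained before being
# compacted away (default: 24 hours).
delete.retention.ms=86400000
```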
I think we should allocate 5-6TB of (raw, unreplicated) disk space, and go from there.
Kafka is now tuned to work well in a JBOD setup: the patches referenced on https://www.slideshare.net/DongLin1/kafka-jbod-linke have all been merged and released. There's no concrete need for an underlying RAID0, as Kafka can reliably handle writing to several disks itself (without taking down the entire broker if only one disk goes away).
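A JBOD setup boils down to listing one log directory per physical disk in the broker configuration (the paths below are hypothetical); Kafka spreads partitions across them and, with the patches above merged, keeps serving from the healthy disks if one fails:

```
# server.properties (sketch): one directory per physical disk, no RAID.
log.dirs=/srv/kafka/disk1,/srv/kafka/disk2,/srv/kafka/disk3
```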
Memory requirements for the Kafka broker itself are quite low (the current prototype deployment runs fine on a VM with 2GB of RAM). The main memory requirement comes from the buffer cache, which helps avoid hitting disk for reads from all clients.
We'll have two sets of clients: clients that read the full log from the top each time, and clients that keep up to date with the journal and only read new messages. There's no point in sizing the buffer cache for the former type of client, as the old data they read will long since have been evicted from the cache anyway.
We expect to have up to 10 "synchronous" clients in this initial deployment, representing internal clients for the swh infra (archiver, indexer) and external mirrors of the archive. The number of "one-shot" clients will depend on the needs of the day, but should remain fairly low (that involves reading the full metadata of the archive, after all). By Kafka standards, this is a very tiny deployment.
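The two client types map onto standard Kafka consumer settings; a sketch, with hypothetical group ids:

```
# One-shot client: use a fresh group each run and replay the full journal.
group.id=oneshot-export-001
auto.offset.reset=earliest

# Synchronous client: keep a stable group, commit offsets, and resume
# from the last committed position to receive only new messages.
group.id=swh-indexer
auto.offset.reset=latest
enable.auto.commit=true
```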
To sum up the resource requirements:
- 6 TB of raw hard disk space; it can be split across devices and nodes, and should be fairly contiguous to get good sequential access performance, but that's not massively critical.
- 4-6 GB of RAM per node, split between the broker itself (1-2 GB) and the buffer cache (3-4 GB).