
Support large messages in swh.journal / kafka
Closed, Migrated

Description

Some objects in the Software Heritage archive are large. As it turns out, even larger than the default max message size in kafka.

The max message size in kafka is set pretty small by default (1 MB), as increasing it raises the memory requirements on brokers when lots of large messages are processed.

We've come up with two choices:

Considering the low number of large objects, increasing the max message size is a quick fix that shouldn't have many downsides, as long as we monitor the impact on kafka brokers.

Splitting large messages has several implications, notably with respect to partitioning and compaction, which means that any solution should be carefully considered.

I'll check the size of some currently "stupidly large" objects, and will bump the max message size accordingly in kafka. A quick test shows that bumping the max message size to 100 MB lets through a synthetic directory with a million entries.
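For reference, a minimal sketch of what raising the limit could look like with confluent-kafka's admin API. The topic name is illustrative, and the task doesn't say whether the change is applied per topic (max.message.bytes) or broker-wide (message.max.bytes):

    # Sketch: raising the per-topic message size limit via confluent-kafka.
    from confluent_kafka.admin import AdminClient, ConfigResource

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder broker

    resource = ConfigResource(ConfigResource.Type.TOPIC, "swh.journal.objects.directory")
    resource.set_config("max.message.bytes", str(100 * 1024 * 1024))  # 100 MB

    # Caveat: alter_configs replaces the resource's non-default configuration
    # wholesale, so any existing topic overrides need to be re-specified.
    for res, future in admin.alter_configs([resource]).items():
        future.result()  # raises if the update failed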

Event Timeline

olasd triaged this task as High priority. Apr 6 2020, 10:31 PM
olasd created this task.
olasd claimed this task.

We've bumped the max message size to 100 MB in all producers.
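On the producer side, the relevant librdkafka knob is message.max.bytes; a minimal sketch, with the broker address and payload as placeholders:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",   # placeholder
        "message.max.bytes": 100 * 1024 * 1024,  # allow messages up to 100 MB
    })

    # A large serialized directory can now go through as a single message,
    # provided the broker/topic limits have been raised accordingly.
    producer.produce("swh.journal.objects.directory",
                     key=b"<object id>", value=b"<serialized object>")
    producer.flush()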

Compression has been enabled in the mirrormaker config that sends objects from the Rocquencourt cluster to Azure, and on the id backfiller which handled T2351.
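In librdkafka terms, enabling compression on a producer is a one-line configuration change; a sketch, where the codec choice (zstd) is an assumption since the task doesn't name one:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",   # placeholder
        "message.max.bytes": 100 * 1024 * 1024,
        "compression.codec": "zstd",             # assumed codec; gzip/snappy/lz4 also possible
    })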

After some empirical tests on the directory topic, consumers now need to set a max message size of 500 MB to be able to read and decompress all objects (otherwise, librdkafka complains about corrupt compression; with debugging enabled, the underlying cause turns out to be that the decompressed message is too large to fit in the receive buffer).
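That consumer-side limit likely maps to librdkafka's fetch.message.max.bytes, with receive.message.max.bytes needing some headroom on top of it; a sketch with placeholder broker, group and topic names:

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",         # placeholder
        "group.id": "example-journal-client",          # placeholder
        "auto.offset.reset": "earliest",
        # per-partition fetch limit, raised to cover the largest decompressed message
        "fetch.message.max.bytes": 500 * 1024 * 1024,
        # overall receive buffer; librdkafka requires headroom above the fetch size
        "receive.message.max.bytes": 500 * 1024 * 1024 + 1024 * 1024,
    })

    consumer.subscribe(["swh.journal.objects.directory"])
    msg = consumer.poll(timeout=10.0)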

We should be able to enable compression on the main producer too, now.