Thanks to @seirl's export of the graph using swh.journal, we've noticed that large objects (directories with lots of entries, revisions with lots of parents, snapshots with lots of branches) are missing from the journal.
After analyzing the missing directories, here are some stats:
- directories present in old export, missing in new export: 103405
- directories actually archived in Software Heritage: 27046
- I guess the discrepancy comes from the fact that the export is a list of "nodes referenced by edges" rather than an actual list of fully archived nodes.
- minimum size: 1 entry (for 3179 directories)
- maximum size: 881720 entries (for a single directory)
There's clearly a skew towards "large" directories. A CSV with the breakdown of the number of missing directories by number of entries follows.
After some examination, the underlying bug has been found: when developing the journal writer, I assumed that failing to persist a message would be a fatal error in the kafka producer, reported through the default error callback. It turns out that, if you need reliable message delivery, you have to explicitly service the message delivery callback.
The most common kafka error, aside from transient issues, is "message size too large". It turns out that every instance of this was a message dropped on the floor, rather than retried on the backend.
The default max message size for our Kafka brokers is around 1 MB.
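For illustration, here's a minimal sketch of what explicitly servicing the delivery callback looks like with the confluent-kafka Python client (broker address, function names and error reporting are placeholders, not the actual swh.journal code):

```python
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'kafka:9092'})  # placeholder broker

failed = []

def on_delivery(err, msg):
    # Called once per produced message, after the broker has acked or
    # rejected it. Errors such as MSG_SIZE_TOO_LARGE show up here, not in
    # the default error callback, so they are silently lost unless checked.
    if err is not None:
        failed.append((msg.topic(), msg.key(), str(err)))

def write_object(topic, key, value):
    producer.produce(topic, key=key, value=value, on_delivery=on_delivery)
    producer.poll(0)  # service delivery callbacks for already-acked messages

# ... produce a batch of objects with write_object() ...

producer.flush()  # block until all outstanding deliveries have been reported
if failed:
    raise RuntimeError('%d messages were not persisted to kafka' % len(failed))
```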
This issue pulls on a bunch of threads:
- we need to make sure swh.journal listens for kafka message deliveries (as sketched above), and rejects additions where some objects were not acknowledged.
- we need to find a way to handle large messages in swh.journal. After a short discussion, the following two possibilities have come up:
- increase the max message size on kafka brokers (for messages inbound from producers, as well as for replication); see the configuration sketch after this list.
- find a way to split objects into several messages (e.g. https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297)
- we need to double-check the contents of our current journal topics, and backfill the objects that were rejected by kafka even though they were properly ingested in PostgreSQL.
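For the first option (raising the limits), the knobs involved would look roughly like this; the values are illustrative, assume the confluent-kafka/librdkafka client, and are not a vetted configuration:

```python
# Client side (librdkafka / confluent-kafka): the producer also has a ~1 MB
# default limit, so it must be raised to match whatever the brokers accept.
producer_config = {
    'bootstrap.servers': 'kafka:9092',        # placeholder
    'message.max.bytes': 100 * 1024 * 1024,   # illustrative 100 MB ceiling
}

# Broker side (server.properties), same illustrative value:
#   message.max.bytes=104857600         # max message size accepted from producers
#   replica.fetch.max.bytes=104857600   # must be >= message.max.bytes so
#                                       # replication can copy large messages
#
# Consumers (e.g. the journal clients) need matching fetch limits, e.g.
# librdkafka's fetch.message.max.bytes / receive.message.max.bytes.
```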
There's also the longer-term question of setting explicit object size limits in the archive, and storing that information rather than failing an import midway. I guess this would be a use case for T1957.