
swh.journal silently loses large objects instead of rejecting them
Closed, Migrated

Description

Thanks to @seirl's export of the graph using swh.journal, we've noticed that large objects (directories with lots of entries, revisions with lots of parents, snapshots with lots of branches) are missing from the journal.

After analyzing the missing directories, here are some stats:

  • directories present in old export, missing in new export: 103405
  • directories actually archived in Software Heritage: 27046
    • I guess the discrepancy comes from the fact that the export is a list of "nodes referenced by edges" rather than an actual list of fully archived nodes.
  • minimum size: 1 entry (for 3179 directories)
  • maximum size: 881720 entries (for a single directory)

There's clearly a skew towards "large" directories. A CSV with the breakdown of the number of missing directories by number of entries follows.

After some examination, the underlying bug has been found: when developing the journal writer, I assumed that failure to persist a message would be a fatal error in the Kafka producer, reported through the default error callback. It turns out that, if you need reliable message delivery, you have to explicitly service the per-message delivery callbacks.
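
For illustration, here is a minimal sketch of servicing delivery callbacks with the confluent-kafka Python client; the broker address, topic name and error handling are placeholders, not our actual configuration:

```python
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'kafka:9092'})  # placeholder address

def on_delivery(err, msg):
    # Invoked once per message, but only when the producer is serviced
    # through poll()/flush(); err is None iff the broker persisted it.
    if err is not None:
        print('message dropped: %s' % err)

producer.produce('swh.journal.objects.directory', value=b'...',
                 on_delivery=on_delivery)

# Without these calls, the callback never fires, and a per-message error
# such as MSG_SIZE_TOO_LARGE is silently swallowed: the default error
# callback only reports global/transport-level failures.
producer.poll(0)
producer.flush()
```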

The most common Kafka error, apart from transient issues, is "message size too large". It turns out that every instance of this resulted in a message silently dropped on the floor, rather than a retry on the backend.

The default max message size for our Kafka brokers is around 1 MB.
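
For context, the size cap is enforced at several layers. Here is a sketch using the standard Kafka/librdkafka option names; the 10 MB value is purely illustrative, not a decision:

```python
from confluent_kafka import Producer

# Producer-side cap: librdkafka refuses to enqueue messages larger than
# this (default is about 1 MB). Raising it only helps if the brokers
# accept the larger size too.
producer = Producer({
    'bootstrap.servers': 'kafka:9092',   # placeholder address
    'message.max.bytes': 10 * 1024 * 1024,
})

# Broker-side counterparts (server.properties / per-topic config):
#   message.max.bytes         - largest record batch a broker accepts
#   replica.fetch.max.bytes   - must be >= message.max.bytes, or
#                               oversized messages won't replicate
```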

This issue pulls on a bunch of threads:

  • we need to make sure swh.journal listens for Kafka message delivery reports, and rejects additions when some objects are not acknowledged.
  • we need to find a way to handle large messages in swh.journal. After a short discussion, the following two possibilities have come up:
    1. increase the max message size on the Kafka brokers (for messages inbound from producers, as well as for replication).
    2. find a way to split objects across several messages (e.g. https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297); a sketch of this approach follows the list.
  • we need to double-check the contents of our current journal topics, and backfill the objects that were rejected by Kafka while being properly ingested into PostgreSQL.
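
To make option 2 above concrete, here is a rough sketch of producer-side chunking in the spirit of the linked slides. The framing scheme (a shared key plus hypothetical chunk_index/chunk_count headers) is made up for illustration; it is not something swh.journal implements:

```python
from confluent_kafka import Producer

CHUNK_SIZE = 900 * 1024  # stay safely under a ~1 MB broker limit; illustrative

def produce_chunked(producer, topic, key, value):
    # Split `value` into CHUNK_SIZE pieces that share the same key, so
    # every chunk lands in the same partition and stays in order.
    chunks = [value[i:i + CHUNK_SIZE]
              for i in range(0, len(value), CHUNK_SIZE)] or [b'']
    for index, chunk in enumerate(chunks):
        producer.produce(
            topic, key=key, value=chunk,
            headers=[('chunk_index', str(index).encode()),
                     ('chunk_count', str(len(chunks)).encode())],
        )
        producer.poll(0)  # keep servicing delivery callbacks as we go
```

A consumer would then buffer chunks by key until chunk_count pieces have arrived, and concatenate them back into the original object.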

There's also the long(er) term question of setting explicit object size limits in the archive, and storing that information rather than failing an import midway. I guess this would be a use case for T1957.

Event Timeline

olasd triaged this task as High priority. Apr 6 2020, 10:22 PM
olasd created this task.
olasd claimed this task.

The Kafka producer in swh.journal now reads message delivery reports and fails if they're negative, or if they don't arrive within two minutes.
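
In spirit, the change amounts to the following (a sketch under the same placeholder assumptions as above, not the actual swh.journal code; objects_to_write is a hypothetical iterable of serialized objects):

```python
from confluent_kafka import Producer, KafkaException

producer = Producer({'bootstrap.servers': 'kafka:9092'})  # placeholder

errors = []

def on_delivery(err, msg):
    if err is not None:       # negative receipt: message not persisted
        errors.append(err)

for obj in objects_to_write:  # hypothetical iterable of serialized objects
    producer.produce('swh.journal.objects.directory', value=obj,
                     on_delivery=on_delivery)
    producer.poll(0)

# Wait up to two minutes for the outstanding delivery reports.
remaining = producer.flush(timeout=120)
if remaining > 0:
    raise RuntimeError('%d message(s) undelivered after 120s' % remaining)
if errors:
    raise KafkaException(errors[0])
```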