Page MenuHomeSoftware Heritage

Allow object removal from journal
Open, NormalPublic


swh.journal is currently missing a way to remove objects from the journal. This would be required to properly process some takedown notices.

Implementation-wise, in Kafka this is part of the log compaction mechanism:

Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. Such a record is sometimes referred to as a tombstone. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space.

Should a delete_object(object_type: str, object_key: KeyType) method be added to swh.journal.writer.interface.JournalWriterInterface?

Event Timeline

vlorentz triaged this task as Normal priority.Oct 27 2022, 11:32 AM

heh that's funny, olasd and I also discussed this yesterday (though we were interested in changing objects rather than deleting):

09:06:17 <+vlorentz> so anyway, the reason I was checking this out, is that we could temporarily set to a low value on to trigger a compaction, so we don't need to change the replayer code for
09:06:17 -- Notice(swhbot): T4644 (submitter: swh-sentry-integration, owner: vlorentz, priority High, status: Open): replayer crashes on invalid directory entry name (b'gitter/gitter.xml') <>
09:06:42 <+vlorentz> if I understand all this correctly
09:07:20 <+olasd> yeah, that's plausible
09:08:00 <+olasd> we should probably set to a low value anyway, there's multiple areas where we actually want frequent compaction
09:08:07 <+olasd> (e.g. takedowns)

Should a delete_object(object_type: str, object_key: KeyType) method be added to swh.journal.writer.interface.JournalWriterInterface?

yes, I think that would make sense. is the process I've followed for a (lower stakes) removal of buggy raw_extrinsic_metadata messages from swh.journal (in staging).

In that specific situation the schema for object keys in kafka had changed (and we hadn't cleaned kafka up after the change), so it made sense to use a journal client to figure out the keys of the objects that needed removal. It probably makes sense to support both options (either removal through known message keys, where you only need to instantiate a producer, or removal using a kafka client with a message matcher).

After discussing D8833 and discussing the general API with @olasd, it feels like adding a delete method to swh.journal.writer.interface.JournalWriterInterface is the wrong approach. To try to capture our discussion

The journal operation is closely aligned with Kafka’s, so deleting data from the journal actually means laying out a tombstone in the journal. We should not rely on consumers of the journal actually seeing these tombstones to handle deletions. If they lag too much, the tombstone will eventually be removed (together with the actual data) from the journal.

JournalWriterInterface is currently implemented for memory, stream and kafka:

  • memory is being used to perform tests, and except from ensuring that clients will not crash when seeing tombstones, deletion should not be a concern otherwise.
  • stream is being used to serialize data from the archive later used as fixtures in our tests. It does not seem to be of any interest to have tombstones in these streams.
  • kafka is where we actually store the journal data.

Therefore, we only should implement delete in KafkaJournalWriter. Takedown requests and other archive mangling procedure will use this explicitly, keeping how Kafka data model works in mind.

It might be worth double-checking and eventually adding a test about how swh.journal.JournalClient handles tombstone, though.

@douardda: would you agree?