When new contents are added to the storage (db softwareheritage), we need to add the equivalent entries to the archiver's table (db softwareheritage-archiver).
This way, the blobs are referenced as contents missing from the archive backups, so that they can be scheduled for archival.
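As a rough illustration of the kind of entries involved, here is a minimal sketch assuming a hypothetical content_archive table with (content_id, archive_id, status) columns; the actual archiver schema may differ:

```python
# Minimal sketch, not the actual swh code: hypothetical content_archive table
# with (content_id, archive_id, status) columns in the archiver db.
import psycopg2

def register_new_contents(archiver_dsn, content_ids, backup_archives=('banco',)):
    """Mark newly added contents as missing from the backup archives,
    so that the archiver will later schedule them for archival."""
    with psycopg2.connect(archiver_dsn) as db:
        with db.cursor() as cur:
            for content_id in content_ids:
                for archive in backup_archives:
                    cur.execute(
                        """INSERT INTO content_archive (content_id, archive_id, status)
                           VALUES (%s, %s, 'missing')
                           ON CONFLICT DO NOTHING""",
                        (content_id, archive))
```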
Description
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T239 preserve at least 2 copies of each content object |
| Migrated | gitlab-migration | T240 content archiver |
| Migrated | gitlab-migration | T424 swh-journal: persistent journal infrastructure to record additions to the swh-storage |
| Migrated | gitlab-migration | T494 swh-journal: archiver-client: Keep archiver table in sync with new contents |
Event Timeline
I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.
I'm still unsure which one would be best, but we need to evaluate the different possibilities before acting. Have you considered alternatives to the synchronous addition done by content_add, or not yet?
> I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.
According to our latest discussion that's the way forward.
Especially so since the archiver and the storage now have separate databases.
This means the storage no longer knows about the archiver's part of the db schema.
Subscribing to a feed of events about new contents will trigger a mechanism that updates the archiver's content table. Thus, when the archiver is triggered, it will just see the contents and deal with them as it's supposed to.
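To make that decoupling concrete, here is a minimal sketch of the archiver side, which only reads the table regardless of how it was filled; the content_archive table name and 'missing' status value are assumptions, not the actual schema:

```python
# Minimal sketch of the archiver side, assuming the same hypothetical
# content_archive table; it only reads the table, regardless of who filled it.
def contents_to_archive(cur, archive_id, limit=1000):
    """Return content ids still missing from the given backup archive."""
    cur.execute(
        """SELECT content_id FROM content_archive
           WHERE archive_id = %s AND status = 'missing'
           LIMIT %s""",
        (archive_id, limit))
    return [row[0] for row in cur.fetchall()]
```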
@qcampos I'd be more inclined to:
- rename this task to content archiver update and make T424 a blocking task.
- and maybe even rename T240 content archiver to content archiver - first run (or something)
What do you think?
Yes, I think we need to split the archival task, to separate the first run that ensures we have a copy of each content from the full archiver, which we will have more time to improve.
I didn't think enough about this difference. As the archival was a high priority, I took the first easy & quick solution that came to mind, forgetting that this update was a long-running problem.
Not sure if there are advantages, but there is no disadvantage to updating the archiver db asynchronously, as the archival itself is asynchronous in the end.
I have a working POC for this which uses swh-journal as a basis.
It adds:
- a swh.journal.client consumer in swh-journal (which reads the publisher's messages, T424). It subscribes to a list of parameterized events.
- a subclass of swh.journal.client, declared in swh.storage.archiver.updater (which needs a better name), whose sole job is registering new contents. On message reception (it only subscribes to content events), it inserts the new id (if not already there) into content_archiver with status present for uffizi and absent for banco and azure (as parameters). A rough sketch follows below.
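Here is a rough sketch of what that updater could look like; the JournalClient base class interface (import path, constructor, callback name) and the content_archiver column names are assumptions, not the actual swh-journal/swh-storage code:

```python
# Rough sketch of the updater; the JournalClient interface and the
# content_archiver column names are assumptions, not the actual swh code.
from swh.journal.client import JournalClient  # assumed import path


class ArchiverUpdater(JournalClient):
    """Listens to content events and keeps the archiver table in sync."""

    def __init__(self, archiver_db, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = archiver_db  # connection to the softwareheritage-archiver db
        # initial status per archive for a brand new content (parameterizable)
        self.statuses = {'uffizi': 'present', 'banco': 'absent', 'azure': 'absent'}

    def process_content(self, content_id):
        """Called for each received content message: insert the new id
        (if not already there) in content_archiver."""
        with self.db.cursor() as cur:
            for archive_id, status in self.statuses.items():
                cur.execute(
                    """INSERT INTO content_archiver (content_id, archive_id, status)
                       VALUES (%s, %s, %s)
                       ON CONFLICT DO NOTHING""",
                    (content_id, archive_id, status))
        self.db.commit()
```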
I'll make differentials (swh-journal, swh-storage) to show it later.
I'm sure there are plenty of possible improvements (and refactoring, e.g. some code shared with the publisher/listener).
Cheers,