swh-journal: archiver-client: Keep archiver table in sync with new contents
Closed, Migrated

Description

When new contents are added to the storage (db softwareheritage), we need to add the equivalent entries to the archiver's table (db softwareheritage-archiver).
This registers the new blobs as contents missing from the archive backups, so that they will be scheduled for archival.

Event Timeline

qcampos created this object in space S1 Public.

I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.

I'm still unsure which one would be best, but we need to evaluate the different possibilities before acting. Have you considered alternatives to synchronous addition done by content_add or not yet?

> I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.

According to our latest discussion, that's the way forward.

Especially so since the archiver and the storage now have separate databases.
That means the current storage no longer knows the archiver's part of the db schema.

Subscribing to a feed of events about new contents will trigger some mechanism that updates the archiver's content table. Thus the archiver, when triggered, will just see the new contents and deal with them as it's supposed to.
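To illustrate the decoupling described above, here is a minimal sketch. Everything in it is a hypothetical stand-in: an in-memory queue plays the role of the event feed (the real one would be the persistent log from T424), a dict plays the role of the archiver's content table, and the function names are made up for the example.

```python
import queue

# Hypothetical stand-ins: a queue for the event feed, a dict for the
# archiver's content table.
feed = queue.Queue()
archiver_table = {}


def storage_content_add(content_id):
    """Storage side: publish an event about the new content.

    The storage never touches the archiver's table.
    """
    feed.put({"type": "content", "id": content_id})


def archiver_watcher():
    """Archiver side: drain the feed and register contents not seen yet."""
    while True:
        try:
            event = feed.get_nowait()
        except queue.Empty:
            break
        if event["type"] == "content":
            # setdefault leaves already-known contents untouched.
            archiver_table.setdefault(event["id"], "missing")


storage_content_add("sha1:aaaa")
storage_content_add("sha1:bbbb")
archiver_watcher()
```

The storage only needs to know how to publish an event; the knowledge of the archiver's schema stays on the watcher side.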

@qcampos I'd be more inclined to:

  • rename this task to "content archiver update" and make T424 a blocking task for it.
  • and maybe even rename T240 "content archiver" to "content archiver - first run" (or something)

What do you think?

Yes, I think we need to split the archival task, to separate the first run, which ensures we have a copy of each content, from the full archiver, which we will have more time to improve.

I didn't think enough about this difference. As the archival was a high priority, I took the first easy & quick solution that came to mind, forgetting that this update was a long-term problem.
I'm not sure there are advantages, but there is no disadvantage in using an asynchronous update of the archiver db, as the archival itself is asynchronous in the end.

qcampos renamed this task from Improve storage.content_add function with a way to notify the archival db that a new content have been added to Content_archiver update. Jul 20 2016, 11:51 AM
qcampos renamed this task from Content_archiver update to Add a way to update content_archive table when a new content is added.
zack removed qcampos as the assignee of this task. Feb 12 2017, 6:17 PM
zack added a project: Restricted Project.
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Feb 12 2017, 6:39 PM

I have a working POC for this, which uses swh-journal as its basis.

It adds:

  • a swh.journal.client consumer in swh-journal, which reads the publisher's messages (T424) and subscribes to a configurable list of event types.
  • a subclass of swh.journal.client, declared in swh.storage.archiver.updater (which needs a better name), whose sole job is to register new contents. It subscribes only to content events; on each message received, it inserts the new id (if not already there) into content_archiver with status present for uffizi and absent for banco and azure (these are parameters).
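As a rough sketch of the two pieces listed above (not the actual POC code): the class names, the message shape, and the table layout below are all assumptions made for illustration, and an in-memory SQLite table stands in for the archiver's database.

```python
import json
import sqlite3


class JournalClient:
    """Hypothetical base consumer: dispatches messages of subscribed types."""

    def __init__(self, object_types):
        self.object_types = set(object_types)

    def process(self, messages):
        for object_type, payload in messages:
            if object_type in self.object_types:
                self.process_object(object_type, payload)

    def process_object(self, object_type, payload):
        raise NotImplementedError


class ArchiverContentUpdater(JournalClient):
    """Hypothetical subclass: registers new contents in the archiver table."""

    def __init__(self, db, present, absent):
        super().__init__(["content"])  # subscribed to content events only
        self.db = db
        # Initial per-archive status, given as parameters.
        copies = {name: "present" for name in present}
        copies.update({name: "absent" for name in absent})
        self.copies = json.dumps(copies)

    def process_object(self, object_type, payload):
        # INSERT OR IGNORE skips ids that are already in the table.
        self.db.execute(
            "INSERT OR IGNORE INTO content_archive (content_id, copies) "
            "VALUES (?, ?)",
            (payload["sha1"], self.copies),
        )


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE content_archive (content_id TEXT PRIMARY KEY, copies TEXT)"
)
updater = ArchiverContentUpdater(db, present=["uffizi"], absent=["banco", "azure"])
updater.process([
    ("content", {"sha1": "aaaa"}),
    ("revision", {"id": "r1"}),     # ignored: not a subscribed type
    ("content", {"sha1": "aaaa"}),  # ignored: id already registered
    ("content", {"sha1": "bbbb"}),
])
rows = db.execute(
    "SELECT content_id, copies FROM content_archive ORDER BY content_id"
).fetchall()
```

The insert is idempotent, so replaying the same messages leaves the table unchanged.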

I'll make differentials (swh-journal, swh-storage) to show it later.
I'm sure there are plenty of possible improvements (and refactoring, e.g. sharing some common code with the publisher/listener).

Cheers,

Well, strike "POC": I had something that worked at that time.

ardumont renamed this task from Add a way to update content_archive table when a new content is added to Keep archiver table content_archive in sync with new contents. Dec 7 2017, 10:13 PM
ardumont renamed this task from Keep archiver table content_archive in sync with new contents to Keep archiver table in sync with new contents.
ardumont updated the task description.
ardumont added a project: Journal.
ardumont renamed this task from Keep archiver table in sync with new contents to swh-journal: archiver-client: Keep archiver table in sync with new contents. Oct 18 2018, 9:40 AM