When new contents are added to the storage (db softwareheritage), we need to add the equivalent entries to the archiver's table (db softwareheritage-archiver).
This way, the blobs are referenced as contents missing from the archive backups, so that they can be scheduled for archival.
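As a rough illustration of the kind of entries involved, here is a minimal sketch assuming a hypothetical content_archive table with (content_id, archive_id, status) columns; the actual archiver schema may differ:

```python
# Minimal sketch, not the actual swh code: hypothetical content_archive table
# with (content_id, archive_id, status) columns in the archiver db.
import psycopg2

def register_new_contents(archiver_dsn, content_ids, backup_archives=('banco',)):
    """Mark newly added contents as missing from the backup archives,
    so that the archiver will later schedule them for archival."""
    with psycopg2.connect(archiver_dsn) as db:
        with db.cursor() as cur:
            for content_id in content_ids:
                for archive in backup_archives:
                    cur.execute(
                        """INSERT INTO content_archive (content_id, archive_id, status)
                           VALUES (%s, %s, 'missing')
                           ON CONFLICT DO NOTHING""",
                        (content_id, archive))
```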
Description
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T239 preserve at least 2 copies of each content object |
| Migrated | gitlab-migration | T240 content archiver |
| Migrated | gitlab-migration | T424 swh-journal: persistent journal infrastructure to record additions to the swh-storage |
| Migrated | gitlab-migration | T494 swh-journal: archiver-client: Keep archiver table in sync with new contents |
Event Timeline
I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.
I'm still unsure which one would be best, but we need to evaluate the different possibilities before acting. Have you considered alternatives to the synchronous addition done by content_add, or not yet?
> I'm not yet 100% sure that content_add is the place where we want to update the archive table. Another possibility, for instance, would be relying on the upcoming persistent log (T424) and some watcher for it that will update the archiver table.
According to our latest discussion that's the way forward.
Especially so since the archiver and the storage now have separate databases.
This means the storage no longer knows about the archiver's part of the db schema.
Subscribing to a feed of events about new contents will trigger a mechanism that updates the archiver's content table. Thus, when the archiver is triggered, it will just see the contents and deal with them as it's supposed to.
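To make that decoupling concrete, here is a minimal sketch of the archiver side, which only reads the table regardless of how it was filled; the content_archive table name and 'missing' status value are assumptions, not the actual schema:

```python
# Minimal sketch of the archiver side, assuming the same hypothetical
# content_archive table; it only reads the table, regardless of who filled it.
def contents_to_archive(cur, archive_id, limit=1000):
    """Return content ids still missing from the given backup archive."""
    cur.execute(
        """SELECT content_id FROM content_archive
           WHERE archive_id = %s AND status = 'missing'
           LIMIT %s""",
        (archive_id, limit))
    return [row[0] for row in cur.fetchall()]
```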
@qcampos I'd be more inclined to:
- rename this task to content archiver update and make T424 a blocking task.
- and maybe even rename T240 content archiver to content archiver - first run (or something)
What do you think?
Yes, I think we need to split the archival task, to separate the first run that ensures we have a copy of each content from the full archiver, which we will have more time to improve.
I didn't think enough about this difference. As the archival was a high priority, I took the first easy & quick solution that came to mind, forgetting that this update was a long-running problem.
Not sure if there are advantages, but there is no disadvantage to updating the archiver db asynchronously, as the archival itself is asynchronous in the end.
I have a working POC for this which uses swh-journal as a basis.
It adds:
- a swh.journal.client consumer in swh-journal (which reads the publisher's messages, T424). It subscribes to a list of parameterized events.
- a subclass of swh.journal.client, declared in swh.storage.archiver.updater (which needs a better name), whose sole job is registering new contents. On message reception (it only subscribes to content events), it inserts the new id (if not already there) into content_archiver with status present for uffizi and absent for banco and azure (as parameters). A rough sketch follows below.
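Here is a rough sketch of what that updater could look like; the JournalClient base class interface (import path, constructor, callback name) and the content_archiver column names are assumptions, not the actual swh-journal/swh-storage code:

```python
# Rough sketch of the updater; the JournalClient interface and the
# content_archiver column names are assumptions, not the actual swh code.
from swh.journal.client import JournalClient  # assumed import path


class ArchiverUpdater(JournalClient):
    """Listens to content events and keeps the archiver table in sync."""

    def __init__(self, archiver_db, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = archiver_db  # connection to the softwareheritage-archiver db
        # initial status per archive for a brand new content (parameterizable)
        self.statuses = {'uffizi': 'present', 'banco': 'absent', 'azure': 'absent'}

    def process_content(self, content_id):
        """Called for each received content message: insert the new id
        (if not already there) in content_archiver."""
        with self.db.cursor() as cur:
            for archive_id, status in self.statuses.items():
                cur.execute(
                    """INSERT INTO content_archiver (content_id, archive_id, status)
                       VALUES (%s, %s, %s)
                       ON CONFLICT DO NOTHING""",
                    (content_id, archive_id, status))
        self.db.commit()
```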
I'll make differentials (swh-journal, swh-storage) to show it later.
I'm sure there are plenty of possible improvements (and refactoring, e.g. some code shared with the publisher/listener).
Cheers,