swh-journal: persistent journal infrastructure to record additions to the swh-storage
Closed, Migrated

Assigned To

Authored By

	zack
	May 29 2016, 5:55 PM

Description

We want to create a persistent journal of all additions (and maybe modifications, in the future) to the Software Heritage storage.
For example, each new tuple added to the content table (i.e., blobs) should have a timestamped entry in the journal; the same goes for each revision (e.g., git commit), release (e.g., git tag), etc.

The journal can then be used as an upstream data source (or "publisher") for various kinds of downstream consumers (or "subscribers"). Two plausible subscribers are:

- batch processors of contents added to the Software Heritage storage, e.g., to compute file types, lines of code, ctags, etc. Changes in the journal can be used to fill appropriate job queues that the relevant workers will consume
- any entity that would like to stay up to date with what happens in the Software Heritage storage but does not necessarily want to be a full mirror (mirrors might need a different infrastructure), e.g., compliance industry partners
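
To illustrate the first kind of subscriber, here is a minimal sketch of how journal entries could be routed into per-type job queues. The entry layout (object type, object id) and the `fill_job_queues` helper are assumptions made for this example, not part of any existing swh module.

```python
from collections import defaultdict, deque

# Hypothetical journal entries, in append order: (object_type, object_id).
journal = [
    ("content", "c1"),
    ("revision", "r1"),
    ("content", "c2"),
    ("release", "t1"),
]


def fill_job_queues(entries, wanted_types):
    """Route journal entries of the wanted types into per-type job queues."""
    queues = defaultdict(deque)
    for object_type, object_id in entries:
        if object_type in wanted_types:
            queues[object_type].append(object_id)
    return queues


# A batch processor interested only in contents (e.g. to compute file types).
queues = fill_job_queues(journal, wanted_types={"content"})
next_job = queues["content"].popleft()
```

A worker would then pop ids from its queue and process the corresponding objects; entries of other types are simply ignored by this subscriber.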

To implement this we need at least two components:

- client code that will be used to emit the events that populate the journal. This part can either go in swh.core or, if minimizing dependencies is a concern here, in a new, separate top-level swh.journal module (which, on the other hand, might be overkill). The client code will define the submission API for interacting with the journal
- backend code that will store journal entries. As a first approximation, Apache Kafka might be the right tool for this job
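
As a rough sketch of how these two components could fit together, the example below separates a submission API from a pluggable backend. All names here (`JournalEntry`, `JournalWriter`, `InMemoryBackend`, `record_addition`) are hypothetical, and the in-memory list only stands in for a real backend such as Kafka.

```python
import time
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class JournalEntry:
    """One timestamped record of an addition to the storage."""
    object_type: str  # e.g. "content", "revision", "release"
    object_id: str
    timestamp: float = field(default_factory=time.time)


class JournalBackend(Protocol):
    """Anything able to persist journal entries."""
    def append(self, entry: JournalEntry) -> None: ...


class InMemoryBackend:
    """Stand-in for a real backend; keeps entries in a plain list."""
    def __init__(self) -> None:
        self.entries = []

    def append(self, entry: JournalEntry) -> None:
        self.entries.append(entry)


class JournalWriter:
    """Hypothetical submission API used by client code."""
    def __init__(self, backend: JournalBackend) -> None:
        self.backend = backend

    def record_addition(self, object_type: str, object_id: str) -> JournalEntry:
        entry = JournalEntry(object_type, object_id)
        self.backend.append(entry)
        return entry


backend = InMemoryBackend()
writer = JournalWriter(backend)
writer.record_addition("content", "c1")
writer.record_addition("revision", "r1")
```

Keeping the backend behind a small interface means the client code does not need to depend on a specific message broker.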

[ this task tracks the result of separate but complementary discussions between myself, @rdicosmo and @olasd ]

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T424 swh-journal: persistent journal infrastructure to record additions to the swh-storage
Migrated	gitlab-migration	T440 deploy apache kafka on getty VM
Migrated	gitlab-migration	T507 document licensing of kafka and related client modules
Migrated	gitlab-migration	T494 swh-journal: archiver-client: Keep archiver table in sync with new contents
Migrated	gitlab-migration	T525 Allow bulk-listing of objects by content-id
Migrated	gitlab-migration	T526 Add notifications support to swh.storage
Migrated	gitlab-migration	T527 Insert newly created objects in the journal
Migrated	gitlab-migration	T528 swh-journal: Create a journal client listing objects of a given type
Migrated	gitlab-migration	T529 swh-journal: Create a journal checker comparing object lists between journal and database
Migrated	gitlab-migration	T1017 Estimate for Kafka cluster specifications
Migrated	gitlab-migration	T1275 swh-journal: Complete missing snapshot insertion event from storage to journal
Migrated	gitlab-migration	T1276 swh-journal: Add tests
Migrated	gitlab-migration	T1277 swh-journal: Create a journal client for listing origin visits
Migrated	gitlab-migration	T1278 swh-journal: the monitoring tool question!
Migrated	gitlab-migration	T1279 swh-journal: The schema migration problem
Migrated	gitlab-migration	T1468 journal: Deploy the publisher

Event Timeline

zack created this task. May 29 2016, 5:55 PM

zack added a parent task: T359: Indexers: batch content analyzer infrastructure. May 29 2016, 5:57 PM

zack created subtask T440: deploy apache kafka on getty VM. Jun 13 2016, 4:11 PM

olasd closed subtask T440: deploy apache kafka on getty VM as Resolved. Jun 14 2016, 2:55 PM

zack mentioned this in T494: swh-journal: archiver-client: Keep archiver table in sync with new contents. Jul 18 2016, 5:18 PM

zack created subtask T507: document licensing of kafka and related client modules. Jul 22 2016, 9:27 AM

zack assigned this task to olasd. Jul 22 2016, 9:39 AM

olasd closed subtask T507: document licensing of kafka and related client modules as Resolved. Jul 26 2016, 2:43 PM

qcampos added a subtask: T494: swh-journal: archiver-client: Keep archiver table in sync with new contents. Aug 9 2016, 5:47 PM

olasd created subtask T525: Allow bulk-listing of objects by content-id. Aug 11 2016, 4:34 PM

olasd closed subtask T525: Allow bulk-listing of objects by content-id as Wontfix. Aug 16 2016, 12:35 PM

We have no guarantee that the internal object ids are monotonic: concurrent transactions can make object_ids of objects go backwards.
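
A small simulation of why this matters for incremental listing, under the assumption that ids are drawn from a shared sequence when a transaction starts but only become visible at commit time. The `poll_by_id` helper and the commit schedule are invented for the example.

```python
# (commit_time, object_id): id 1 was allocated first but committed last.
commits = [(1, 2), (2, 1)]


def poll_by_id(visible_ids, last_seen):
    """Naive incremental listing: return visible ids greater than the cursor."""
    return sorted(i for i in visible_ids if i > last_seen)


visible, cursor, listed = set(), 0, []
for _commit_time, object_id in sorted(commits):
    visible.add(object_id)           # the transaction commits
    new_ids = poll_by_id(visible, cursor)
    listed.extend(new_ids)
    if new_ids:
        cursor = max(new_ids)        # advance the cursor past what was seen

missed = visible - set(listed)       # objects the cursor-based reader never saw
```

Here the reader advances its cursor to 2 before object 1 becomes visible, so object 1 is never listed.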

We are now settling on a tiered architecture (the diagram attached to the original comment is not reproduced here).

Its components are as follows:

- the swh.storage listener pushes notifications for new objects to a new object queue, using triggers and PostgreSQL's built-in NOTIFY support
- the swh.journal publisher pulls object ids from the new object queue, retrieves the corresponding data from the storage, and pushes it to the journal
- the swh.journal client knows how to read all the objects from the journal
- the swh.journal checker compares the lists of objects from the journal (using the swh.journal client) and from the database, and pushes the ids of objects missing from the journal to the new object queue
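
The data flow between the listener, the publisher and the client can be sketched with in-memory stand-ins. This is only an illustration: the dict, deque and list below replace the PostgreSQL database, the new object queue and the journal, and the function names are made up.

```python
from collections import deque

storage = {}                # stand-in for the swh.storage database: id -> data
new_object_queue = deque()  # stand-in for the new object queue
journal = []                # stand-in for the journal


def storage_add(object_id, data):
    """Insert an object; the listener role enqueues its id on insertion."""
    storage[object_id] = data
    new_object_queue.append(object_id)


def publisher_run():
    """Drain the queue, fetch each object from storage, append it to the journal."""
    while new_object_queue:
        object_id = new_object_queue.popleft()
        journal.append((object_id, storage[object_id]))


def client_read_all():
    """Return every object currently recorded in the journal."""
    return list(journal)


storage_add("c1", b"hello")
storage_add("c2", b"world")
publisher_run()
objects = client_read_all()
```

Note that only ids travel through the queue; the publisher looks up the full data in the storage before writing to the journal.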

The swh.journal checker allows bootstrapping the journal with all the data that has been inserted into the database so far, and can run periodically, as the swh.storage listener cannot guarantee that all insertions have been noticed.
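
The reconciliation step can be thought of as a set difference. The sketch below assumes both sides can be listed as plain id collections, which is a simplification; `find_missing` is a hypothetical helper.

```python
def find_missing(database_ids, journal_ids):
    """Return ids present in the database but absent from the journal."""
    return sorted(set(database_ids) - set(journal_ids))


database_ids = ["c1", "c2", "c3", "c4"]
journal_ids = ["c1", "c3"]  # the listener missed c2 and c4

# Periodic run: re-enqueue whatever the listener did not notice.
new_object_queue = []
new_object_queue.extend(find_missing(database_ids, journal_ids))

# Bootstrapping is the same operation against an empty journal.
bootstrap_queue = find_missing(database_ids, [])
```

At the scale of a real archive the two lists would not fit in memory, so an actual checker would have to compare them in sorted batches or by ranges.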

olasd created subtask T526: Add notifications support to swh.storage. Aug 16 2016, 6:29 PM

olasd created subtask T527: Insert newly created objects in the journal. Aug 16 2016, 6:31 PM

olasd created subtask T528: swh-journal: Create a journal client listing objects of a given type.

olasd created subtask T529: swh-journal: Create a journal checker comparing object lists between journal and database. Aug 16 2016, 6:34 PM

olasd removed a parent task: T359: Indexers: batch content analyzer infrastructure.

olasd closed subtask T526: Add notifications support to swh.storage as Resolved. Aug 19 2016, 3:56 PM

olasd changed the status of subtask T527: Insert newly created objects in the journal from Open to Work in Progress. Aug 23 2016, 6:14 PM

ardumont mentioned this in D180: Add a journal client base class to process messages. Feb 25 2017, 1:16 AM

ardumont mentioned this in D182: Deploy journal_publisher and archiver_content_updater manifest. Mar 1 2017, 11:15 AM

ardumont mentioned this in D183: Add journal_publisher and archiver_content_updater's configuration.

ardumont mentioned this in rDJNL389a9a34f18f: Add a journal client base class to process messages. Mar 13 2017, 11:17 AM

ardumont mentioned this in rDJNLc3f2cc60ac74: requirements: Add kafka dependency.

ardumont mentioned this in rDJNL4b45b74a975e: doc: Improve wording about SWHJournalClient class.

ardumont mentioned this in rDJNL3831d5f72f9d: swh.journal.client: Ensure options are correctly set when starting.

ardumont mentioned this in rSPSITEb57b0ea2ec43: data/defaults: Add journal_publisher's configuration. Mar 13 2017, 2:38 PM

ardumont mentioned this in rSPPROFea057adf470e: swh::deploy::journal_publisher: Add manifest.

ardumont mentioned this in rDJNL7de23ee5f235: swh.journal.publisher: Use predictable serialization for dict. Mar 24 2017, 12:54 PM

ardumont mentioned this in rSPPROFd039f578fcdf: deploy::journal_simple_checker_producer: Add manifest. Mar 24 2017, 1:52 PM

ardumont mentioned this in rSPPROFb68088a71603: deploy::journal: Read conf directory from configuration. Mar 24 2017, 1:57 PM

ardumont mentioned this in rSPSITE111f3de36cf7: data/defaults: journal_publisher/content_updater: Update topic name. Mar 24 2017, 2:02 PM

ardumont mentioned this in rSPSITEd0563ce46267: data/defaults: publisher: Remove test notion in consumer_id/publisher_id.

ardumont mentioned this in rSPSITE1a7109231298: data/defaults: Add a configuration directory variable for journal.

ardumont mentioned this in rSPSITEea057adf470e: swh::deploy::journal_publisher: Add manifest. Jun 15 2018, 2:29 PM

ardumont mentioned this in rSPSITEd039f578fcdf: deploy::journal_simple_checker_producer: Add manifest.

ardumont mentioned this in rSPSITEb68088a71603: deploy::journal: Read conf directory from configuration.

ardumont renamed this task from persistent journal infrastructure to record additions to the swh-storage to swh-journal: persistent journal infrastructure to record additions to the swh-storage. Oct 18 2018, 9:37 AM

ardumont claimed this task. Oct 18 2018, 3:46 PM

ardumont closed subtask T527: Insert newly created objects in the journal as Resolved.

ardumont added a project: Journal. Oct 18 2018, 4:15 PM

olasd removed a project: Developers. Oct 18 2018, 4:52 PM

ardumont closed subtask T1275: swh-journal: Complete missing snapshot insertion event from storage to journal as Resolved. Jan 13 2019, 12:27 PM

ardumont closed subtask T1277: swh-journal: Create a journal client for listing origin visits as Resolved. Jan 13 2019, 12:31 PM

ardumont closed subtask T494: swh-journal: archiver-client: Keep archiver table in sync with new contents as Invalid. Jan 13 2019, 12:33 PM

ardumont closed subtask T1468: journal: Deploy the publisher as Resolved. Jan 14 2019, 11:33 AM

ardumont closed subtask T1276: swh-journal: Add tests as Resolved. Apr 2 2019, 12:12 PM

ardumont removed ardumont as the assignee of this task. Jul 3 2019, 3:26 PM

ardumont added a subscriber: ardumont.

olasd closed subtask T1017: Estimate for Kafka cluster specifications as Resolved. Aug 23 2019, 6:42 PM

olasd closed subtask T528: swh-journal: Create a journal client listing objects of a given type as Resolved. Sep 3 2019, 1:22 PM

olasd closed subtask T529: swh-journal: Create a journal checker comparing object lists between journal and database as Wontfix. Sep 22 2020, 6:30 PM

yeah!

gitlab-migration changed the status of subtask T440: deploy apache kafka on getty VM from Resolved to Migrated. Oct 19 2022, 5:52 PM

gitlab-migration changed the status of subtask T1017: Estimate for Kafka cluster specifications from Resolved to Migrated.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T525: Allow bulk-listing of objects by content-id from Wontfix to Migrated. Jan 8 2023, 4:19 PM

gitlab-migration changed the status of subtask T526: Add notifications support to swh.storage from Resolved to Migrated.

gitlab-migration changed the status of subtask T527: Insert newly created objects in the journal from Resolved to Migrated.

gitlab-migration changed the status of subtask T528: swh-journal: Create a journal client listing objects of a given type from Resolved to Migrated.

gitlab-migration changed the status of subtask T529: swh-journal: Create a journal checker comparing object lists between journal and database from Wontfix to Migrated.

gitlab-migration changed the status of subtask T1278: swh-journal: the monitoring tool question! from Duplicate to Migrated. Jan 8 2023, 4:25 PM

gitlab-migration changed the status of subtask T1279: swh-journal: The schema migration problem from Wontfix to Migrated.

gitlab-migration changed the status of subtask T1468: journal: Deploy the publisher from Resolved to Migrated.

gitlab-migration changed the status of subtask T494: swh-journal: archiver-client: Keep archiver table in sync with new contents from Invalid to Migrated. Jan 8 2023, 9:56 PM

gitlab-migration changed the status of subtask T507: document licensing of kafka and related client modules from Resolved to Migrated.

gitlab-migration changed the status of subtask T1275: swh-journal: Complete missing snapshot insertion event from storage to journal from Resolved to Migrated.

gitlab-migration changed the status of subtask T1276: swh-journal: Add tests from Resolved to Migrated.

gitlab-migration changed the status of subtask T1277: swh-journal: Create a journal client for listing origin visits from Resolved to Migrated.
