
swh-journal: persistent journal infrastructure to record additions to the swh-storage
Closed, Resolved · Public


We want to create a persistent journal of all additions (and maybe modifications, in the future) to the software heritage storage.
For example, each new tuple added to the content table (i.e., blobs) should have a timestamped entry in the journal; same for each revision (e.g. git commit), release (e.g., git tag), etc.
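To make the idea concrete, a journal entry for a newly added content blob could look like the record built below. The field names and layout are hypothetical illustrations, not the actual swh-journal schema.

```python
from datetime import datetime, timezone

def make_journal_entry(object_type, object_id):
    """Build a timestamped journal entry for a newly added object.

    The entry layout (including field names) is hypothetical; it only
    illustrates the idea of one timestamped record per addition.
    """
    return {
        "object_type": object_type,  # e.g. "content", "revision", "release"
        "object_id": object_id,      # e.g. the sha1 of a blob
        "added_at": datetime.now(timezone.utc).isoformat(),
    }

entry = make_journal_entry("content", "34973274ccef6ab4dfaaf86599792fa9c3fe4689")
```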

The journal can then be used as an upstream data source (or "publisher") for various kinds of downstream consumers (or "subscribers"). Two plausible subscribers are:

  • batch processors of contents added to the software heritage storage, e.g., to compute file types, lines of code, ctags, etc. Changes in the journal can be used to fill appropriate job queues that relevant workers will consume
  • any entity who would like to stay up to date with what happens in software heritage storage but does not necessarily want to be a full mirror (mirrors might need a different infrastructure), e.g., compliance industry partners
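The publisher/subscriber relationship described above can be sketched in a few lines of plain Python. The class, method, and job names are invented for illustration and do not correspond to any swh API.

```python
class Journal:
    """Toy publisher: fans each new entry out to all registered subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, entry):
        for callback in self.subscribers:
            callback(entry)

journal = Journal()

# Subscriber 1: a batch processor filling a job queue for workers
# (e.g. file type or ctags computation on new contents).
job_queue = []
journal.subscribe(lambda entry: job_queue.append(("compute-filetype", entry)))

# Subscriber 2: an external partner simply keeping a log of additions.
partner_log = []
journal.subscribe(partner_log.append)

journal.publish({"object_type": "content", "object_id": "abc123"})
```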

To implement this we need at least two components:

  • client code used to emit the events that populate the journal. This part can go either in swh.core or, if minimizing dependencies is a concern, in a new, separate top-level swh.journal module (which might, on the other hand, be overkill). The client code will define the submission API used to interact with the journal
  • backend code that will store the journal entries. As a first approximation, Apache Kafka might be the right tool for this job
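A minimal sketch of what the client-side submission API could look like, assuming a Kafka-like backend behind a `send(topic, message)` method. Everything here (class names, method names, topic naming scheme) is an assumption for illustration only.

```python
import json
from datetime import datetime, timezone

class JournalClient:
    """Hypothetical submission API: one topic per object type."""

    def __init__(self, backend):
        # `backend` stands in for e.g. a Kafka producer; here it only
        # needs a send(topic, message) method.
        self.backend = backend

    def add(self, object_type, object_id):
        """Record the addition of one object in the journal."""
        message = json.dumps({
            "object_id": object_id,
            "added_at": datetime.now(timezone.utc).isoformat(),
        })
        self.backend.send("swh.journal.%s" % object_type, message)

class InMemoryBackend:
    """Stand-in for the real journal backend, for testing."""

    def __init__(self):
        self.messages = []

    def send(self, topic, message):
        self.messages.append((topic, message))

backend = InMemoryBackend()
client = JournalClient(backend)
client.add("revision", "d3adb33f")
```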

[ this task tracks the result of separate but complementary discussions between myself, @rdicosmo and @olasd ]

Event Timeline

We have no guarantee that the internal object ids are monotonic: concurrent transactions can make object_ids of objects go backwards.

We are now settling on a tiered architecture whose components are as follows:

  • the listener pushes notifications for new objects to a new object queue, using triggers and PostgreSQL's built-in NOTIFY support
  • the swh.journal publisher pulls object ids from the new object queue, retrieves the corresponding data from the storage, and pushes it to the journal
  • the swh.journal client knows how to read all the objects from the journal
  • the swh.journal checker compares the lists of objects from the journal (using the swh.journal client) and from the database, and pushes the ids of objects missing from the journal to the new object queue
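Ignoring the PostgreSQL and Kafka specifics, the listener → queue → publisher flow above can be sketched with in-memory stand-ins. All names here are illustrative, not the actual swh.journal implementation.

```python
from collections import deque

# In-memory stand-ins for the real components.
storage = {"id1": {"sha1": "id1", "length": 42}}  # the swh storage
new_object_queue = deque()                        # the new object queue
journal = []                                      # the journal itself

def listener_notify(object_id):
    """What the trigger/NOTIFY machinery achieves: push the id of a
    newly inserted object onto the new object queue."""
    new_object_queue.append(object_id)

def publisher_step():
    """One iteration of the publisher: pull an id from the queue,
    fetch the full object from storage, push it to the journal."""
    object_id = new_object_queue.popleft()
    journal.append(storage[object_id])

listener_notify("id1")  # an INSERT fires the trigger
publisher_step()
```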

The swh.journal checker allows bootstrapping the journal with all the data inserted into the database so far, and can run periodically, since the listener cannot guarantee that every insertion has been noticed.
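At its core, the checker reduces to a set difference between the database's object ids and the journal's, re-enqueuing whatever is missing so the publisher will (re)publish it. A sketch with hypothetical names:

```python
def check(db_ids, journal_ids, new_object_queue):
    """Push the ids of objects present in the database but absent
    from the journal onto the new object queue, so that the
    publisher will pick them up and publish them."""
    missing = set(db_ids) - set(journal_ids)
    # Sorted only to make the enqueue order deterministic here.
    new_object_queue.extend(sorted(missing))
    return missing

queue = []
missing = check({"a", "b", "c"}, {"b"}, queue)
```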

ardumont renamed this task from "persistent journal infrastructure to record additions to the swh-storage" to "swh-journal: persistent journal infrastructure to record additions to the swh-storage". Oct 18 2018, 9:37 AM
ardumont added a subscriber: ardumont.
douardda claimed this task.
douardda added a subscriber: douardda.


gitlab-migration changed the status of subtask T1017: Estimate for Kafka cluster specifications from Resolved to Migrated.