We want to create a persistent journal of all additions (and possibly, in the future, modifications) to the Software Heritage storage.
For example, each new tuple added to the content table (i.e., blobs) should have a timestamped entry in the journal; the same goes for each revision (e.g., git commit), release (e.g., git tag), etc.
The journal can then be used as an upstream data source (or "publisher") for various kinds of downstream consumers (or "subscribers"). Two plausible subscribers are:
- batch processors of contents added to the Software Heritage storage, e.g., to compute file types, lines of code, ctags, etc. Changes recorded in the journal can be used to fill the appropriate job queues, which the relevant workers will then consume
- any entity that would like to stay up to date with what happens in the Software Heritage storage but does not necessarily want to be a full mirror (mirrors might need a different infrastructure), e.g., compliance industry partners
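To make the first kind of subscriber concrete, here is a minimal sketch of a consumer that routes journal messages to per-type job queues. The message schema (`{"type": ..., "id": ...}` as JSON) and the in-memory queues are assumptions for illustration, not a settled design:

```python
import json

# hypothetical mapping from object type to a batch-processing job queue;
# in production these would be real queues (e.g. celery), not lists
JOB_QUEUES = {"content": [], "revision": []}

def dispatch(raw_message):
    """Route one journal message to the relevant worker queue.

    The message schema ({"type": ..., "id": ...}) is an assumption
    made for this sketch.
    """
    entry = json.loads(raw_message)
    queue = JOB_QUEUES.get(entry["type"])
    if queue is not None:
        # e.g. schedule file-type / lines-of-code / ctags computation
        queue.append(entry["id"])

dispatch('{"type": "content", "id": "94a9ed024d3859793618152ea559a168bbcbb5e2"}')
```

A worker pool would then drain each queue independently, so slow analyses (e.g. ctags) never block the journal itself.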
To implement this we need at least two components:
- client code that will be used to emit the events that populate the journal. This part can go either in swh.core or, if minimizing dependencies is a concern, in a new, separate top-level swh.journal module (which might, on the other hand, be overkill). The client code will define the submission API used to interact with the journal
- backend code that will store the journal entries. As a first approximation, Apache Kafka might be the right tool for this job
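A rough sketch of what the client-side submission API could look like, with the backend abstracted behind a `send` callable so that a real Kafka producer (e.g. kafka-python's `KafkaProducer.send`) can be plugged in later. All names here (`JournalClient`, the `swh.journal.<type>` topic scheme, the entry fields) are hypothetical, not an existing swh API:

```python
import json
from datetime import datetime, timezone

class JournalClient:
    """Hypothetical submission API for the journal.

    `send` is any callable taking (topic, serialized_message); in
    production it could wrap a Kafka producer.
    """
    def __init__(self, send):
        self._send = send

    def publish_addition(self, object_type, object_id):
        # entry schema is illustrative: object type, intrinsic id, timestamp
        entry = {
            "type": object_type,
            "id": object_id,
            "added": datetime.now(timezone.utc).isoformat(),
        }
        # one topic per object type, e.g. "swh.journal.revision"
        self._send("swh.journal.%s" % object_type, json.dumps(entry))

# demo with an in-memory sink instead of a real Kafka broker
log = []
client = JournalClient(lambda topic, msg: log.append((topic, msg)))
client.publish_addition("revision", "aafb16d69fd30ff58afdd69036a26047f3aebdc6")
```

Keeping the transport behind a thin interface like this is also what would let the code live in swh.core without pulling a Kafka dependency into every package.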
[ this task tracks the result of separate but complementary discussions between myself, @rdicosmo and @olasd ]