
Make swh-journal independent from swh-storage or swh-objstorage
Closed, Migrated

Description

Problem

swh-journal and swh-storage are, for most of their content, intertwined and very tightly coupled. Since swh-journal depends on swh-storage, a fix or modification in swh-storage very commonly requires an update in swh-journal. But swh-storage itself depends on swh-journal (for the majority of its tests), so any modification of the journal may require a fix in storage. This interdependency makes their evolution very difficult to manage, and shows that the boundary between the two projects is not in the right place.

So it would make sense to (re)integrate (at least part of) swh-journal in swh-storage and break this circular dependency.

Also, swh-journal depends on swh-objstorage for the content replayer, so it would also make sense to extract the content replayer from the journal module. Since we cannot integrate it in swh-objstorage without introducing a dependency we don't want either, the best option is to provide the content replayer in a dedicated project.

Current situation

swh-journal

Currently, swh-journal is a rather small package (~3000 Python SLOC) and consists of several parts:

1/ writer

The writer (producer) part lives in swh/journal/writer. This component is used exclusively by the storage, to serialize any modification recorded in the storage as a message in the journal. Two implementations of this writer component are provided: a kafka-based one (the "true" journal, used in production) and an in-memory version, for testing purposes. In fact, the JournalWriter API is very simple and consists of a single method (plus a variant of this method):

  • write_addition(object_type, object) where object is (now) expected to be a model entity.
  • write_additions([...]) for a list of objects.

The model object serialization to produce messages sent to the journal is in both cases specific to the kafka backend (the in-memory backend uses the same serialization functions as the kafka one).

So:

  • the journal writer part is very basic and does not depend on the storage,
  • the object serialization part is specific to the journal backend used (kafka) and does not depend on the storage, but only on the model.
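The shape of the writer API described above can be sketched as follows. This is a minimal illustration, not the swh-journal implementation: the class name InMemoryJournalWriter, the JSON-based serialize function (a stand-in for the kafka-specific serialization) and the use of plain dicts instead of model entities are all assumptions made for the example.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Tuple


def serialize(obj: Dict[str, Any]) -> bytes:
    """Hypothetical stand-in for the kafka-specific value serialization."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")


class InMemoryJournalWriter:
    """Sketch of an in-memory writer: it only needs objects and a serializer,
    and knows nothing about the storage."""

    def __init__(self, value_serializer: Callable[[Dict[str, Any]], bytes] = serialize):
        self.objects: List[Tuple[str, bytes]] = []
        self._serialize = value_serializer

    def write_addition(self, object_type: str, object_: Dict[str, Any]) -> None:
        # Record one (object type, serialized message) pair.
        self.objects.append((object_type, self._serialize(object_)))

    def write_additions(self, object_type: str, objects: Iterable[Dict[str, Any]]) -> None:
        # Variant of write_addition for several objects of the same type.
        for object_ in objects:
            self.write_addition(object_type, object_)


writer = InMemoryJournalWriter()
writer.write_addition("origin", {"url": "https://example.org/repo"})
writer.write_additions("origin", [{"url": "a"}, {"url": "b"}])
```

Nothing in this sketch imports or calls the storage, which is the point made above: the writer only depends on the objects it receives and on the serialization function.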

2/ backfiller

The backfiller is a component very specific to the postgresql-based storage; it aims at filling the journal from scratch from an existing (postgresql) storage.

It does not belong in the swh-journal package and should be moved to swh-storage.

3/ client

The journal client part consists of a class that consumes messages from kafka. There is no 'in-memory' implementation available for this component.

There is currently a limited list of accepted object types (which matches what the storage can emit), but this constraint should be moved out of this module. The client neither depends on nor needs the storage, nor even the model: handling incoming messages is the responsibility of the JournalClient user, via a callback.
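To illustrate the callback-based design, here is a minimal sketch of a storage-agnostic client. It does not use kafka and is not the JournalClient class: SketchJournalClient, its constructor arguments and the (object_type, object) message format are hypothetical, chosen only to show that the list of accepted object types can be supplied by the caller.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Message = Tuple[str, dict]
Callback = Callable[[Dict[str, List[dict]]], None]


class SketchJournalClient:
    """Hypothetical client: reads (object_type, object) pairs from any iterable
    and hands them to a user-supplied callback, grouped by object type."""

    def __init__(
        self,
        source: Iterable[Message],
        object_types: Optional[Iterable[str]] = None,
        batch_size: int = 100,
    ):
        self._source = source
        # The accepted types come from the caller, not from the client module.
        self._object_types = set(object_types) if object_types is not None else None
        self._batch_size = batch_size

    def process(self, worker_fn: Callback) -> int:
        """Feed batches to worker_fn; return the number of objects handled."""
        batch: Dict[str, List[dict]] = defaultdict(list)
        pending = 0
        total = 0
        for object_type, obj in self._source:
            if self._object_types is not None and object_type not in self._object_types:
                continue  # skip types the caller did not ask for
            batch[object_type].append(obj)
            pending += 1
            if pending >= self._batch_size:
                worker_fn(dict(batch))
                total += pending
                batch.clear()
                pending = 0
        if pending:
            worker_fn(dict(batch))  # flush the last, partial batch
            total += pending
        return total
```

A caller interested only in origins would pass object_types=["origin"] and a callback of its own; the client itself never needs to know what the callback does with the objects.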

4/ replayer

This component uses the JournalClient for inserting model objects in a storage. In fact, there are 2 replayers in this module, the graph-replayer responsible for filling a storage from a kafka journal, and the content-replayer responsible for filling an objstorage from a kafka journal and a source objstorage.

The graph-replayer obviously depends on the storage module, and should also be moved there. It's a storage-specific piece of code, not a journal specific one.

The content-replayer should be moved to swh-objstorage for the same reasons.
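The reason the graph-replayer is storage-specific can be seen in a small sketch of what such a replayer boils down to: a callback that dispatches each batch of objects to the storage. FakeStorage, make_graph_replayer and the "<object_type>_add" naming convention are assumptions made for this example, not the actual swh-storage API.

```python
from typing import Callable, Dict, List


class FakeStorage:
    """Hypothetical stand-in for a storage; it just records what it receives."""

    def __init__(self) -> None:
        self.added: Dict[str, List[dict]] = {}

    def origin_add(self, objects: List[dict]) -> None:
        self.added.setdefault("origin", []).extend(objects)

    def revision_add(self, objects: List[dict]) -> None:
        self.added.setdefault("revision", []).extend(objects)


def make_graph_replayer(storage) -> Callable[[Dict[str, List[dict]]], None]:
    """Build a journal-client callback that inserts each batch into `storage`."""

    def replay(batch: Dict[str, List[dict]]) -> None:
        for object_type, objects in batch.items():
            # Assumed convention for this sketch: one "<type>_add" method per type.
            method = getattr(storage, f"{object_type}_add", None)
            if method is None:
                raise ValueError(f"unsupported object type: {object_type}")
            method(objects)

    return replay


storage = FakeStorage()
replay = make_graph_replayer(storage)
replay({"origin": [{"url": "https://example.org/repo"}]})
```

All the knowledge in this callback is about the storage (which methods exist, which object types it accepts); the only thing it needs from the journal is to be called with batches of objects.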

swh-storage

swh-storage depends on the JournalWriter both in the storage code itself and in its tests (which use the in-memory journal writer).

So swh-storage depends on the JournalWriter and nothing else.

Proposal

  • move the backfiller to swh-storage
  • move the graph-replayer to swh-storage
  • move the content-replayer to swh-objstorage
  • slightly modify the JournalClient to make it completely storage-agnostic.

Doing so, we should have ("depends on" here includes test dependencies):

  • swh-journal depends on swh-model
  • swh-storage depends on swh-model, swh-journal
  • swh-objstorage depends on swh-journal
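As a quick sanity check of the dependency lists above, the following sketch compares the edges described in the Problem section with the proposed ones and verifies that only the latter are free of cycles. The dictionaries are transcribed from this description (not extracted from the packages' metadata), and is_acyclic is a helper written for this example.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Set


def is_acyclic(deps: Dict[str, Set[str]]) -> bool:
    """Return True if the 'package -> dependencies' graph has no cycle."""
    try:
        # static_order() raises CycleError if no valid ordering exists.
        list(TopologicalSorter(deps).static_order())
    except CycleError:
        return False
    return True


# Edges stated in the Problem section (test dependencies included).
current = {
    "swh-journal": {"swh-storage", "swh-objstorage"},
    "swh-storage": {"swh-journal"},
}

# Edges listed in the Proposal.
proposed = {
    "swh-journal": {"swh-model"},
    "swh-storage": {"swh-model", "swh-journal"},
    "swh-objstorage": {"swh-journal"},
}

print("current is acyclic:", is_acyclic(current))
print("proposed is acyclic:", is_acyclic(proposed))
```

With the proposed layout, swh-model sits at the bottom, swh-journal depends only on it, and both swh-storage and swh-objstorage sit on top of swh-journal.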

Some attention may be needed to ensure continuity of the CLI tools.

Event Timeline

douardda triaged this task as High priority. Apr 9 2020, 4:04 PM
douardda created this task.
douardda updated the task description.
douardda renamed this task from Merge parts of swh-journal in swh-storage to Make swh-journal independant from swh-storage or swh-objstorage. Apr 22 2020, 3:41 PM
douardda updated the task description.
ardumont renamed this task from Make swh-journal independant from swh-storage or swh-objstorage to Make swh-journal independent from swh-storage or swh-objstorage. Apr 22 2020, 3:50 PM
douardda claimed this task.

Let's consider this is done now.