Page MenuHomeSoftware Heritage

Add a journal client base class to process messages
ClosedPublic

Authored by ardumont on Feb 25 2017, 1:16 AM.

Details

Summary

This is a first draft to implement a base client from our swh-journal.

This can be used for example to trigger the creation of new content in
the archiver db.

Related T424
Related T494

Outside of that scope, i see some possible refactoring between the client class (a consumer) and the publisher (which is also a consumer).
Also between the publisher and the swh.storage.listener (the listener is a publisher).

Diff Detail

Repository
rDJNL Journal infrastructure
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont edited the summary of this revision. (Show Details)

doc: Improve docstring and default sample

I agree with the general goal of consistency between the producer and consumer configuration options.

However, I think we should clearly separate both and making one a subclass of the other sounds like a recipe for confusion.

Furthermore, we will need two producers: the "live" producer subscribing to database events, as well as the "catchup" producer which will have a feedback loop between the journal and the database. I think there's more value in providing a base class for both of those rather than for producer and consumer.

swh/journal/client.py
17–34

The topic wording is an internal kafka thing that doesn't really matter (except as an implementation detail) for swh.journal consumers.

Proposed wording:

A base client for the Software Heritage journal.

The current implementation of the journal uses Apache Kafka brokers to publish messages under a given topic prefix, with each object type using a specific topic under that prefix.

Clients subscribe to events specific to each object type by using the `object_types` configuration variable.

Clients can be sharded by setting the `client_id` to a common value across instances. The journal will share the message throughput across the nodes sharing the same client_id.

Messages are processed by the `process_objects` method in batches of maximum `max_messages`.
66

That variable should be made configurable: we should allow consumer to only care about live events without a (big) catchup phase.

I agree with the general goal of consistency between the producer and consumer configuration options.

Nice.

However, I think we should clearly separate both and making one a subclass of the other sounds like a recipe for confusion.

Ok, that sounds fair.

Furthermore, we will need two producers: the "live" producer subscribing to database events, as well as the "catchup" producer which will have a feedback loop between the journal and the database.

I believe what you mention refers to:

  • the live producer is the actual listener defined in swh.storage.listener (subscribed to db events and feeding new object events to the publisher's journal).
  • the catchup producer being the checker mentioned in T494 (which will be 'diffing' the journal against the db and effectively provide back the missing content ids to the journal).

I think there's more value in providing a base class for both of those rather than for producer and consumer.

Ok.

Adaptation according to proposed improvments

  • doc: Improve wording about SWHJournalClient class
  • Add option to allow consumer client to choose fetch policy

A few more nits but this looks reasonable to build upon.

swh/journal/client.py
40

Should be renamed to match the docstring (or the docstring changed to match)

63

Let's give a proper error message (with a ValueError).

While we're checking the configuration, let's hard-code the list of supported object types and check that the configuration matches (again, raising ValueError on unknown values).

olasd edited edge metadata.
This revision is now accepted and ready to land.Mar 2 2017, 6:02 PM

swh.journal.client: Ensure options are correctly set when starting

This revision was automatically updated to reflect the committed changes.