Add support for indexing directly from the journal client

Description

Before this commit, the journal client only created scheduler tasks,
which then ran the indexers.

This commit adds a new flow that skips the scheduler and runs the
indexers directly.
The new behavior is triggered by passing a new argument on the CLI:
the name of the indexer to run (currently, only
origin-intrinsic-metadata).
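
As a rough illustration of the two flows, here is a minimal sketch, not the
actual swh-indexer code. FakeScheduler, FakeOriginIndexer, make_worker, the
message shape and the task type name are all hypothetical; only the
process_journal_objects method name comes from the parent commit.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical message shape: {object_type: [object_dict, ...]}
Messages = Dict[str, List[Dict[str, Any]]]


class FakeScheduler:
    """Stand-in for the scheduler; it only records the tasks it is given."""

    def __init__(self) -> None:
        self.tasks: List[Dict[str, Any]] = []

    def create_tasks(self, tasks: List[Dict[str, Any]]) -> None:
        self.tasks.extend(tasks)


class FakeOriginIndexer:
    """Stand-in for an indexer exposing process_journal_objects."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def process_journal_objects(self, objects: Messages) -> Dict[str, int]:
        urls = [origin["url"] for origin in objects.get("origin", [])]
        self.seen.extend(urls)
        return {"indexed": len(urls)}


def make_worker(
    scheduler: FakeScheduler,
    indexer: Optional[FakeOriginIndexer] = None,
) -> Callable[[Messages], None]:
    """Build the callback that the journal client calls for each batch."""

    def worker(messages: Messages) -> None:
        if indexer is not None:
            # New flow: skip the scheduler and run the indexer in-process.
            indexer.process_journal_objects(messages)
        else:
            # Old flow: create one scheduler task per origin.
            # The task type name is a placeholder.
            scheduler.create_tasks(
                [
                    {"type": "index-origin-metadata",
                     "arguments": {"url": origin["url"]}}
                    for origin in messages.get("origin", [])
                ]
            )

    return worker
```

Passing an indexer to make_worker plays the role of the new CLI argument:
when it is absent, the batch still goes through the scheduler.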

This has the following consequences:

  • a crash in an indexer now stops the whole journal client instead of failing a single task (which is arguably good)
  • the journal client will probably need to be parallelized to keep up with the load
  • we can remove an existence check for origins
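
The first consequence can be sketched as follows, assuming the journal
client simply calls the indexer inline for each batch. FlakyIndexer and
consume are hypothetical names used for illustration only.

```python
from typing import Any, Dict, Iterable, List

Batch = Dict[str, List[Dict[str, Any]]]


class FlakyIndexer:
    """Hypothetical indexer that fails on one specific origin URL."""

    def __init__(self, fail_on: str) -> None:
        self.fail_on = fail_on
        self.indexed: List[str] = []

    def process_journal_objects(self, objects: Batch) -> None:
        for origin in objects.get("origin", []):
            if origin["url"] == self.fail_on:
                raise RuntimeError(f"cannot index {origin['url']}")
            self.indexed.append(origin["url"])


def consume(batches: Iterable[Batch], indexer: FlakyIndexer) -> None:
    """Process batches in order; an indexer exception is not caught here,
    so it stops the loop and later batches are never processed."""
    for batch in batches:
        indexer.process_journal_objects(batch)


batches = [
    {"origin": [{"url": "https://example.org/a"}]},
    {"origin": [{"url": "https://example.org/b"}]},
    {"origin": [{"url": "https://example.org/c"}]},
]
indexer = FlakyIndexer(fail_on="https://example.org/b")
try:
    consume(batches, indexer)
    crashed = False
except RuntimeError:
    crashed = True
```

With scheduler tasks, a failure like this would only affect one task; here
the third batch is never reached, which makes the failure visible right away.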

In terms of deployment:

  1. stop the old journal client
  2. wait for all tasks to finish
  3. stop and remove celery workers and queues
  4. start the new journal client (it can reuse the group_id to avoid re-indexing, but I think this is a good opportunity to re-index, given all the temporary failures we had over time)

Details

Provenance
vlorentz authored on May 25 2022, 3:37 PM
vlorentz pushed on May 30 2022, 3:55 PM
Differential Revision
D7899: Add support for indexing directly from the journal client
Parents
rDCIDX35ff46ef5bb2: Refactor base indexers to provide a process_journal_objects method
Branches
Unknown
Tags
Unknown
References
tag: v1.1.0
Build Status
Buildable 29599
Build 46255: test-and-build (Jenkins)