
Add support for indexing directly from the journal client

This commit no longer exists in the repository. It may have been part of a branch which was deleted.

Description

Add support for indexing directly from the journal client

Before this commit, the journal client only created scheduler tasks,
which then ran the indexers.

This commit adds support for a new flow: skipping the scheduler,
to run indexers directly.
This new behavior is triggered by adding a new argument on the CLI,
which is the name of the indexer to run (currently, only
origin-intrinsic-metadata).
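
The two flows can be sketched as follows. This is a minimal illustration, not the code from this commit: `make_handler`, `create_task`, the `indexers` registry and the message shape (`{"url": ...}`) are all hypothetical names chosen for the example. Only the indexer name `origin-intrinsic-metadata` comes from the description above.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]  # assumed shape: {"url": "<origin url>"}


def make_handler(
    create_task: Callable[[List[str]], None],
    indexers: Dict[str, Callable[[List[str]], None]],
    indexer_name: Optional[str] = None,
) -> Callable[[List[Message]], None]:
    """Build the callback applied to each batch of journal messages.

    Without an indexer name, the callback only creates a scheduler task
    (old flow). With one, it runs that indexer directly (new flow).
    """
    if indexer_name is None:
        def schedule(messages: List[Message]) -> None:
            create_task([m["url"] for m in messages])
        return schedule

    if indexer_name not in indexers:
        raise ValueError(f"unknown indexer: {indexer_name}")
    run = indexers[indexer_name]

    def index_directly(messages: List[Message]) -> None:
        run([m["url"] for m in messages])
    return index_directly


# Stand-ins for the scheduler and the indexer, for demonstration only.
scheduled: List[List[str]] = []
indexed: List[str] = []
indexers = {"origin-intrinsic-metadata": indexed.extend}

batch = [{"url": "https://example.org/repo.git"}]
make_handler(scheduled.append, indexers)(batch)  # old flow
make_handler(scheduled.append, indexers, "origin-intrinsic-metadata")(batch)  # new flow
```

In this sketch the choice between the two flows is made once, when the handler is built, so the per-batch code path stays free of branching.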

This has the following consequences:

  • a crash in an indexer will now stall the whole journal client (which is arguably good)
  • the journal client will probably need to be parallelized to keep up with the load
  • we can remove an existence check for origins
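
The first consequence can be illustrated with a small sketch, assuming a crash surfaces as an uncaught Python exception in the consuming loop. The names `consume` and `flaky_indexer` are hypothetical and are not taken from the codebase.

```python
from typing import Callable, Iterable, List

processed: List[str] = []


def flaky_indexer(origins: List[str]) -> None:
    """Hypothetical indexer that fails on one particular origin."""
    if "bad" in origins:
        raise RuntimeError("indexer crashed")
    processed.extend(origins)


def consume(
    batches: Iterable[List[str]],
    run_indexer: Callable[[List[str]], None],
) -> None:
    """Run the indexer on each batch, in order.

    There is deliberately no try/except here: when the indexer is called
    directly, its failure propagates and consumption stops at that batch.
    """
    for batch in batches:
        run_indexer(batch)


crashed = False
try:
    consume([["a"], ["bad"], ["c"]], flaky_indexer)
except RuntimeError:
    crashed = True
# Only the batch before the failing one was indexed; ["c"] was never reached.
```

With scheduler tasks, by contrast, a failing task would not by itself prevent later messages from being turned into further tasks.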

In terms of deployment:

  1. stop the old journal client
  2. wait for all tasks to finish
  3. stop and remove celery workers and queues
  4. start the new journal client (it can reuse the group_id to avoid re-indexing, but I think it is a good opportunity to reindex because of all the temporary failures we had over time)

Details

Provenance
vlorentz Authored on May 25 2022, 3:37 PM
vlorentz Pushed on May 30 2022, 3:55 PM
Differential Revision
D7899: Add support for indexing directly from the journal client
Build Status
Buildable 29599
Build 46255: test-and-build
