Add support for indexing directly from the journal client
Closed, Public

Authored by vlorentz on May 25 2022, 3:40 PM.

Details

Summary

Before this commit, the journal client only created scheduler tasks,
which then ran the indexers.

This commit adds support for a new flow that skips the scheduler
and runs the indexers directly.
The new behavior is triggered by passing a new argument on the CLI:
the name of the indexer to run (currently, only
origin-intrinsic-metadata).

This has the following consequences:

  • a crash in an indexer will now hang the whole journal client (which is arguably good)
  • the journal client will probably need to be parallelized to keep up with the load
  • we can remove an existence check for origins
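
To make the difference between the two flows concrete, here is a minimal sketch. It is not the code from this diff: FakeScheduler, FakeOriginIndexer, make_worker_fn, the "index-origin-metadata" task type, the "origin_visit_status" object type and the "corrupt" flag are all hypothetical stand-ins. Only the process_journal_objects method name comes from the parent commit. It also shows why an indexer crash now stops the client: the exception is raised inside the journal client's own callback instead of inside a separate worker.

```python
from typing import Callable, Dict, List, Optional

Batch = Dict[str, List[dict]]  # object type -> list of journal messages


class FakeScheduler:
    """Hypothetical stand-in for the scheduler: it only records tasks."""

    def __init__(self) -> None:
        self.tasks: List[dict] = []

    def create_tasks(self, tasks: List[dict]) -> None:
        self.tasks.extend(tasks)


class FakeOriginIndexer:
    """Hypothetical indexer exposing a process_journal_objects method."""

    def __init__(self) -> None:
        self.indexed: List[str] = []

    def process_journal_objects(self, objects: Batch) -> None:
        for visit_status in objects.get("origin_visit_status", []):
            if visit_status.get("corrupt"):
                # Simulated indexer crash; nothing catches it below.
                raise RuntimeError(f"cannot index {visit_status['origin']}")
            self.indexed.append(visit_status["origin"])


def make_worker_fn(
    scheduler: FakeScheduler, indexer: Optional[FakeOriginIndexer] = None
) -> Callable[[Batch], None]:
    """Build the callback the journal client would call for each batch."""

    def worker_fn(objects: Batch) -> None:
        if indexer is None:
            # Old flow: only enqueue tasks; workers run the indexers later.
            scheduler.create_tasks(
                [
                    {"type": "index-origin-metadata", "origin": vs["origin"]}
                    for vs in objects.get("origin_visit_status", [])
                ]
            )
        else:
            # New flow: index synchronously, inside the journal client.
            indexer.process_journal_objects(objects)

    return worker_fn
```

In this sketch the old flow returns as soon as the tasks are recorded, while the new flow does the indexing work before returning, which is also why the client may need to be parallelized to keep up.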

In terms of deployment:

  1. stop the old journal client
  2. wait for all tasks to finish
  3. stop and remove celery workers and queues
  4. start the new journal client (it can reuse the group_id to avoid re-indexing, but I think it is a good opportunity to reindex because of all the temporary failures we had over time)
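
The group_id trade-off in step 4 can be illustrated with a toy model of committed offsets. This is only a sketch under assumptions: ToyJournal and the group names are hypothetical, and it assumes a consumer group with no committed offset starts from the beginning of the topic (the equivalent of Kafka's auto.offset.reset=earliest), which this revision does not state.

```python
from typing import Dict, List


class ToyJournal:
    """In-memory stand-in for one topic with per-group committed offsets."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = list(messages)
        self.committed: Dict[str, int] = {}  # group_id -> next offset to read

    def append(self, message: str) -> None:
        self.messages.append(message)

    def consume(self, group_id: str) -> List[str]:
        # Assumption: a group with no committed offset starts at offset 0.
        start = self.committed.get(group_id, 0)
        batch = self.messages[start:]
        # Commit the new position for this group only.
        self.committed[group_id] = len(self.messages)
        return batch


journal = ToyJournal(["origin-a", "origin-b"])
journal.consume("old-group")         # the old client already read everything
journal.append("origin-c")

resumed = journal.consume("old-group")    # reused group_id: only new messages
reindexed = journal.consume("new-group")  # fresh group_id: the full history
```

Here resumed holds only the message appended after the old client stopped, whereas reindexed holds every message, which corresponds to the reindexing option mentioned in step 4.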

Part of T4273.

Depends on D7893.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7899 (id=28484)

Could not rebase; Attempt merge onto 32236e69eb...

Updating 32236e6..ecc84f3
Fast-forward
 swh/indexer/cli.py                        |  60 ++++++++++----
 swh/indexer/indexer.py                    |  64 ++++++++++++---
 swh/indexer/metadata.py                   |  27 ++++---
 swh/indexer/tests/test_cli.py             | 130 +++++++++++++++++++++++++++++-
 swh/indexer/tests/test_indexer.py         |  18 ++++-
 swh/indexer/tests/test_origin_metadata.py |   2 +-
 6 files changed, 260 insertions(+), 41 deletions(-)
Changes applied before test
commit ecc84f37600dde55754c7fa783707d21a6ace293
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed May 25 15:37:13 2022 +0200

    Add support for indexing directly from the journal client
    
    Before this commit, the journal client only created scheduler tasks,
    which then run the indexers.
    
    This commit adds support for a new flow: skipping the scheduler,
    to run indexers directly.
    This new behavior is triggered by adding a new argument on the CLI,
    which is the name of the indexer to run (currently, only
    `origin-intrinsic-metadata`).
    
    This has the following consequences:
    
    * a crash in an indexer will now hang the whole thing (which is
      arguably good)
    * the journal client will probably need to be parallelized to
      keep up with the load
    * we can remove an existence check for origins
    
    In term of deployment:
    
    1. stop the old journal client
    2. wait for all tasks to finish
    3. stop and remove celery workers and queues
    4. start the new journal client (it can reuse the group_id to avoid
       re-indexing, but I think it is a good opportunity to reindex because
       of all the temporary failures we had over time)

commit 35ff46ef5bb2acf6d93b1cc18f9ad1378d09cba8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed May 25 14:47:41 2022 +0200

    Refactor base indexers to provide a process_journal_objects method
    
    It will be used in a future commit to run indexers directly from
    a journal client.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/231/ for more details.

This revision is now accepted and ready to land. May 30 2022, 2:01 PM