
Make the journal client create tasks for multiple origins instead of one at a time.
ClosedPublic

Authored by vlorentz on Jun 18 2019, 11:58 AM.

Details

Summary

This decreases the load on swh-scheduler, especially on the listener.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Jun 18 2019, 11:58 AM
olasd added a subscriber: olasd.Jun 18 2019, 3:17 PM
olasd added inline comments.
swh/indexer/journal_client.py
25–34

Considering this task only takes origin URLs (i.e. it only does metadata for the latest visit of any given origin), wouldn't it make more sense to do a set of origin URLs, rather than a list of visits?

A lot of the churn has come from rewriting the origin_visit table and sending all the data through to Kafka, which will have generated a lot of hits for each given origin.

vlorentz added inline comments.Jun 18 2019, 3:25 PM
swh/indexer/journal_client.py
25–34

That only helps deduplicate visits which Kafka delivers at the same time. I don't think the journal client is the correct approach for handling an initial load (or a large backfill); instead, a batch schedule of all origins (via the appropriate CLI endpoint) should be used.
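The batching/deduplication idea discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual swh-indexer code; the function name and message shape are assumptions for the example:

```python
def origins_from_visits(visits):
    """Collapse a batch of origin_visit messages into a sorted list of
    unique origin URLs, so a single scheduler task can cover the whole
    batch instead of one task per visit.

    Note: this only deduplicates within one consumed batch; repeated
    visits of the same origin across batches still yield repeated tasks.
    """
    return sorted({visit["origin"] for visit in visits})


# Example: three visit messages, two distinct origins -> one batch of two.
batch = origins_from_visits([
    {"origin": "https://example.org/repo1", "visit": 1},
    {"origin": "https://example.org/repo1", "visit": 2},
    {"origin": "https://example.org/repo2", "visit": 1},
])
```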

ardumont accepted this revision.Jun 19 2019, 10:17 AM
ardumont added a subscriber: ardumont.

Sounds good to me.

This revision is now accepted and ready to land.Jun 19 2019, 10:17 AM
This revision was automatically updated to reflect the committed changes.