
Make the journal client create tasks for multiple origins instead of one at a time.

Authored by vlorentz on Jun 18 2019, 11:58 AM.



Creating tasks in batches decreases the load on swh-scheduler, especially its listener.
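The batching described above can be sketched as follows. This is an illustrative assumption, not the actual swh-indexer code: the `make_batched_task` helper and the task schema (`type`, `arguments`, `policy` keys) are hypothetical stand-ins for whatever the journal client passes to the scheduler.

```python
from typing import Any, Dict, List


def make_batched_task(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build ONE scheduler task covering every origin in a batch of
    journal messages, instead of one task per origin.

    Hypothetical task schema; the point is that a batch of N messages
    yields a single create_tasks call rather than N of them.
    """
    origins = [msg["origin"] for msg in messages]
    return {
        "type": "index-origin-metadata",
        "arguments": {"args": [origins], "kwargs": {}},
        "policy": "oneshot",
    }
```

A journal-client callback would then call something like `scheduler.create_tasks([make_batched_task(batch)])` once per batch, which is what reduces the load on the scheduler's listener.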

Diff Detail

rDCIDX Metadata indexer
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Jun 18 2019, 11:58 AM
olasd added a subscriber: olasd. Jun 18 2019, 3:17 PM
olasd added inline comments.

Considering this task only takes origin URLs (i.e. it only indexes metadata for the latest visit of any given origin), wouldn't it make more sense to pass a set of origin URLs rather than a list of visits?

A lot of the churn has come from rewriting the origin_visit table and sending all the data through to kafka, which will have generated a lot of hits for each given origin.

vlorentz added inline comments. Jun 18 2019, 3:25 PM

That only deduplicates visits that Kafka delivers in the same batch. I don't think the journal client is the right approach for an initial load (or a large backfill); instead, all origins should be scheduled in one batch via the appropriate CLI endpoint.
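The limitation vlorentz points out can be made concrete with a small sketch. The `unique_origins` helper below is hypothetical (not swh-indexer code): it collapses one batch of visit messages to a set of origin URLs, which is exactly the scope of the deduplication — an origin seen again in a *later* batch is scheduled again.

```python
from typing import Any, Dict, List


def unique_origins(messages: List[Dict[str, Any]]) -> List[str]:
    """Collapse one batch of origin-visit messages to unique origin URLs,
    preserving first-seen order.

    Deduplication only applies within this single batch: the same origin
    arriving in a subsequent batch produces another task, which is why a
    journal client is a poor fit for a large backfill.
    """
    seen = set()
    origins = []
    for msg in messages:
        url = msg["origin"]
        if url not in seen:
            seen.add(url)
            origins.append(url)
    return origins
```

For a backfill, a one-off batch schedule over all known origins avoids the repeated hits entirely.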

ardumont accepted this revision. Jun 19 2019, 10:17 AM
ardumont added a subscriber: ardumont.

Sounds good to me.

This revision is now accepted and ready to land. Jun 19 2019, 10:17 AM
This revision was automatically updated to reflect the committed changes.