
Make the journal client create tasks for multiple origins instead of one at a time.
Closed · Public

Authored by vlorentz on Jun 18 2019, 11:58 AM.

Details

Summary

Creating tasks in batches decreases the load on swh-scheduler, especially the listener.
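
For illustration, batched task creation in the journal client could look like the sketch below; the message shape and the create_task_dict / create_tasks helpers are assumptions based on swh-scheduler conventions, not the actual diff:

    # Hedged sketch: create one scheduler task per journal batch instead of
    # one task per origin_visit message. Names are assumptions, not the diff.
    from swh.scheduler.utils import create_task_dict

    def process_journal_objects(messages, scheduler, task_names):
        assert set(messages) <= {'origin_visit'}
        visits = messages['origin_visit']
        # A single task carrying every origin URL in the batch, instead of
        # len(visits) separate tasks hitting the scheduler listener.
        task = create_task_dict(
            task_names['origin_metadata'],
            'oneshot',
            [visit['origin']['url'] for visit in visits],
        )
        scheduler.create_tasks([task])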

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
journalclient-metadata-task-batch
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6273
Build 8679: tox-on-jenkins (Jenkins)
Build 8678: arc lint + arc unit

Event Timeline

olasd added inline comments.
swh/indexer/journal_client.py
25–34

Considering this task only takes origin URLs (i.e. it only does metadata for the latest visit of any given origin), wouldn't it make more sense to use a set of origin URLs, rather than a list of visits?

Much of the churn has come from rewriting the origin_visit table and sending all of its data through Kafka, which will have generated many hits for each given origin.
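
A hedged sketch of that suggestion, reusing the assumed names from the summary sketch above; the set collapses repeated visits of the same origin into a single entry:

    # Deduplicate within the batch: many origin_visit messages for one
    # origin only need a single metadata-indexing task.
    origin_urls = {visit['origin']['url'] for visit in visits}
    task = create_task_dict(
        task_names['origin_metadata'],
        'oneshot',
        sorted(origin_urls),  # sorted only to make task arguments deterministic
    )
    scheduler.create_tasks([task])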

swh/indexer/journal_client.py
25–34

That only deduplicates visits which Kafka delivers in the same batch. I don't think the journal client is the right approach for an initial load (or a large backfill); instead, a batch schedule of all origins (with the appropriate CLI endpoint) should be used.
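
For illustration, such a one-off backfill might look like the sketch below; origin_get_range, the chunk size, and the task name are assumptions, not an existing CLI endpoint:

    # Hypothetical one-shot backfill: schedule every known origin in
    # fixed-size chunks, bypassing the journal entirely.
    from swh.scheduler.utils import create_task_dict

    def schedule_all_origins(storage, scheduler, task_name, chunk_size=1000):
        origin_from = 0
        while True:
            origins = list(storage.origin_get_range(
                origin_from=origin_from, origin_count=chunk_size))
            if not origins:
                break
            urls = [origin['url'] for origin in origins]
            scheduler.create_tasks(
                [create_task_dict(task_name, 'oneshot', urls)])
            origin_from = origins[-1]['id'] + 1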

ardumont added a subscriber: ardumont.

Sounds good to me.

This revision is now accepted and ready to land. · Jun 19 2019, 10:17 AM