
Make the journal client create tasks for multiple origins instead of one at a time.
ClosedPublic

Authored by vlorentz on Jun 18 2019, 11:58 AM.

Details

Summary

This decreases the load on swh-scheduler, especially on the listener.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Jun 18 2019, 11:58 AM
olasd added a subscriber: olasd.Jun 18 2019, 3:17 PM
olasd added inline comments.
swh/indexer/journal_client.py
25–34

Considering this task only takes origin URLs (i.e. it only does metadata for the latest visit of any given origin), wouldn't it make more sense to do a set of origin URLs, rather than a list of visits?

A lot of the churn has come from rewriting the origin_visit table and sending all the data through to Kafka, which will have generated a lot of hits for each given origin.

vlorentz added inline comments.Jun 18 2019, 3:25 PM
swh/indexer/journal_client.py
25–34

That only helps deduplicate visits which Kafka delivers at the same time. I don't think the journal client is the correct approach for handling an initial load (or a large backfill); instead, a batch schedule of all origins (via the appropriate CLI endpoint) should be used.
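The batching/deduplication idea discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual swh-indexer code; the function name and message shape are assumptions for the example:

```python
def origins_from_visits(visits):
    """Collapse a batch of origin_visit messages into a sorted list of
    unique origin URLs, so a single scheduler task can cover the whole
    batch instead of one task per visit.

    Note: this only deduplicates within one consumed batch; repeated
    visits of the same origin across batches still yield repeated tasks.
    """
    return sorted({visit["origin"] for visit in visits})


# Example: three visit messages, two distinct origins -> one batch of two.
batch = origins_from_visits([
    {"origin": "https://example.org/repo1", "visit": 1},
    {"origin": "https://example.org/repo1", "visit": 2},
    {"origin": "https://example.org/repo2", "visit": 1},
])
```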

ardumont accepted this revision.Jun 19 2019, 10:17 AM
ardumont added a subscriber: ardumont.

Sounds good to me.

This revision is now accepted and ready to land.Jun 19 2019, 10:17 AM
This revision was automatically updated to reflect the committed changes.