
Make the journal client create tasks for multiple origins instead of one at a time.

Authored by vlorentz on Jun 18 2019, 11:58 AM.



Creating tasks in batches decreases the load on swh-scheduler, especially its listener.
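The batching described above can be sketched as follows. This is an illustrative assumption, not the actual swh-indexer code: the `make_batched_task` helper and the task schema (`type`, `arguments`, `policy` keys) are hypothetical stand-ins for whatever the journal client passes to the scheduler.

```python
from typing import Any, Dict, List


def make_batched_task(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build ONE scheduler task covering every origin in a batch of
    journal messages, instead of one task per origin.

    Hypothetical task schema; the point is that a batch of N messages
    yields a single create_tasks call rather than N of them.
    """
    origins = [msg["origin"] for msg in messages]
    return {
        "type": "index-origin-metadata",
        "arguments": {"args": [origins], "kwargs": {}},
        "policy": "oneshot",
    }
```

A journal-client callback would then call something like `scheduler.create_tasks([make_batched_task(batch)])` once per batch, which is what reduces the load on the scheduler's listener.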

Diff Detail

rDCIDX Metadata indexer
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Jun 18 2019, 11:58 AM
olasd added a subscriber: olasd. Jun 18 2019, 3:17 PM
olasd added inline comments.

Considering this task only takes origin URLs (i.e. it only indexes metadata for the latest visit of any given origin), wouldn't it make more sense to pass a set of origin URLs rather than a list of visits?

A lot of the churn has come from rewriting the origin_visit table and sending all the data through to kafka, which will have generated a lot of hits for each given origin.

vlorentz added inline comments. Jun 18 2019, 3:25 PM

That only deduplicates visits that Kafka delivers in the same batch. I don't think the journal client is the right approach for an initial load (or a large backfill); instead, all origins should be scheduled in one batch via the appropriate CLI endpoint.
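The limitation vlorentz points out can be made concrete with a small sketch. The `unique_origins` helper below is hypothetical (not swh-indexer code): it collapses one batch of visit messages to a set of origin URLs, which is exactly the scope of the deduplication — an origin seen again in a *later* batch is scheduled again.

```python
from typing import Any, Dict, List


def unique_origins(messages: List[Dict[str, Any]]) -> List[str]:
    """Collapse one batch of origin-visit messages to unique origin URLs,
    preserving first-seen order.

    Deduplication only applies within this single batch: the same origin
    arriving in a subsequent batch produces another task, which is why a
    journal client is a poor fit for a large backfill.
    """
    seen = set()
    origins = []
    for msg in messages:
        url = msg["origin"]
        if url not in seen:
            seen.add(url)
            origins.append(url)
    return origins
```

For a backfill, a one-off batch schedule over all known origins avoids the repeated hits entirely.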

ardumont accepted this revision. Jun 19 2019, 10:17 AM
ardumont added a subscriber: ardumont.

Sounds good to me.

This revision is now accepted and ready to land. Jun 19 2019, 10:17 AM
This revision was automatically updated to reflect the committed changes.