It decreases the load on swh-scheduler, especially the listener.
- rDCIDX7e6633bf2c0e: Make the journal client create tasks for multiple origins instead of one at a…
Considering this task only takes origin URLs (i.e. it only does metadata for the latest visit of any given origin), wouldn't it make more sense to take a set of origin URLs, rather than a list of visits?
A lot of the churn has come from rewriting the origin_visit table and sending all the data through to Kafka, which will have generated a lot of messages for each given origin.
This only deduplicates visits that are sent through Kafka at around the same time. I don't think the journal client is the correct approach for handling an initial load (or a large backfill). Instead, a batch schedule of all origins (with the appropriate CLI endpoint) should be used.
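To illustrate the suggestion above, here is a minimal sketch of deduplicating origin URLs from a batch of visit messages before creating tasks. The function name and the message shape (`{"origin": ..., "visit": ...}`) are illustrative assumptions, not the actual swh-scheduler journal client API:

```python
def origins_from_visits(visit_messages):
    """Collect unique origin URLs from a batch of visit messages.

    Since the metadata task only needs the origin URL (it operates on
    the latest visit of an origin), a set drops duplicate visits for
    the same origin within the batch. Note this only deduplicates
    within one batch; visits arriving in later batches for the same
    origin will still produce a new task.
    """
    return {msg["origin"] for msg in visit_messages}


visits = [
    {"origin": "https://example.org/repo1", "visit": 1},
    {"origin": "https://example.org/repo1", "visit": 2},
    {"origin": "https://example.org/repo2", "visit": 1},
]

# Three visit messages collapse to two unique origins.
unique_origins = origins_from_visits(visits)
```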