
Make bitbucket origins ingestion concurrent
Closed, Resolved (Public)


The current ingestion is slow. It is a simple sequential loop over the mapping
file (one line per origin) that triggers the ingestion of each origin in turn.

Make it go faster by using the existing loader_oneshot.

For this, we need to iterate over the mapping file and send
swh.loader.mercurial.tasks.LoadMercurial tasks with the proper parameters.

This also has the following advantages:

  • no need to deploy another special worker
  • the systemd logs are pushed to ELK, which makes them easier to browse (at least for staff) [1]


Event Timeline

ardumont renamed this task from "Schedule properly the bitbucket origins" to "Make bitbucket origins ingestion go faster". Aug 3 2021, 10:05 AM
ardumont triaged this task as High priority.
ardumont created this task.
ardumont renamed this task from "Make bitbucket origins ingestion go faster" to "Make bitbucket origins ingestion concurrent". Aug 3 2021, 10:07 AM

With the following change in the snippet code (check commit ^):

swhworker@worker17:~$ cat $base_dir/mapping-to-repos.txt |
while read dir url; do
  visit_date=`stat -c %z $repo_dir/.hg/blackbox.log | sed -E 's/ \+0000/+0000/'`;
  echo $url $repo_dir $visit_date;
done | SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_oneshot.yml \
  python3 --queue-name bitbucket-mercurial |
  tee -a scheduled-origins-bitbucket-20210805.txt
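As a rough illustration of the receiving end of that pipeline, here is a minimal Python sketch that turns "url directory visit_date" lines into task messages shaped like the scheduled entries. It is not the actual snippet from the commit: the function names, the `send` callable and the sample URL, path and date are assumptions made for the example. In a real run, `send` would presumably wrap a scheduler call (for instance something like Celery's `send_task` with the target queue), which is left out here.

```python
from typing import Callable, Iterable, Iterator

# Task name as given in the task description.
TASK_NAME = "swh.loader.mercurial.tasks.LoadMercurial"


def parse_line(line: str) -> dict:
    """Split one 'url directory visit_date' line into loader kwargs.

    The visit date contains a space between date and time, so only the
    first two spaces are treated as field separators.
    """
    url, directory, visit_date = line.rstrip("\n").split(" ", 2)
    return {"url": url, "directory": directory, "visit_date": visit_date}


def build_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one message per non-empty input line."""
    for line in lines:
        if line.strip():
            yield {"args": [], "kwargs": parse_line(line)}


def schedule(lines: Iterable[str], send: Callable[[str, dict], None]) -> int:
    """Pass every message to `send` (hypothetical sender) and count them."""
    count = 0
    for message in build_messages(lines):
        send(TASK_NAME, message)
        count += 1
    return count


# Usage with made-up sample data and a sender that only records messages.
sample = [
    "https://bitbucket.org/example/repo /srv/example/repo "
    "2021-08-05 12:00:00.000000000+0000\n",
]
sent = []
schedule(sample, lambda name, message: sent.append((name, message)))
```

Keeping the sender as a parameter makes the parsing testable without a running scheduler, and the same loop can print the messages so they can be piped to `tee` as in the command above.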

I've re-triggered everything, including the 1k origins that were already run (mozilla-central
is still ongoing, so with the previous concurrency of 1, nothing has budged much since then).

This will also get the systemd logs pushed to Elasticsearch.

This can be followed in the dedicated dashboard created for the occasion [1]


ardumont changed the task status from Open to Work in Progress. Aug 5 2021, 12:45 PM

All messages are queued in the oneshot:swh.loader.mercurial.tasks.LoadArchive queue [1].
They are ingested concurrently by worker17.

[1] Detailed origins queued in swhworker@worker17:~/scheduled-origins-bitbucket-20210805.txt.gz

swhworker@worker17:~$ ls -lah scheduled-origins-bitbucket-20210805.txt.gz
-rw-r--r-- 1 swhworker swhworker 16M Aug  5 17:49 scheduled-origins-bitbucket-20210805.txt.gz
swhworker@worker17:~$ gzip -dc scheduled-origins-bitbucket-20210805.txt.gz - | head -1
{'args': [], 'kwargs': {'url': '', 'directory': '/srv/storage/space/mirrors/boatbucket/refugees/data/f1/f139ec4a-f546-470f-9c65-4bdc07d499dd/main-repo.mercurial', 'visit_date': '2020-04-30 19:32:05.02174075
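To inspect that listing beyond its first line, a small Python sketch along these lines could help. It assumes each line of the file is one complete Python dict literal, as the `head -1` output suggests; the function names are hypothetical and not part of any swh tooling.

```python
import ast
import gzip
from typing import Iterator


def read_scheduled(path: str) -> Iterator[dict]:
    """Yield each scheduled message from a gzip-compressed listing."""
    with gzip.open(path, "rt", encoding="utf-8") as listing:
        for line in listing:
            line = line.strip()
            if line:
                # Entries are Python dict reprs (single quotes), not JSON,
                # so ast.literal_eval is used instead of json.loads.
                yield ast.literal_eval(line)


def count_scheduled(path: str) -> int:
    """Count the messages recorded in the listing."""
    return sum(1 for _ in read_scheduled(path))
```

For example, `count_scheduled("scheduled-origins-bitbucket-20210805.txt.gz")` would give the number of queued origins, under the one-literal-per-line assumption.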
ardumont updated the task description.