After discussing with the upstream maintainers of the scanoss tool, Roberto compiled a list
of GitHub repositories (large [1] and normal [2]) that we are currently missing. Let's try to
ingest them using the approach we took for the chromium repository [3].
FWIW, a huge number of those are reported by Sentry [6].
Plan:
- Clean up the setup of the large workers (worker17 and worker18) and keep them out of the standard consumption loop [4]
- Schedule large repositories on the dedicated queue oneshot:swh.loader.git.tasks.UpdateGitRepository
- Schedule normal repositories on the dedicated queue oneshot2:swh.loader.git.tasks.UpdateGitRepository
- Keep parallelism low (large repo queue: 1, normal repo queue: 5)
- Babysit the processes (Grafana dashboard [5])
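
To make the queue split in the plan concrete, here is a minimal sketch of how the two origin lists could be mapped to their dedicated queues and concurrency limits. It is only an illustration: `QueuePlan`, `build_plan` and `to_messages` are hypothetical helpers, the example.org URLs are placeholders, and the `url` keyword for the task is an assumption. It does not call the scheduler; it only builds the data that the actual scheduling step would consume.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple

# Task name taken from the plan above.
TASK_NAME = "swh.loader.git.tasks.UpdateGitRepository"


@dataclass(frozen=True)
class QueuePlan:
    """Hypothetical container: one dedicated queue and its settings."""

    queue: str
    concurrency: int
    origins: Tuple[str, ...]


def build_plan(large: Iterable[str], normal: Iterable[str]) -> List[QueuePlan]:
    """Map large and normal origins to their dedicated queues.

    Concurrency values follow the plan: 1 for large, 5 for normal.
    """
    return [
        QueuePlan(f"oneshot:{TASK_NAME}", 1, tuple(large)),
        QueuePlan(f"oneshot2:{TASK_NAME}", 5, tuple(normal)),
    ]


def to_messages(plan: QueuePlan) -> Iterator[Dict[str, object]]:
    """Yield one message per origin (the 'url' kwarg is an assumption)."""
    for origin in plan.origins:
        yield {"queue": plan.queue, "task": TASK_NAME, "kwargs": {"url": origin}}


if __name__ == "__main__":
    # Placeholder URLs; the real lists are in [1] and [2].
    plans = build_plan(
        large=["https://example.org/large/repo.git"],
        normal=["https://example.org/a.git", "https://example.org/b.git"],
    )
    for plan in plans:
        print(plan.queue, "concurrency:", plan.concurrency)
        for message in to_messages(plan):
            print("  ", message["kwargs"]["url"])
```

Keeping the queue name and the concurrency in one record makes it harder to schedule a large repository on the queue that runs five loaders in parallel.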
[1] big:
[2] normal:
[3] T4283
[4] Recent attempts on the chromium and liferay-portal repositories failed, possibly
because the standard consumption was happening in parallel. If large repositories are
consumed at the same time, the machine might be unable to finish either of them.
[5] https://grafana.softwareheritage.org/goto/6HwEWEgVk?orgId=1
[6] https://sentry.softwareheritage.org/share/issue/bbcb3aef5b974dac9a3194f7bf8ede87/