The gist of it has been battle-tested through the staging infra.
Deploy it to production.
Note:
sourceforge admins mentioned that we should not exceed 8 ingestions in parallel.
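To make that limit concrete, here is a minimal sketch of how such a cap could be expressed for a dedicated Celery worker; the application name, broker URL and queue name below are placeholders for illustration, not the actual production configuration:

```python
from celery import Celery

# Placeholder app name and broker URL, for illustration only.
app = Celery("swh", broker="amqp://broker.example.org//")

# Cap this worker at 8 concurrent tasks, per the limit requested by the
# sourceforge admins. The same limit can also be passed on the command line:
#   celery -A <worker_app> worker -Q <sourceforge_queue> --concurrency=8
app.conf.worker_concurrency = 8
```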
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T3315 archive SourceForge |
| Migrated | gitlab-migration | T735 SourceForge lister |
| Migrated | gitlab-migration | T3350 Deploy sourceforge lister in production |
Currently the next-gen scheduler does not allow limiting the number of tasks per forge.
So a way forward would be to allow the listing but prevent the origins from getting
scheduled for ingestion by the scheduler cogs currently running. Then, trigger the
ingestion "manually" [1] with dedicated worker(s) which would consume specifically from
sourceforge while respecting the limits set in the description (a rough sketch of such
a script follows the footnotes below).
[1] i.e. some script reading from the scheduler and sending the git, svn and hg origins
for ingestion to a queue that those worker(s) would consume from [2]
[2] bzr origins we don't support yet.
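A rough sketch of what such a script could look like, assuming the origins are read straight from the scheduler database and pushed to a dedicated Celery queue. The broker/database connection strings, the selection query and the `url` keyword argument of the loader task are assumptions for illustration, not the production code; the task and queue names are the ones used further down:

```python
import psycopg2
from celery import Celery

# Placeholder broker and database connection strings, for illustration.
app = Celery("swh", broker="amqp://broker.example.org//")
db = psycopg2.connect("service=swh-scheduler")

# Grab a batch of sourceforge git origins from the scheduler db.
with db.cursor() as cur:
    cur.execute(
        """
        SELECT lo.url
        FROM listed_origins lo
        JOIN listers l ON l.id = lo.lister_id
        WHERE l.name = 'sourceforge'
          AND lo.visit_type = 'git'
        LIMIT %s
        """,
        (1000,),
    )
    urls = [row[0] for row in cur]

# Send each origin to a dedicated queue that only the sourceforge worker(s)
# consume from; the loader task is assumed to take the origin URL as `url`.
for url in urls:
    app.send_task(
        "swh.loader.git.tasks.UpdateGitRepository",
        kwargs={"url": url},
        queue="oneshot:swh.loader.git.tasks.UpdateGitRepository",
    )
```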
Another idea would be to add the SourceForge origins with enabled=false so they're not picked up by the scheduler, until we've done the first pass on them. This avoids needing to change the scheduler at all.
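If that route were taken, the flag flip itself would be a one-off update on the scheduler db. A minimal sketch, assuming the listed_origins table carries the enabled flag that the --without-enabled option used further down relies on (connection string and lister UUID are placeholders):

```python
import psycopg2

# Placeholder lister UUID and connection string; sketch only.
SOURCEFORGE_LISTER_ID = "<sourceforge-lister-uuid>"

with psycopg2.connect("service=swh-scheduler") as db, db.cursor() as cur:
    # Disable all sourceforge listed origins so the regular scheduler
    # runners skip them until the first pass is done.
    cur.execute(
        "update listed_origins set enabled = false where lister_id = %s",
        (SOURCEFORGE_LISTER_ID,),
    )
```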
A dedicated worker17 node got provisioned to make the first run on the sourceforge
origins (svn and git for now). What remains is some code to actually schedule the origins
we are interested in, and some plumbing to consume those messages while respecting the
concurrency defined in the description.
As for the actual listing, its first pass is done [1] [2]:
[1] scheduler info:
softwareheritage-scheduler=> select count(*) from listed_origins lo
    inner join listers l on l.id=lo.lister_id where l.name='sourceforge';
 count
--------
 338281
(1 row)
[2] worker logs:
Jun 02 07:15:45 worker11 python3[1407482]: [2021-06-02 07:15:45,705: INFO/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[e29c07ff-b01f-4739-a820-1d326e76ad63] succeeded in 83421.67006923165s: {'pages': 258898, 'origins': 338327}
Note: there is a small discrepancy in the numbers (46 fewer in the db), but the order of
magnitude is roughly the same, so I guess it's fine.
A small issue was found; @Alphare fixed it ^.
A notification got sent to the sourceforge people about the ingestion starting soon.
In the meantime, the existing dataset from the first listing got adapted according to the fix (staging & prod scheduler dbs) [1] [2],
and the fix got deployed (staging/prod workers restarted).
Now on to actually deploying the dedicated loader and adapting some code to schedule the origins correctly for the proper ingestion scheme.
[1] staging
swh-scheduler=> update listed_origins set url='https://' || url
    where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f';
UPDATE 338223
swh-scheduler=> select url from listed_origins lo
    inner join listers l on l.id=lo.lister_id
    where l.name='sourceforge' limit 10;
                          url
------------------------------------------------------
 https://git.code.sf.net/p/new-1/code
 https://git.code.sf.net/p/root2raj-test/code
 https://git.code.sf.net/p/kernel-whyred/code
 https://git.code.sf.net/p/youtuber/code
 https://git.code.sf.net/p/surnubs/code
 https://git.code.sf.net/p/podcastliam/code
 https://git.code.sf.net/p/centos-repos/code
 https://git.code.sf.net/p/psnidck/git
 https://hg.code.sf.net/p/psnidck/mercurial
 https://git.code.sf.net/p/library-software-free/code
(10 rows)
swh-scheduler=> commit;
COMMIT
[2] prod
softwareheritage-scheduler=> update listed_origins set url='https://' || url
    where lister_id='b678cfc3-2780-4186-9186-d78a14bd4958';
UPDATE 338281
softwareheritage-scheduler=> select url from listed_origins lo
    inner join listers l on l.id=lo.lister_id
    where l.name='sourceforge' limit 10;
                      url
-------------------------------------------------
 https://bzr.code.sf.net/p/abandonedlands/code
 https://bzr.code.sf.net/p/adchppgui/code
 https://bzr.code.sf.net/p/admos/bazaar
 https://bzr.code.sf.net/p/afros-update/bazaar
 https://bzr.code.sf.net/p/alternityshadow/code
 https://bzr.code.sf.net/p/amyunix2/bazaar
 https://bzr.code.sf.net/p/anamnesis/code
 https://bzr.code.sf.net/p/anubisstegano/code
 https://bzr.code.sf.net/p/apreta/code
 https://bzr.code.sf.net/p/arabicontology/bazaar
(10 rows)
This started in worker17 with the content of the diff ^:
(ve) ardumont@worker17:~/swh-scheduler$ export SWH_CONFIG_FILENAME=/etc/softwareheritage/scheduler/listener-runner.yml
(ve) ardumont@worker17:~/swh-scheduler$ interval=600; while true; do
>   swh scheduler -C $SWH_CONFIG_FILENAME \
>     origin send-to-celery \
>     --policy never_visited_oldest_update_first \
>     --without-enabled \
>     --lister-uuid 'b678cfc3-2780-4186-9186-d78a14bd4958' \
>     --queue oneshot:swh.loader.git.tasks.UpdateGitRepository \
>     git
>   sleep $interval; done
10000 slots available in celery queue
10000 visits to send to celery
150 slots available in celery queue
150 visits to send to celery
...
The ingestion is happening on the same worker with a concurrency of 4.
Some of those new origins can be seen in the archive already [1]
Heads up: the hg origins' urls got switched from https:// to http:// in the scheduler dbs [1].
[1] staging (roughly the same went for production)
swh-scheduler=> update listed_origins set url=replace(url, 'https://', 'http://')
    where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f'
    and url like 'https://%' and visit_type='hg';
UPDATE 27489
Time: 994.239 ms
swh-scheduler=> select count(*) from listed_origins lo
    inner join listers l on l.id=lo.lister_id
    where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'https://%';
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)
Time: 81.923 ms
swh-scheduler=> select count(*) from listed_origins lo
    inner join listers l on l.id=lo.lister_id
    where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'http://%';
+-------+
| count |
+-------+
| 27489 |
+-------+
(1 row)
Time: 69.252 ms
Current status: counting only git and svn origins, 26.8% [1] got done in ~24h
(somewhat... [2]).
The details per type are in [1] as well.
It's the same worker which ingests both. (I'll check for an ETA on Monday if it's not
done already; a naive back-of-the-envelope estimate from these numbers is sketched after
the footnotes.)
[1] out of [3]
(* 100 (/ (+ 2020.0 73540.0) (+ 101624.0 180319.0)))  ;; 26.79974321050709
(* 100 (/ 73540.0 180319.0))                          ;; 40.7832785230619
(* 100 (/ 2020.0 101624.0))                           ;; 1.9877194363536175
[2] The ingestion only started with a concurrency of 4 and only for the git origins. The
next day, the concurrency got bumped to 6 and the svn origins got thrown into the mix...
[3]
softwareheritage-scheduler=> select visit_type, count(*) from listed_origins lo
    inner join listers l on l.id=lo.lister_id
    where l.name='sourceforge' group by visit_type;
+------------+--------+
| visit_type | count  |
+------------+--------+
| svn        | 101624 |
| hg         |  27497 |
| git        | 180319 |
| cvs        |  28622 |
| bzr        |    290 |
+------------+--------+
(5 rows)
Time: 7032.441 ms (00:07.032)

softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:13:28.030285+00 | 73540 |
+-------------------------------+-------+
(1 row)
Time: 83736.834 ms (01:23.737)

softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:15:13.416308+00 |  2020 |
+-------------------------------+-------+
(1 row)
Time: 12233.012 ms (00:12.233)
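For what it's worth, here is that naive back-of-the-envelope estimate, computed only from the numbers above and assuming the rate of the first ~24h stays constant (which it will not, since the concurrency and the mix of visit types changed along the way):

```python
# Figures from the queries above (2021-06-04).
done = 2020 + 73540        # svn + git origins already in the archive
total = 101624 + 180319    # svn + git origins listed for sourceforge

rate_per_day = done        # roughly one day of ingestion so far
remaining = total - done

print(f"progress : {100 * done / total:.1f} %")           # ~26.8 %
print(f"naive ETA: {remaining / rate_per_day:.1f} days")  # ~2.7 days
```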
Still running; both svn and git origins are being ingested regularly.
We are up to 96k origins done now (out of ~280k svn and git origins) [1].
worker17 got reworked a bit to use tmpfs [2], along with a RAM increase (from 32 GB to
64 GB).
[1]
softwareheritage=> select now(), count(*) from origin where url like 'https://%.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 07:33:34.084783+00 | 96794 |
+-------------------------------+-------+
(1 row)
Time: 64354.297 ms (01:04.354)

softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:12:22.683031+00 | 15274 |
+-------------------------------+-------+
(1 row)
Time: 68655.542 ms (01:08.656)

softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:11:46.340185+00 | 81522 |
+-------------------------------+-------+
(1 row)
Time: 105006.727 ms (01:45.007)
[2] The disk io pattern is quite aggressive due to the svn loader implementation. That change was well received by the machine ;) as can be seen in the monitoring graphs.
It's deployed and the ingestion is ongoing.
Monitoring of the ingestion will be moved to a dedicated task [1]
Closing this now.
[1] T3374