
Deploy sourceforge lister in production
Closed, Resolved (Public)

Description

The gist of it has been battle tested through the staging infra.
Deploy it to production.

Note:
sourceforge admins mentioned that we should not exceed 8 ingestions in parallel

Event Timeline

ardumont created this task.

Currently the next-gen scheduler does not allow limiting the number of tasks per forge.
So a way forward would be to allow the listing but prevent the running scheduler cogs
from scheduling those origins for ingestion. Then, trigger the ingestion "manually" [1]
with dedicated worker(s) which would consume specifically from sourceforge while
respecting the limit set in the description.

[1] well, some script reading from the scheduler and sending the git, svn, hg origins for
ingestion to a queue that the worker would consume from [2]

[2] bzr origins are not supported yet.

Another idea would be to add the SourceForge origins with enabled=false so they're not picked up by the scheduler, until we've done the first pass on them. This avoids needing to change the scheduler at all.
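
To illustrate the limit from the description, here is a minimal sketch of running ingestions with a hard cap of 8 in parallel. This is not how the swh workers do it (there the cap comes from the worker concurrency setting); `ingest_all`, `PeakTracker` and the `example-N` project URLs are hypothetical names made up for the illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8  # limit requested by the sourceforge admins (see description)


def ingest_all(origins, load_origin, max_parallel=MAX_PARALLEL):
    """Call load_origin(url) for every origin, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps the input order in its results
        return list(pool.map(load_origin, origins))


class PeakTracker:
    """Hypothetical stand-in for a loader; records the peak concurrency seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def __call__(self, url):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        time.sleep(0.01)  # simulate some loading work
        with self._lock:
            self.running -= 1
        return url


if __name__ == "__main__":
    tracker = PeakTracker()
    # Made-up origin URLs, for illustration only
    urls = [f"https://git.code.sf.net/p/example-{i}/code" for i in range(40)]
    done = ingest_all(urls, tracker)
    print(len(done), "origins processed, peak concurrency:", tracker.peak)
```

The pool size is the only knob here, so the cap holds however many origins are queued.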

A dedicated worker17 node got provisioned to make the first run on the sourceforge
origins (svn and git for now). What remains is some code to actually schedule the origins
we are interested in, and some plumbing to actually consume those messages while
respecting the concurrency defined in the description.

As for the actual listing, its first pass is done [1] [2]:

[1] scheduler info:

softwareheritage-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge';
 count
--------
 338281
(1 row)

[2] worker logs:

Jun 02 07:15:45 worker11 python3[1407482]: [2021-06-02 07:15:45,705: INFO/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[e29c07ff-b01f-4739-a820-1d326e76ad63] succeeded in 83421.67006923165s: {'pages': 258898, 'origins': 338327}

Note: there is a small discrepancy (46 fewer origins in the db than reported by the
lister) but the order of magnitude is the same so I guess it's fine.
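
For reference, the size of that gap can be checked with a couple of lines. This is only a sketch of the arithmetic using the two counts quoted above; `discrepancy` is a hypothetical helper name.

```python
def discrepancy(reported, stored):
    """Return (missing, missing as a percentage of the reported count)."""
    missing = reported - stored
    return missing, 100.0 * missing / reported


# Counts taken from the worker log and the scheduler db query above
missing, pct = discrepancy(reported=338327, stored=338281)
print(f"{missing} origins missing ({pct:.4f}% of the lister's count)")
```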

A small issue was found, @Alphare fixed it ^.
Notification to the sourceforge people about the ingestion starting soon got sent.

In the meantime, the existing dataset from the first listing got adapted according to the fix (staging & prod scheduler db) [1] [2].
And the fix got deployed (staging/prod workers restarted).

Now on to actually deploying the dedicated loader and adapting some code to schedule the origins correctly for the proper ingestion scheme.

[1] staging

swh-scheduler=> update listed_origins set url='https://' || url where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f';

UPDATE 338223
swh-scheduler=>
swh-scheduler=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' limit 10;
                         url
------------------------------------------------------
 https://git.code.sf.net/p/new-1/code
 https://git.code.sf.net/p/root2raj-test/code
 https://git.code.sf.net/p/kernel-whyred/code
 https://git.code.sf.net/p/youtuber/code
 https://git.code.sf.net/p/surnubs/code
 https://git.code.sf.net/p/podcastliam/code
 https://git.code.sf.net/p/centos-repos/code
 https://git.code.sf.net/p/psnidck/git
 https://hg.code.sf.net/p/psnidck/mercurial
 https://git.code.sf.net/p/library-software-free/code
(10 rows)

swh-scheduler=> commit;
COMMIT

[2] prod

softwareheritage-scheduler=> update listed_origins set url='https://' || url where lister_id='b678cfc3-2780-4186-9186-d78a14bd4958';
UPDATE 338281
softwareheritage-scheduler=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' limit 10;
                       url
-------------------------------------------------
 https://bzr.code.sf.net/p/abandonedlands/code
 https://bzr.code.sf.net/p/adchppgui/code
 https://bzr.code.sf.net/p/admos/bazaar
 https://bzr.code.sf.net/p/afros-update/bazaar
 https://bzr.code.sf.net/p/alternityshadow/code
 https://bzr.code.sf.net/p/amyunix2/bazaar
 https://bzr.code.sf.net/p/anamnesis/code
 https://bzr.code.sf.net/p/anubisstegano/code
 https://bzr.code.sf.net/p/apreta/code
 https://bzr.code.sf.net/p/arabicontology/bazaar
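
One thing to keep in mind with the `'https://' || url` updates above: run a second time, they would prefix the scheme again. As a hedged sketch (not the actual lister fix), the same normalization can be made safe to re-run; `ensure_https` is a hypothetical helper name.

```python
def ensure_https(url: str) -> str:
    """Prefix url with https:// unless it already carries an http(s) scheme."""
    if url.startswith(("http://", "https://")):
        return url
    return "https://" + url


# A scheme-less URL, as stored before the fix
print(ensure_https("git.code.sf.net/p/new-1/code"))
# Applying it again leaves the URL unchanged
print(ensure_https(ensure_https("git.code.sf.net/p/new-1/code")))
```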

This started in worker17 with the content of the diff ^:

(ve) ardumont@worker17:~/swh-scheduler$ export SWH_CONFIG_FILENAME=/etc/softwareheritage/scheduler/listener-runner.yml
(ve) ardumont@worker17:~/swh-scheduler$ interval=600; while true; do
>   swh scheduler -C $SWH_CONFIG_FILENAME  \
>     origin send-to-celery \
>       --policy never_visited_oldest_update_first  \
>       --without-enabled \
>       --lister-uuid 'b678cfc3-2780-4186-9186-d78a14bd4958' \
>       --queue oneshot:swh.loader.git.tasks.UpdateGitRepository \
>       git
>   sleep $interval; done
10000 slots available in celery queue
10000 visits to send to celery
150 slots available in celery queue
150 visits to send to celery
...

The ingestion is happening on the same worker with a concurrency of 4.
Some of those new origins can be seen in the archive already [1]

[1] https://archive.softwareheritage.org/browse/search/?q=https%3A%2F%2Fgit.code.sf&with_visit=true&with_content=true
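
The "slots available" / "visits to send" lines printed by the loop above suggest that each pass only tops up the celery queue. A minimal sketch of that logic follows, under assumptions: the cap of 10000 is inferred from the first log line, the queue length of 9850 is made up to reproduce the second one, and `visits_to_send` is a hypothetical name, not the swh-scheduler function.

```python
def visits_to_send(queue_max, queue_length, pending):
    """Return (free slots, number of visits to send) for one pass of the loop."""
    slots = max(0, queue_max - queue_length)
    return slots, min(slots, pending)


# First pass: empty queue, far more pending origins than slots
print(visits_to_send(queue_max=10000, queue_length=0, pending=180319))
# Later pass: the queue is still mostly full (9850 is an assumed value)
print(visits_to_send(queue_max=10000, queue_length=9850, pending=170319))
```

Sending only what fits keeps the queue bounded, while the worker concurrency bounds how many loads hit sourceforge at once.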

ardumont changed the task status from Open to Work in Progress. (Jun 3 2021, 6:18 PM)
ardumont moved this task from Backlog to in-progress on the System administration board.

Heads up:

  • Concurrency bumped to 6 (for the loader).
  • Migration of the mercurial origins dataset from https to http (scheduler prod/staging in progress) [1]
  • Incremental lister deployed

[1] staging (roughly the same went for production)

swh-scheduler=> update listed_origins set url=replace(url, 'https://', 'http://') where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f' and url like 'https://%' and visit_type='hg';
UPDATE 27489
Time: 994.239 ms

swh-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'https://%';
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

Time: 81.923 ms
swh-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'http://%';
+-------+
| count |
+-------+
| 27489 |
+-------+
(1 row)

Time: 69.252 ms
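
For illustration, the same hg-only rewrite expressed outside SQL. This is a sketch: it only touches the leading scheme, where the SQL `replace()` relies on the `like 'https://%'` guard and on the prefix not appearing again later in the url. `hg_url_to_http` is a hypothetical helper name.

```python
HTTPS = "https://"


def hg_url_to_http(url: str, visit_type: str) -> str:
    """Downgrade the scheme to http for hg origins only; leave others as-is."""
    if visit_type == "hg" and url.startswith(HTTPS):
        return "http://" + url[len(HTTPS):]
    return url


print(hg_url_to_http("https://hg.code.sf.net/p/psnidck/mercurial", "hg"))
print(hg_url_to_http("https://git.code.sf.net/p/psnidck/git", "git"))
```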

Current status, counting only git and svn origins, 26.8% [1] got done in ~24h
(somewhat... [2]).

As for the details per type [1]:

  • 41% done for git (~24h)
  • 2% done for svn (in ~4h, somewhat [2])

It's the same worker which ingests both. (I'll check for an ETA on monday if it's not
done already)

[1] out of [3]

(* 100 (/ (+ 2020.0 73540.0) (+ 101624.0 180319.0)))  ;; 26.79974321050709
(* 100 (/ 73540.0 180319.0))                          ;; 40.7832785230619
(* 100 (/ 2020.0 101624.0))                           ;; 1.9877194363536175

[2] The ingestion initially started with a concurrency of 4 and only for the git origins.
The next day, concurrency got bumped to 6 and the svn origins got thrown in the mix...

[3]

softwareheritage-scheduler=> select visit_type, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' group by visit_type;
+------------+--------+
| visit_type | count  |
+------------+--------+
| svn        | 101624 |
| hg         |  27497 |
| git        | 180319 |
| cvs        |  28622 |
| bzr        |    290 |
+------------+--------+
(5 rows)

Time: 7032.441 ms (00:07.032)

softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:13:28.030285+00 | 73540 |
+-------------------------------+-------+
(1 row)

Time: 83736.834 ms (01:23.737)
softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:15:13.416308+00 |  2020 |
+-------------------------------+-------+
(1 row)

Time: 12233.012 ms (00:12.233)
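
Regarding the ETA mentioned above, a naive linear extrapolation from the git counts could look like the sketch below. It assumes the rate stays constant, which is doubtful here since the concurrency changed and svn origins now share the worker, so treat the result as a rough order of magnitude only. `naive_eta_hours` is a hypothetical helper name.

```python
def naive_eta_hours(done, total, elapsed_hours):
    """Hours left if the average rate observed so far stays constant."""
    if done <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress to extrapolate from")
    rate = done / elapsed_hours  # origins per hour so far
    return (total - done) / rate


# git numbers from [3]: 73540 origins done out of 180319 in ~24h
git_eta = naive_eta_hours(done=73540, total=180319, elapsed_hours=24)
print(f"git: ~{git_eta:.0f}h remaining at the observed average rate")
```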

Still running, both svn and git origins are ingested regularly.

We are at 96k origins done now (out of ~280k for svn and git combined) [1].

The worker17 got reworked a bit to use tmpfs [2], with an increase in ram (from 32 to
64g).

[1]

softwareheritage=> select now(), count(*) from origin where url like 'https://%.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 07:33:34.084783+00 | 96794 |
+-------------------------------+-------+
(1 row)

Time: 64354.297 ms (01:04.354)

softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:12:22.683031+00 | 15274 |
+-------------------------------+-------+
(1 row)

Time: 68655.542 ms (01:08.656)

softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:11:46.340185+00 | 81522 |
+-------------------------------+-------+
(1 row)

Time: 105006.727 ms (01:45.007)

[2] The disk io pattern is quite aggressive due to the svn loader implementation. That change was well received by the machine ;) as can be seen in the following graph [3].

[3] https://grafana.softwareheritage.org/goto/VAhvV8eMz

It's deployed and the ingestion is ongoing.
Monitoring of the ingestion will be moved to a dedicated task [1]
Closing this now.

[1] T3374