The gitea lister should be configured on the staging environment and tested with a task to list the codeberg.org forge.
(!) The task limit has to be increased to improve the listing speed (T2313#47504)
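As a rough, hedged illustration of why the limit matters (the repository count below is the size of codeberg.org observed later in this task, not a benchmark), the per-page limit directly drives how many API requests a full listing needs:

```python
import math

# Back-of-the-envelope sketch: number of paginated API requests needed to
# list a forge of ~3500 repositories, as a function of the 'limit' parameter.
total_repos = 3508  # size of codeberg.org observed during this task

for limit in (10, 50, 100):
    n_requests = math.ceil(total_repos / limit)
    print(f"limit={limit:>3} -> {n_requests} requests")
# limit= 10 -> 351 requests
# limit= 50 -> 71 requests
# limit=100 -> 36 requests
```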
Related revisions:

rDLS Listers
  D3903 | rDLS31efda62e7a2 | gitea.lister: Fix uid to be unique across instance
  D3899 | rDLSe3c856b5eef5 | utils.split_range: Split into not overlapping ranges
  D3897 | rDLS66a61f3dd234 | gitea.tasks: Fix parameter name from 'sort' to 'order'
rSPSITE puppet-swh-site
  D3896 | rSPSITEb51b0ceb29b0 | lister configuration: add gitea lister tasks
Status   | Assigned         | Task
---------|------------------|-----------------------------------------------
Migrated | gitlab-migration | T2313 Archive git.fsfe.org (Gitea)
Migrated | gitlab-migration | T2577 Test gitea lister on staging environment
```
swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.gitea
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-full in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-incremental in scheduler
```
```
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1263805
  Next run: just now (2020-09-09 07:25:40+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitea-full
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
```
```
swh-scheduler=# select * from task where type like '%gitea%';
-[ RECORD 1 ]----+------------------------------------------------------------------------------
id               | 1263805
type             | list-gitea-full
arguments        | {"args": [], "kwargs": {"url": "https://codeberg.org/api/v1/", "limit": 100}}
next_run         | 2020-09-09 07:25:40.025668+00
current_interval | 90 days
status           | next_run_scheduled
policy           | oneshot
retries_left     | 0
priority         |
```
I'm just waiting for the validation of D3896 to activate the tasks.
For info, on my desktop in the docker environment, with a limit of 100, the lister takes ~3.7s to list the complete codeberg forge:
```
swh-lister_1 | [2020-09-08 18:33:19,259: INFO/ForkPoolWorker-1] Task swh.lister.gitea.tasks.RangeGiteaLister[363e0b30-b13a-4f62-bd31-9847dfe62450] succeeded in 3.7196799100056523s: {'status': 'eventful'}
```
3508 repositories were detected.
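For context, a minimal sketch of what such a paginated listing boils down to against the Gitea API. This is an illustration using the public `/repos/search` endpoint, not the actual swh.lister implementation:

```python
import requests

# Sketch of paginated repository listing against a Gitea instance.
# 'url' and 'limit' mirror the task parameters above; /repos/search and its
# {'ok': ..., 'data': [...]} response envelope are part of the Gitea API.
def list_repos(url="https://codeberg.org/api/v1", limit=100):
    page = 1
    while True:
        r = requests.get(f"{url}/repos/search",
                         params={"page": page, "limit": limit})
        r.raise_for_status()
        repos = r.json().get("data", [])
        if not repos:
            break  # empty page: end of the forge reached
        for repo in repos:
            yield repo["clone_url"]
        page += 1

if __name__ == "__main__":
    print(sum(1 for _ in list_repos()))  # ~3508 at the time of this task
```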
The configuration is deployed and the listers were restarted.
The initial listing failed due to a concurrency problem. The problem is logged in sentry here: https://sentry.softwareheritage.org/share/issue/aec9c2af347e47ea84f51ace3bfe2f25/
It looks similar to T2070.
I have tested creating a list-gitea-incremental task but it fails too, this time with another exception, related to an unexpected "sort" parameter: https://sentry.softwareheritage.org/share/issue/b0119b56f24347bcb58ac28c68685c62/
```
swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1267302
  Next run: just now (2020-09-09 09:40:12+00:00)
  Interval: 1 day, 0:00:00
  Type: list-gitea-incremental
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
```
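The fix landed as D3897 (gitea.tasks: Fix parameter name from 'sort' to 'order'). The failure mode is an ordinary keyword-argument mismatch; here is a hypothetical reduction of it (the signature below is illustrative, not the actual lister code):

```python
# Hypothetical reduction of the D3897 bug: the task forwarded a 'sort'
# keyword to a callable whose parameter is actually named 'order'.
def run_incremental_lister(url, limit=100, order="desc"):  # assumed signature
    print(f"listing {url} (limit={limit}, order={order})")

run_incremental_lister("https://codeberg.org/api/v1/", limit=100, sort="desc")
# TypeError: run_incremental_lister() got an unexpected keyword argument 'sort'
```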
The concurrency issue was reproduced locally in the docker environment with a concurrency of 5.
It seems the pages are listed several times during the job execution:
```
swh-lister_1 | [2020-09-09 14:04:05,742: INFO/ForkPoolWorker-4] listing repos starting at 10
swh-lister_1 | [2020-09-09 14:04:06,052: INFO/ForkPoolWorker-4] listing repos starting at 11
swh-lister_1 | [2020-09-09 14:04:13,819: INFO/ForkPoolWorker-3] listing repos starting at 10
...
swh-lister_1 | [2020-09-09 14:04:05,621: INFO/ForkPoolWorker-1] listing repos starting at 30
swh-lister_1 | [2020-09-09 14:04:05,970: INFO/ForkPoolWorker-1] listing repos starting at 31
swh-lister_1 | [2020-09-09 14:04:10,282: INFO/ForkPoolWorker-2] listing repos starting at 30
swh-lister_1 | [2020-09-09 14:04:10,949: ERROR/ForkPoolWorker-2] Task swh.lister.gitea.tasks.RangeGiteaLister[f25fb95c-fbf3-4ee6-9072-f4029d2d04c1] raised unexpected: IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "gitea_repo_pkey"\nDETAIL:  Key (uid)=(3567) already exists.\n')
```
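This is consistent with the range split handing overlapping page ranges to different workers, which D3899 (utils.split_range: Split into not overlapping ranges) addresses. A minimal sketch of a non-overlapping split over half-open ranges, assuming that semantics (the actual swh.lister helper may differ):

```python
# Sketch: split [start, end) into roughly nb_chunks non-overlapping
# sub-ranges, so that concurrent workers never list the same page twice.
def split_range(start, end, nb_chunks):
    step = max(1, (end - start) // nb_chunks)
    bounds = list(range(start, end, step)) + [end]
    # Each chunk starts exactly where the previous one ended: no overlap.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:])]

print(split_range(0, 36, 5))
# [(0, 7), (7, 14), (14, 21), (21, 28), (28, 35), (35, 36)]
```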
The test of the new version v0.1.4, which includes the range split fix, the uid change and the incremental task fix, is ok.
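Regarding the uid change (D3903, gitea.lister: Fix uid to be unique across instance): since `gitea_repo.uid` is the table's primary key, two Gitea instances exposing the same numeric repository id would collide. A hedged sketch of the idea (the exact uid format used by swh.lister may differ):

```python
# Sketch: a bare repository id ("3567") collides across Gitea instances,
# because gitea_repo.uid is the primary key. Prefixing the uid with the
# instance name makes it globally unique. The format here is assumed.
def make_uid(instance: str, repo_id: int) -> str:
    return f"{instance}/{repo_id}"

print(make_uid("codeberg.org", 3567))  # codeberg.org/3567
print(make_uid("git.fsfe.org", 3567))  # git.fsfe.org/3567, no collision
```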
Deployment:
```
swh-lister=# drop table gitea_repo;
DROP TABLE
```
```
root@pergamon:~# clush -b -w @staging-workers 'apt-get update; apt install -y python3-swh.lister'
...
root@pergamon:~# clush -b -w @staging-workers 'dpkg -l python3-swh.lister'
---------------
worker[0-2].internal.staging.swh.network (3)
---------------
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version              Architecture Description
+++-==================-====================-============-=================================================================
ii  python3-swh.lister 0.1.4-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)

# restart
root@pergamon:~# clush -w @swh-workers -b systemctl restart swh-worker@loader_svn
```
```
root@scheduler0:~# apt-get update && apt install python3-swh.lister
...
Unpacking python3-swh.lister (0.1.4-1~swh1~bpo10+1) over (0.1.2-1~swh1~bpo10+1) ...
Setting up python3-swh.lister (0.1.4-1~swh1~bpo10+1)

swhscheduler@scheduler0:~$ swh lister --db-url postgresql://swh-lister:*****@db0.internal.staging.swh.network:5432/swh-lister db-init
```
```
swh-lister=# \d gitea_repo
                         Table "public.gitea_repo"
   Column    |            Type             | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
 name        | character varying           |           |          |
 full_name   | character varying           |           |          |
 html_url    | character varying           |           |          |
 origin_url  | character varying           |           |          |
 origin_type | character varying           |           |          |
 last_seen   | timestamp without time zone |           | not null |
 task_id     | integer                     |           |          |
 uid         | character varying           |           | not null |
 instance    | character varying           |           |          |
Indexes:
    "gitea_repo_pkey" PRIMARY KEY, btree (uid)
    "ix_gitea_repo_full_name" btree (full_name)
    "ix_gitea_repo_instance" btree (instance)
    "ix_gitea_repo_name" btree (name)
```
```
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
```
```
swh-lister=# select count(*) from gitea_repo;
 count
-------
  3506
(1 row)
```
```
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
```
```
Sep 10 10:22:16 worker1 python3[273967]: [2020-09-10 10:22:16,897: INFO/MainProcess] Received task: swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584]
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,086: INFO/ForkPoolWorker-4] listing repos starting at 1
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,315: INFO/ForkPoolWorker-4] Repositories already seen, stopping
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,320: INFO/ForkPoolWorker-4] Task swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584] succeeded in 0.4189604769926518s: {'status': 'uneventful'}
```
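The "Repositories already seen, stopping" line shows the incremental strategy at work: walk repositories newest-first and stop at the first page that contains nothing new. A minimal sketch of that logic (illustrative only, not the actual swh.lister code; `fetch_page` is a hypothetical helper):

```python
# Sketch of incremental listing: read pages of repo uids sorted newest
# first and stop as soon as a page holds no unseen repository.
def incremental_list(fetch_page, known_uids):
    """fetch_page(page) -> list of repo uids, newest first (assumed)."""
    new = []
    page = 1
    while True:
        fresh = [u for u in fetch_page(page) if u not in known_uids]
        if not fresh:
            print("Repositories already seen, stopping")
            return new
        new.extend(fresh)
        page += 1

# Toy usage: everything on page 1 is already known -> 'uneventful' run.
pages = {1: ["codeberg.org/3508", "codeberg.org/3507"]}
print(incremental_list(lambda p: pages.get(p, []), known_uids=set(pages[1])))
# Repositories already seen, stopping
# []
```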
Reopened to validate the complete process, from the listing to the loading of some repositories.
There are now 962 repos imported into the staging archive:
```
swh=# select count(*) from origin where url like 'https://codeberg%';
 count
-------
   962
(1 row)
```
This is an extract of some URLs:
```
swh=# select * from origin where url like 'https://codeberg%' limit 10;
    id     |                          url
-----------+-------------------------------------------------------
  91370242 | https://codeberg.org/Freeyourgadget/Gadgetbridge
  91370243 | https://codeberg.org/steko/harris-matrix-data-package
  91370244 | https://codeberg.org/Codeberg/build-deploy-gitea.git
  91370245 | https://codeberg.org/Codeberg/gitea.git
  91370246 | https://codeberg.org/Booteille/Phantom
  91370247 | https://codeberg.org/Booteille/Nitterify
  91370248 | https://codeberg.org/booteille/invidition/
  91370349 | https://codeberg.org/infosechandbook/blog-content.git
  91382717 | https://codeberg.org/hayden/howl
 117690007 | https://codeberg.org/niklas-fischer/schlaues-buch.git
```
An email was sent on the swh-devel mailing list to ask for reviews.
The deployment in production will be performed in the middle of week 38 if no problems are raised.
I've had mixed results with the testing. The codeberg/gitea stuff looks fine, but a lot of the origins I've tested from the search results of the first query suggested by @ardumont in that email are 404 for me. Some (random, non-exhaustive) examples:
Note that they all appear as "status: archived" in the search results.
That doesn't seem good/normal, does it?
> That doesn't seem good/normal, does it?
Indeed, that's T2584 we opened last week, fixed by D3934.
We'll check with @anlambert so it lands soon and gets deployed.
To continue the review without it right now, one should either replace %20 by %2B, which is not so great, or find origins without + in them, which is also not great... ¯\_(ツ)_/¯
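For background on why + is a problem here: in a query string, + is historically decoded as a space, so an origin URL containing + only survives a round-trip if it is percent-encoded as %2B. A quick illustration in Python (the origin URL is a made-up example):

```python
from urllib.parse import parse_qs, quote

# A made-up origin URL containing '+':
origin = "https://codeberg.org/someuser/some+repo"

# In a query string, '+' decodes to a space, mangling the origin...
print(parse_qs("q=some+repo"))  # {'q': ['some repo']}

# ...unless '+' is percent-encoded as %2B beforehand.
print(quote(origin, safe=":/"))
# https://codeberg.org/someuser/some%2Brepo
```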
Cheers,
> We'll check with @anlambert so it lands soon and gets deployed.
Done
(we had a bit of a fight with the Debian package, hence the delay)