
Test gitea lister on staging environment
Closed, Migrated

Description

The gitea lister should be configured on the staging environment and tested with a task to list the codeberg.org forge.

(!) The task limit has to be increased to improve the listing speed (T2313#47504)

Event Timeline

vsellier changed the task status from Open to Work in Progress. Sep 8 2020, 4:34 PM
vsellier claimed this task.
vsellier created this task.
ardumont triaged this task as Normal priority. Sep 8 2020, 4:58 PM
ardumont added a project: Lister.
ardumont updated the task description.
  • The task types are registered:
swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.gitea
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-full in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-incremental in scheduler
  • The data model doesn't need to be created because it was already done in T2358
  • The task is created:
swhscheduler@scheduler0:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1263805
  Next run: just now (2020-09-09 07:25:40+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitea-full
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
swh-scheduler=# select * from task where type like '%gitea%';
-[ RECORD 1 ]----+------------------------------------------------------------------------------
id               | 1263805
type             | list-gitea-full
arguments        | {"args": [], "kwargs": {"url": "https://codeberg.org/api/v1/", "limit": 100}}
next_run         | 2020-09-09 07:25:40.025668+00
current_interval | 90 days
status           | next_run_scheduled
policy           | oneshot
retries_left     | 0
priority         |

I'm just waiting for the validation of D3896 to activate the tasks.

For info, on my desktop with the docker environment, with a limit of 100, the lister takes ~3.7s to list the complete codeberg forge:

swh-lister_1                    | [2020-09-08 18:33:19,259: INFO/ForkPoolWorker-1] Task swh.lister.gitea.tasks.RangeGiteaLister[363e0b30-b13a-4f62-bd31-9847dfe62450] succeeded in 3.7196799100056523s: {'status': 'eventful'}

There are 3508 repositories detected.
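
For reference, the full listing with limit=100 boils down to walking the paginated Gitea search API until an empty page comes back. A minimal standalone sketch of that walk (illustrative only: it uses requests directly rather than the swh.lister.gitea code, and assumes the standard /repos/search endpoint and clone_url field of the Gitea v1 API):

# Sketch of a full Gitea listing pass, not the swh.lister.gitea implementation.
# Assumes GET {api}/repos/search?page=N&limit=M returns {"ok": ..., "data": [...]}.
import requests

def list_gitea_repos(api_url="https://codeberg.org/api/v1/", limit=100):
    """Yield the clone URL of every repository of a Gitea instance."""
    page = 1
    while True:
        response = requests.get(
            api_url.rstrip("/") + "/repos/search",
            params={"page": page, "limit": limit},
        )
        response.raise_for_status()
        repos = response.json().get("data", [])
        if not repos:  # an empty page means the listing is complete
            return
        for repo in repos:
            yield repo["clone_url"]
        page += 1

print(sum(1 for _ in list_gitea_repos()))  # ~3500 repositories on codeberg.org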

The configuration has been deployed and the listers restarted.

The initial listing failed due to a concurrency problem. The problem is logged in Sentry here: https://sentry.softwareheritage.org/share/issue/aec9c2af347e47ea84f51ace3bfe2f25/

It looks similar to T2070

I have also tried to create a list-gitea-incremental task but it fails too, this time with another exception related to an unexpected "sort" parameter: https://sentry.softwareheritage.org/share/issue/b0119b56f24347bcb58ac28c68685c62/

swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1267302
  Next run: just now (2020-09-09 09:40:12+00:00)
  Interval: 1 day, 0:00:00
  Type: list-gitea-incremental
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'

The concurrency issue was reproduced locally on the docker environment with a concurrency of 5.

It seems the pages are listed several times during the job execution:

swh-lister_1                    | [2020-09-09 14:04:05,742: INFO/ForkPoolWorker-4] listing repos starting at 10
swh-lister_1                    | [2020-09-09 14:04:06,052: INFO/ForkPoolWorker-4] listing repos starting at 11
swh-lister_1                    | [2020-09-09 14:04:13,819: INFO/ForkPoolWorker-3] listing repos starting at 10
...
swh-lister_1                    | [2020-09-09 14:04:05,621: INFO/ForkPoolWorker-1] listing repos starting at 30
swh-lister_1                    | [2020-09-09 14:04:05,970: INFO/ForkPoolWorker-1] listing repos starting at 31
swh-lister_1                    | [2020-09-09 14:04:10,282: INFO/ForkPoolWorker-2] listing repos starting at 30
swh-lister_1                    | [2020-09-09 14:04:10,949: ERROR/ForkPoolWorker-2] Task swh.lister.gitea.tasks.RangeGiteaLister[f25fb95c-fbf3-4ee6-9072-f4029d2d04c1] raised unexpected: IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "gitea_repo_pkey"\nDETAIL:  Key (uid)=(3567) already exists.\n')

The test of the new version v0.1.4, including the fix of the range split, the uid change and the incremental task fix, is OK.
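
For context on the range split part of that fix: the duplicated pages in the logs above mean that the page ranges handed to the pool workers overlapped. A minimal sketch of splitting a page range into disjoint, contiguous sub-ranges, one per sub-task (this only illustrates the idea; the function and parameter names are made up and this is not the actual v0.1.4 code):

# Illustration of a non-overlapping "range split": cut [first_page, last_page]
# into contiguous chunks, one per sub-task, so no page is ever listed twice.
def split_range(first_page, last_page, nb_chunks):
    """Yield inclusive (start, end) bounds covering the range without overlap."""
    total = last_page - first_page + 1
    size, extra = divmod(total, nb_chunks)
    start = first_page
    for i in range(nb_chunks):
        length = size + (1 if i < extra else 0)
        if length == 0:
            return
        end = start + length - 1
        yield start, end
        start = end + 1  # the next chunk starts right after the previous one

# split_range(1, 36, 5) yields (1, 8), (9, 15), (16, 22), (23, 29), (30, 36)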

Deployment:

  • Database cleanup on db0:
swh-lister=# drop table gitea_repo;
DROP TABLE
  • Update of the lister package on the workers and restart, from pergamon:
root@pergamon:~# clush -b -w @staging-workers 'apt-get update; apt install -y python3-swh.lister'
...
root@pergamon:~# clush -b -w @staging-workers 'dpkg -l python3-swh.lister'
---------------
worker[0-2].internal.staging.swh.network (3)
---------------
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version              Architecture Description
+++-==================-====================-============-=================================================================
ii  python3-swh.lister 0.1.4-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)
# restart
root@pergamon:~# clush -w @swh-workers -b systemctl restart swh-worker@loader_svn
  • Database model upgrade, from scheduler0:
root@scheduler0:~# apt-get update && apt install python3-swh.lister                  
...
Unpacking python3-swh.lister (0.1.4-1~swh1~bpo10+1) over (0.1.2-1~swh1~bpo10+1) ...
Setting up python3-swh.lister (0.1.4-1~swh1~bpo10+1)
swhscheduler@scheduler0:~$ swh lister --db-url postgresql://swh-lister:*****@db0.internal.staging.swh.network:5432/swh-lister db-init
  • Check on db0:
swh-lister=# \d gitea_repo
                         Table "public.gitea_repo"
   Column    |            Type             | Collation | Nullable | Default 
-------------+-----------------------------+-----------+----------+---------
 name        | character varying           |           |          | 
 full_name   | character varying           |           |          | 
 html_url    | character varying           |           |          | 
 origin_url  | character varying           |           |          | 
 origin_type | character varying           |           |          | 
 last_seen   | timestamp without time zone |           | not null | 
 task_id     | integer                     |           |          | 
 uid         | character varying           |           | not null | 
 instance    | character varying           |           |          | 
Indexes:
    "gitea_repo_pkey" PRIMARY KEY, btree (uid)
    "ix_gitea_repo_full_name" btree (full_name)
    "ix_gitea_repo_instance" btree (instance)
    "ix_gitea_repo_name" btree (name)
  • Scheduling of a new full import task, from scheduler0:
swhscheduler@scheduler0:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
  • All the repos are correctly imported without errors:
swh-lister=# select count(*) from gitea_repo;
 count 
-------
  3506
(1 row)
  • Test of the incremental task (a sketch of the stop condition follows this list):
swhscheduler@scheduler0:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
Sep 10 10:22:16 worker1 python3[273967]: [2020-09-10 10:22:16,897: INFO/MainProcess] Received task: swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584]  
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,086: INFO/ForkPoolWorker-4] listing repos starting at 1
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,315: INFO/ForkPoolWorker-4] Repositories already seen, stopping
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,320: INFO/ForkPoolWorker-4] Task swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584] succeeded in 0.4189604769926518s: {'status': 'uneventful'}
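
The "Repositories already seen, stopping" line is the expected behaviour of an incremental run: pages are presumably walked newest first (hence the "sort" parameter mentioned earlier) and the run stops as soon as a page brings nothing new. A rough sketch of that stop condition (illustrative only; fetch_page and known_uids are made-up names, and deriving the uid from the repository id is an assumption based on the duplicate-key log above, not the actual swh.lister.gitea code):

# Sketch of the incremental stop condition, not the swh.lister.gitea code.
# Assumes fetch_page(page, limit) returns the repositories of one API page,
# sorted newest first, and known_uids holds the uids already in gitea_repo.
def incremental_listing(fetch_page, known_uids, limit=100):
    new_repos = []
    page = 1
    while True:
        repos = fetch_page(page, limit)
        unseen = [r for r in repos if str(r["id"]) not in known_uids]
        if not repos or not unseen:
            # empty page, or only already-seen repositories: stop here
            break
        new_repos.extend(unseen)
        page += 1
    return new_repos  # an empty result means an 'uneventful' run, as in the log
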
vsellier reopened this task as Work in Progress. Sep 10 2020, 7:03 PM

Reopened to validate the complete process, from the listing to the loading of some repositories.

There are now 962 repos imported in the staging archive:

swh=# select count(*) from origin where url like 'https://codeberg%';
 count 
-------
   962
(1 row)

This is an extract of some URLs:

swh=# select * from origin where url like 'https://codeberg%' limit 10;
    id     |                          url                          
-----------+-------------------------------------------------------
  91370242 | https://codeberg.org/Freeyourgadget/Gadgetbridge
  91370243 | https://codeberg.org/steko/harris-matrix-data-package
  91370244 | https://codeberg.org/Codeberg/build-deploy-gitea.git
  91370245 | https://codeberg.org/Codeberg/gitea.git
  91370246 | https://codeberg.org/Booteille/Phantom
  91370247 | https://codeberg.org/Booteille/Nitterify
  91370248 | https://codeberg.org/booteille/invidition/
  91370349 | https://codeberg.org/infosechandbook/blog-content.git
  91382717 | https://codeberg.org/hayden/howl
 117690007 | https://codeberg.org/niklas-fischer/schlaues-buch.git

An email was sent to the swh-devel mailing list to ask for reviews.
The deployment to production will be performed in the middle of week 38 if no problems are raised.

That doesn't seem good/normal, does it?

Indeed, that's T2584 we opened last week, fixed by D3934.
We'll check with @anlambert so it lands soon and gets deployed.

To continue the review without it right now, one should either replace %20 by
%2B, which is not so great, or find origins without '+' in them, which is also
not great... ¯\_(ツ)_/¯
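
For context on the %20/%2B workaround: under the usual form-encoding rules a raw '+' decodes to a space, so a literal '+' in an origin URL has to be encoded as %2B to survive the round-trip. A quick standard-library check of that behaviour (this only illustrates the general encoding rules the workaround above relies on, not the archive UI code):

# Quick check of why swapping %20 for %2B "works": a raw '+' is decoded as a
# space under form-encoding rules, while %2B round-trips to a literal '+'.
from urllib.parse import quote, unquote_plus

print(quote("+", safe=""))    # '%2B' -> how a literal '+' should be encoded
print(unquote_plus("a+b"))    # 'a b' -> a raw '+' is decoded as a space
print(unquote_plus("a%2Bb"))  # 'a+b' -> the %2B form round-trips correctly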

Cheers,

We'll check with @anlambert so it lands soon and gets deployed.

Done

(we had a bit of a fight with the Debian package, hence the delay)

Everything seems to work well; the production deployment will be done in T2608.