Deploy launchpad and gitea listers on production
Closed, Migrated

Description

The launchpad and gitea listers have been deployed on the staging environment for a week. There was no negative feedback on them, so they can be deployed in production.

Event Timeline

vsellier changed the task status from Open to Work in Progress. Sep 17 2020, 10:39 AM
vsellier triaged this task as Normal priority.
vsellier created this task.

Actions :

  • deploy the new version of the lister on each worker
  • update the lister data model
  • create the new task-type on the scheduler
  • manually launch a listing to create high-priority loading tasks for the launchpad and gitea repositories, so they are ingested soon rather than at the end of the current git queue
  • truncate lister cache to allow the recurring loading tasks to be created
  • schedule the recurring listing tasks for both repositories
  • make a first listing with 'high' priority 'oneshot' output tasks [1]
swhworker@worker01 $ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister launchpad --priority high
...
  • add scheduler tasks for the forges to list (these output standard recurring tasks with no priority)
swhscheduler@saatchi $ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh scheduler

[1] so they get ingested soon (and not when the current full load-git queue is empty <- never happens)

New version of the lister package deployed :

  • on workers
root@pergamon:~# clush -b -w @swh-workers 'apt-get update; apt install -y python3-swh.lister' 
...
root@pergamon:~# clush -b -w @swh-workers "dpkg -l python3-swh.lister"
---------------
worker[01-16] (16)
---------------
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version              Architecture Description
+++-==================-====================-============-=================================================================
ii  python3-swh.lister 0.1.4-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)
  • on the scheduler :
root@saatchi:~# apt update && apt install python3-swh.lister
...
Restarting services...
 systemctl restart gunicorn-swh-scheduler.service icinga2.service journalbeat.service postfix@-.service rabbitmq-server.service rpcbind.service ssh.service swh-scheduler-runner.service unbound.service
  • lister model updated from worker01:
swhworker@worker01:/etc/softwareheritage$ swh lister --db-url postgresql://*****@db.internal.softwareheritage.org:5432/swh-lister db-init
INFO:swh.lister.cli:Loading lister bitbucket
INFO:swh.lister.cli:Loading lister cgit
INFO:swh.lister.cli:Loading lister cran
INFO:swh.lister.cli:Loading lister debian
INFO:swh.lister.cli:Loading lister gitea
INFO:swh.lister.cli:Loading lister github
INFO:swh.lister.cli:Loading lister gitlab
INFO:swh.lister.cli:Loading lister gnu
INFO:swh.lister.cli:Loading lister launchpad
INFO:swh.lister.cli:Loading lister npm
INFO:swh.lister.cli:Loading lister packagist
INFO:swh.lister.cli:Loading lister phabricator
INFO:swh.lister.cli:Loading lister pypi
INFO:swh.lister.cli:Initializing database
INFO:swh.lister.core.models:Creating tables
INFO:swh.lister.cli:Calling init hook for debian
  • user guest granted access to the new tables :
swh-lister=>    grant select
swh-lister->    on all tables in schema public
swh-lister->    to guest;
GRANT
  • scheduler task-types created:
swhscheduler@saatchi:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.gitea
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-full in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-incremental in scheduler
swhscheduler@saatchi:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.launchpad
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.launchpad
INFO:swh.scheduler.cli.task_type:Create task type list-launchpad-full in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-launchpad-incremental in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-launchpad-new in scheduler
  • initial manual launchpad listing launched :
swhworker@worker02:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister launchpad --priority high
INFO:swh.core.config:Loading config file /etc/softwareheritage/lister.yml
INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/lister.yml
  • initial gitea lister launched :
swhworker@worker02:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister gitea --priority high url=https://codeberg.org/api/v1/ limit=100
...
INFO:root:listing repos starting at 1198
INFO:root:listing repos starting at 1199
INFO:root:listing repos starting at 1200
INFO:root:stopping after page 1200, no next link found

Results :

swh-lister=> select count(*) from gitea_repo limit 10;
 count 
-------
  3598
(1 row)

swh-lister=> select count(*) from launchpad_repo limit 10;
 count 
-------
 19602
(1 row)
  • Task types registered on the scheduler :
swhscheduler@saatchi:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.launchpad
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.launchpad
swhscheduler@saatchi:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.gitea    
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
softwareheritage-scheduler=> select * from task_type where type like 'list-launchpad%' or type like 'list-gitea%';
            type            |              description               |                     backend_name                      | default_interval | min_interval | max_interval | backoff_factor | max_queue_length | num_retries | retry_delay 
----------------------------+----------------------------------------+-------------------------------------------------------+------------------+--------------+--------------+----------------+------------------+-------------+-------------
 list-gitea-full            | Full update of a Gitea instance        | swh.lister.gitea.tasks.FullGiteaRelister              | 90 days          | 90 days      | 90 days      |              1 |                  |             | 
 list-gitea-incremental     | Incremental update of a Gitea instance | swh.lister.gitea.tasks.IncrementalGiteaLister         | 1 day            | 1 day        | 1 day        |              1 |                  |             | 
 list-launchpad-full        | Full update of Launchpad               | swh.lister.launchpad.tasks.FullLaunchpadLister        | 90 days          | 90 days      | 90 days      |              1 |                  |             | 
 list-launchpad-incremental | Incremental update                     | swh.lister.launchpad.tasks.IncrementalLaunchpadLister | 1 day            | 1 day        | 1 day        |              1 |                  |             | 
 list-launchpad-new         | Update new entries of Launchpad        | swh.lister.launchpad.tasks.NewLaunchpadLister         | 1 day            | 1 day        | 1 day        |              1 |                  |             | 
(5 rows)
  • lister's cache truncated :
swh-lister=> truncate gitea_repo;
TRUNCATE TABLE
swh-lister=> truncate launchpad_repo;
TRUNCATE TABLE
  • recurring task for full listing created :
    • gitea
swhscheduler@saatchi:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add list-gitea-full url=https://codeberg.org/api/v1/ limit=100
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 337306005
  Next run: in 3 months (2020-12-16 12:43:30+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitea-full
  Policy: recurring
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
    • launchpad
swhscheduler@saatchi:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add list-launchpad-full
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 337306006
  Next run: just now (2020-09-17 12:46:37+00:00)
  Interval: 90 days, 0:00:00
  Type: list-launchpad-full
  Policy: recurring
  Args:
  Keyword args:
  • recurring task for incremental listing created :
    • gitea
swhscheduler@saatchi:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 337315168
  Next run: just now (2020-09-17 12:51:44+00:00)
  Interval: 1 day, 0:00:00
  Type: list-gitea-incremental
  Policy: recurring
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
    • launchpad
swhscheduler@saatchi:~$ swh  scheduler --config-file /etc/softwareheritage/scheduler.yml task add list-launchpad-incremental
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 337314502
  Next run: just now (2020-09-17 12:51:21+00:00)
  Interval: 1 day, 0:00:00
  Type: list-launchpad-incremental
  Policy: recurring
  Args:
  Keyword args:
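
As a sanity check on the intervals shown above: the next run reported for task 337306005 (2020-12-16 12:43:30+00:00) is exactly one 90-day interval after 2020-09-17 12:43:30+00:00. That start time is inferred here by subtracting the interval; it does not appear in the scheduler output itself. A minimal Python sketch of the arithmetic:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the output of task 337306005 (list-gitea-full).
next_run = datetime(2020, 12, 16, 12, 43, 30, tzinfo=timezone.utc)
interval = timedelta(days=90)

# Inferred, not reported by the scheduler: one interval before the next run.
inferred_start = next_run - interval
print(inferred_start.isoformat())
```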

Current ingestion ongoing from the launchpad instance:

softwareheritage=> select count(*) from origin where url like 'https://git.launchpad.net%';
 count
-------
  3753
(1 row)

Some of them can already be browsed in the archive [1].

The codeberg gitea instance should be ingested next, after this one:

softwareheritage=> select count(*) from origin where url like 'https://codeberg.org%';
 count
-------
    17
(1 row)

[1] https://archive.softwareheritage.org/browse/search/?q=git.launchpad.net&with_visit=true&with_content=true

Progress status:

  • launchpad: 14022 / 19605
  • codeberg: 17 / 3599
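
For readability, the counts above can be expressed as percentages. This is only a small Python sketch; the helper name `progress` is illustrative, and the numbers are copied from the status above.

```python
# (origins ingested, repositories listed), copied from the progress status above
counts = {
    "launchpad": (14022, 19605),
    "codeberg": (17, 3599),
}


def progress(ingested: int, listed: int) -> float:
    """Return the share of listed repositories already ingested, in percent."""
    return 100.0 * ingested / listed


for forge, (ingested, listed) in counts.items():
    print(f"{forge}: {ingested} / {listed} ({progress(ingested, listed):.1f}%)")
```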

As we triggered the listing first for launchpad (oneshot tasks with high
priority) and then for the gitea instance codeberg (oneshot tasks with the same
high priority), the codeberg repositories will only be ingested once the
launchpad ones are done.

(TIL) suggestion for later: run the lister command with a different --priority
for each forge (possible values are high, normal, low) so both forges can be
ingested more in parallel.
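
A possible way to apply this suggestion is to generate one lister command per forge, each with its own priority. The Python sketch below only assembles command lines in the same form as the ones run earlier in this task. The helper `lister_command` is hypothetical, and which forge gets which priority is an assumption made for illustration.

```python
import shlex

# Priority values listed in the note above.
PRIORITIES = ("high", "normal", "low")


def lister_command(lister: str, priority: str, **kwargs: str) -> str:
    """Build a `swh lister run` command line (hypothetical helper)."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    args = ["swh", "lister", "run", "--lister", lister, "--priority", priority]
    # Lister arguments are passed as key=value pairs, as in the runs above.
    args += [f"{key}={value}" for key, value in kwargs.items()]
    return shlex.join(args)


# Assumed split: launchpad stays on high, codeberg goes to normal.
print(lister_command("launchpad", "high"))
print(lister_command("gitea", "normal", url="https://codeberg.org/api/v1/", limit="100"))
```

The commands are only printed, not executed, so the output can be reviewed before being run as swhworker with SWH_CONFIG_FILENAME set as in the runs above.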

Progress status:

|-----------+-----------------+--------+------------+-----------------------+-----------------------------------------------|
| forge     | origin ingested | out of | difference | table (swh-lister db) | Note                                          |
|-----------+-----------------+--------+------------+-----------------------+-----------------------------------------------|
| launchpad |           19568 |  19605 |        -37 | launchpad_repo [1]    | probably failure in repositories              |
|           |                 |        |            |                       | (T2373 possible or 401 ¯\_(ツ)_/¯)            |
|-----------+-----------------+--------+------------+-----------------------+-----------------------------------------------|
| codeberg  |            3612 |   3599 |        +13 | gitea_repo [2]        | more origins because existing                 |
|           |                 |        |            |                       | origins prior to run (probably save code now) |
|-----------+-----------------+--------+------------+-----------------------+-----------------------------------------------|

[1]

softwareheritage=> select count(*) from origin where url like 'https://git.launchpad.net%';
 count
-------
 19568

[2]

softwareheritage=> select count(*) from origin where url like 'https://codeberg.org%';
 count
-------
  3612
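
The difference column of the table above is the number of ingested origins minus the number of listed repositories. A minimal Python check, with the counts copied from the table and from queries [1] and [2] (the helper name `signed_difference` is illustrative):

```python
# (origins ingested, repositories listed), copied from the table above
results = {
    "launchpad": (19568, 19605),
    "codeberg": (3612, 3599),
}


def signed_difference(ingested: int, listed: int) -> str:
    """Format ingested - listed with an explicit sign, as in the table."""
    return f"{ingested - listed:+d}"


for forge, (ingested, listed) in results.items():
    print(f"{forge}: {signed_difference(ingested, listed)}")
```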

See T2616 for the analysis part.
The gist of this task is done.

The remaining part is probably common with the rest of our loader stack.