Deploy lister next-gen in staging
Closed, Resolved (Public)

Description

Current:

  • status on origin listed: T2998#57610
  • latest deployed python3-swh.lister version: v0.6.1

Two remaining listers (gnu, packagist) still need to be ported; they will be handled in dedicated tasks when the time comes.

Event Timeline

ardumont triaged this task as Normal priority. Jan 27 2021, 11:38 AM
ardumont created this task.
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. Jan 27 2021, 2:36 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.
ardumont updated the task description.

gitlab instance deployed, status ok:

  • Update current task in scheduler to actually trigger now:
swh-scheduler=> select * from task where id=962048;
   id   |       type       |                                        arguments                                        |           next_run            | current_interval |         status         |  policy   | retries_left | priority
--------+------------------+-----------------------------------------------------------------------------------------+-------------------------------+------------------+------------------------+-----------+--------------+----------
 962048 | list-gitlab-full | {"args": [], "kwargs": {"url": "https://gitlab.inria.fr/api/v4/", "instance": "inria"}} | 2021-04-27 17:48:56.188577+00 | 90 days          | next_run_not_scheduled | recurring |            0 |
(1 row)
  • Then check run
Jan 27 17:48:30 worker0 python3[1167]: [2021-01-27 17:48:30,718: INFO/ForkPoolWorker-4] Task swh.lister.gitlab.tasks.FullGitLabRelister[62441d9c-9c07-4305-85eb-cb70cda23ea1] succeeded in 203.48368402300002s: {'pages': 145, 'origins': 2874}

And output:

swh-scheduler=> select count(*) from listed_origins where url like 'https://gitlab.inria.fr%';
 count
-------
  2874
(1 row)
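
The cross-check above (worker log result versus `listed_origins` count) can be scripted. The following is only a sketch: the regular expression is an assumption based on the log line shown in this comment, and `parse_success_line` is a hypothetical helper, not part of swh.lister.

```python
import ast
import re

# Pattern for the "succeeded" lines the Celery worker writes to the journal.
# It is derived from the single log line quoted above (an assumption).
SUCCESS_RE = re.compile(
    r"Task (?P<task>[\w.]+)\[(?P<uuid>[0-9a-f-]+)\] "
    r"succeeded in (?P<seconds>[0-9.]+)s: (?P<result>\{.*\})"
)


def parse_success_line(line):
    """Return (task name, duration in seconds, result dict), or None if no match."""
    match = SUCCESS_RE.search(line)
    if match is None:
        return None
    # The result is printed as a Python dict literal, so literal_eval can read it.
    result = ast.literal_eval(match.group("result"))
    return match.group("task"), float(match.group("seconds")), result


line = (
    "Jan 27 17:48:30 worker0 python3[1167]: [2021-01-27 17:48:30,718: "
    "INFO/ForkPoolWorker-4] Task swh.lister.gitlab.tasks.FullGitLabRelister"
    "[62441d9c-9c07-4305-85eb-cb70cda23ea1] succeeded in 203.48368402300002s: "
    "{'pages': 145, 'origins': 2874}"
)
task, seconds, result = parse_success_line(line)
db_count = 2874  # value returned by the count(*) query above
print(task, result["origins"] == db_count)
```

The same helper could be fed the other "succeeded in" lines in this task to compare each run against its scheduler count.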

Also added an incremental instance task.

cli run for the github instance, status ok:

swhworker@worker0:~$ dpkg -l python3-swh.lister | grep lister
ii  python3-swh.lister 0.5.4-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)
swhworker@worker0:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister github
^C
$ psql service=staging-swh-scheduler
swh-scheduler=> select * from listers where instance_name='github';
                  id                  |  name  | instance_name |            created            |      current_state      |            updated
--------------------------------------+--------+---------------+-------------------------------+-------------------------+-------------------------------
 9a27a3ac-1e88-48e0-9a9b-37ba28817473 | github | github        | 2021-01-28 10:52:37.408887+00 | {"last_seen_id": 22865} | 2021-01-28 10:53:07.886609+00
(1 row)
swh-scheduler=> select count(*) from listed_origins where url like 'https://github.com/%';
 count
-------
  5000
(1 row)

cli run for the bitbucket lister, status ok:

swhworker@worker0:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister bitbucket incremental=True
WARNING:swh.lister.bitbucket.lister:No credentials set in configuration, using anonymous mode
^C
swhworker@worker0:~$

$ psql service=staging-swh-scheduler
psql (12.5)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

swh-scheduler=> select * from listers where instance_name='bitbucket';
                  id                  |   name    | instance_name |            created            | current_state |            updated
--------------------------------------+-----------+---------------+-------------------------------+---------------+-------------------------------
 c353a201-e4e1-42c2-b954-8a1c6c5928ae | bitbucket | bitbucket     | 2021-01-28 11:00:10.034268+00 | {}            | 2021-01-28 11:00:10.034268+00
(1 row)

swh-scheduler=> select count(*) from listed_origins where url like 'https://bitbucket%';
 count
-------
 10000
(1 row)

swh-scheduler=> select * from listers where instance_name='bitbucket';
                  id                  |   name    | instance_name |            created            |                      current_state                      |            updated
--------------------------------------+-----------+---------------+-------------------------------+---------------------------------------------------------+-------------------------------
 c353a201-e4e1-42c2-b954-8a1c6c5928ae | bitbucket | bitbucket     | 2021-01-28 11:00:10.034268+00 | {"last_repo_cdate": "2012-06-01T12:57:01.156999+00:00"} | 2021-01-28 11:00:32.443539+00
(1 row)
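
To make the `current_state` column more concrete, here is a sketch of how a stored checkpoint such as `last_repo_cdate` could be read and compared against repository creation dates. How the bitbucket lister actually uses this value is not shown in this task, so `is_new` is a hypothetical helper and the comparison logic is an assumption.

```python
import json
from datetime import datetime

# current_state value copied from the second query above.
state_json = '{"last_repo_cdate": "2012-06-01T12:57:01.156999+00:00"}'
state = json.loads(state_json)

# The timestamp is ISO 8601 with a UTC offset, so fromisoformat can parse it.
last_cdate = datetime.fromisoformat(state["last_repo_cdate"])


def is_new(repo_created_on):
    """True if a repository was created after the stored checkpoint."""
    return datetime.fromisoformat(repo_created_on) > last_cdate


print(is_new("2013-01-01T00:00:00+00:00"))
print(is_new("2011-01-01T00:00:00+00:00"))
```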

Lister phabricator deployed with one instance (swh), status ok:

swhworker@worker0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add list-phabricator-full url=https://forge.softwareheritage.org/api/diffusion.repository.search instance=swh
Created 1 tasks

Task 17089493
  Next run: today (2021-01-28T11:09:56.580260+00:00)
  Interval: 90 days, 0:00:00
  Type: list-phabricator-full
  Policy: recurring
  Args:
  Keyword args:
    instance: 'swh'
    url: 'https://forge.softwareheritage.org/api/diffusion.repository.search'

listed 180 origins with 2 pages:

Jan 28 11:10:00 worker0 python3[1157]: [2021-01-28 11:10:00,394: INFO/MainProcess] Received task: swh.lister.phabricator.tasks.FullPhabricatorLister[c7b95b0a-0f80-4b72-b7c6-0ea2df51ef02]
Jan 28 11:10:01 worker0 python3[2392]: [2021-01-28 11:10:01,708: INFO/ForkPoolWorker-6] Task swh.lister.phabricator.tasks.FullPhabricatorLister[c7b95b0a-0f80-4b72-b7c6-0ea2df51ef02] succeeded in 1.285880941999494s: {'pages': 2, 'origins': 180}

Status ok:

swh-scheduler=> select count(*) from listed_origins where url like 'https://forge.softwareheritage%';
 count
-------
   180
(1 row)

swh-scheduler=> select * from listers where instance_name='swh';
                  id                  |    name     | instance_name |            created            | current_state |            updated
--------------------------------------+-------------+---------------+-------------------------------+---------------+-------------------------------
 12ded103-af37-41ac-ae3a-3643bb17ecd5 | phabricator | swh           | 2021-01-28 11:09:40.631348+00 | {}            | 2021-01-28 11:09:40.631348+00
(1 row)
ardumont updated the task description.

One cgit lister scheduled; status: it finished ok, but see [1]:

Jan 28 12:36:39 worker0 python3[29180]: [2021-01-28 12:36:39,717: INFO/MainProcess] Received task: swh.lister.cgit.tasks.CGitListerTask[9544dbd3-fa73-42d4-a194-36d82a2370ea]
Jan 28 12:41:46 worker0 python3[29190]: [2021-01-28 12:41:46,608: INFO/ForkPoolWorker-4] Task swh.lister.cgit.tasks.CGitListerTask[9544dbd3-fa73-42d4-a194-36d82a2370ea] succeeded in 306.8694303520024s: {'pages': 1, 'origins': 1070}

In scheduler, all is well:

swh-scheduler=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='cgit' and l.instance_name='git-kernel';
 count
-------
  1070
(1 row)

[1]
Note that this lister seems to need some write improvements, though.
It seems to have flushed its writes only at the end of the listing.
If that's the real behavior (I'll need to check), it won't bode well for relatively large instances, like the cgit eclipse instance for example.

The cgit lister should flush origins after each page. Which instance has been listed here?

Some listers, like debian, might flush a large number of origins per page; I will be curious to see how it goes.

Yes, regarding flushing after each page: we did not implement anything particular in the cgit implementation.
We left that concern to the StatelessLister / Lister class.

The instance listed was https://git.kernel.org [1]

[1]

1121917 | list-cgit | {"args": [], "kwargs": {"url": "https://git.kernel.org", "instance": "git-kernel"}} | 2021-01-29 12:42:12.077249+00 | 1 day            | next_run_not_scheduled | recurring |            0 |

gitea lister instance (https://try.gitea.io/api/v1), status ok:

$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister gitea url=https://try.gitea.io/api/v1/ instance=try-gitea

swh-scheduler=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='gitea' and l.instance_name='try-gitea';
 count
-------
  6932
(1 row)
ardumont updated the task description.

lister-cran status: run ko [1]

swhworker@worker0:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/lister.yml swh lister run --lister cran
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.11.0', 'console_scripts', 'swh')()
...
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 TypeError: ["can not serialize 'Attribute' object"]>

[1] https://sentry.softwareheritage.org/share/issue/2cd53c7575834b1aaf65760b80bcbcef/

launchpad listing in progress and it seems to display the same behavior as the cgit lister (T2998#57500) [1]

T3003 opened to improve on such behavior.

[1] It's been running for a few minutes now and still nothing to see in the listing (although the worker is doing things).

swh-scheduler=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='launchpad' and l.instance_name='launchpad';
 count
-------
     0
(1 row)

lister launchpad run ko, see details [1]

[1] T3003#57551

D4965 should take care of the late-flush behavior discussed above.

For example, it unstuck the launchpad lister (T3003) and the cgit lister with the eclipse instance (which apparently did nothing when I ran it).
Now they are actually listing and regularly writing their origins as they go (not only at the end).

And the gist of why? We passed a generator along to the scheduler.record_listed_origin api, so the resulting list was probably huge for some instances.
See the diff for more details ;)
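
For illustration only, here is one way to write from a generator in bounded batches instead of one large list. This is not the content of D4965: `batched`, `record_in_batches`, the batch size and the `record` callable are all hypothetical stand-ins, and the example.org URLs are made up.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items, consuming `iterable` lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def record_in_batches(origins, record, batch_size=1000):
    """Send origins to `record` batch by batch; return how many were sent.

    `record` stands in for whatever call writes listed origins to the
    scheduler; it is a hypothetical callable here, not the real API.
    """
    total = 0
    for batch in batched(origins, batch_size):
        record(batch)
        total += len(batch)
    return total


# Usage with a fake writer that only remembers the size of each batch.
written = []
origins = (f"https://example.org/repo/{i}" for i in range(2500))
count = record_in_batches(origins, lambda batch: written.append(len(batch)))
print(count, written)  # 2500 [1000, 1000, 500]
```

With this shape, at most one batch is held in memory at a time and each batch becomes visible in the scheduler as soon as it is written.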

With patch D4965, this gets better; the launchpad listed origins:

swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='launchpad' and l.instance_name='launchpad';
             now              | count
------------------------------+-------
 2021-01-28 16:17:12.18842+00 | 20781
(1 row)

It's stuck on another issue now (a duplicated origin, which is not supposed to happen; @anlambert is trying to reproduce it in order to fix it).
(I changed the status from KO to OK-ish in the description ¯\_(ツ)_/¯ )

lister-cran: fixed. The issue was related to the deployment in staging (swh.scheduler needed an update), not to the lister.

swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='CRAN' and l.instance_name='cran';
              now              | count
-------------------------------+-------
 2021-01-28 16:34:50.796025+00 | 17038
(1 row)

Debian lister status: ok (with local patch):

Adapt the current scheduled task on debian lister:

swh-scheduler=> update task
set arguments = '{"args": [], "kwargs": {"distribution": "Debian", "mirror_url": "http://deb.debian.org/debian/", "suites": ["stretch", "buster", "bullseye"], "components": ["main", "contrib", "non-free"]}}'
where type = 'list-debian-distribution';
UPDATE 1
swh-scheduler=> update task set status='next_run_not_scheduled', next_run=now() where type = 'list-debian-distribution';
UPDATE 1
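
As a sketch of an alternative to hand-quoting the JSON in psql, the task arguments can be built as a Python dict and serialized. The values are copied from the update above; the `%s` placeholder style is the one used by DB-API drivers such as psycopg2, and actually running the statement through such a driver is an assumption not shown in this task.

```python
import json

# Keyword arguments copied from the update statement above.
kwargs = {
    "distribution": "Debian",
    "mirror_url": "http://deb.debian.org/debian/",
    "suites": ["stretch", "buster", "bullseye"],
    "components": ["main", "contrib", "non-free"],
}
arguments = json.dumps({"args": [], "kwargs": kwargs})

# Parameterized statement: the driver does the quoting (nothing is executed here).
sql = (
    "update task set arguments = %s, "
    "status = 'next_run_not_scheduled', next_run = now() "
    "where type = %s"
)
params = (arguments, "list-debian-distribution")

print(sql)
print(params)
```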

Check listing scheduled:

$ journalctl -xef -u swh-worker@lister
Jan 28 17:36:10 worker0 python3[35747]: [2021-01-28 17:36:10,421: INFO/MainProcess] Received task: swh.lister.debian.tasks.DebianListerTask[f1750ff1-9c36-4555-a791-577d13256770]
Jan 28 17:38:42 worker0 python3[35758]: [2021-01-28 17:38:42,386: INFO/ForkPoolWorker-4] Task swh.lister.debian.tasks.DebianListerTask[f1750ff1-9c36-4555-a791-577d13256770] succeeded in 151.94653273699805s: {'pages': 9, 'origins': 34845}

Check results in scheduler:

swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='debian' and l.instance_name='Debian';
              now              | count
-------------------------------+-------
 2021-01-28 17:43:12.617401+00 | 34845
(1 row)
ardumont updated the task description.
ardumont updated the task description.

The pypi run triggered by itself and completed without intervention (cool):

Jan 28 17:45:08 worker0 python3[35747]: [2021-01-28 17:45:08,184: INFO/MainProcess] Received task: swh.lister.pypi.tasks.PyPIListerTask[00a4cfb2-6bad-461e-8784-5c931413474f]
Jan 28 17:48:52 worker0 python3[35758]: [2021-01-28 17:48:52,274: INFO/ForkPoolWorker-4] Task swh.lister.pypi.tasks.PyPIListerTask[00a4cfb2-6bad-461e-8784-5c931413474f] succeeded in 224.08578772700275s: {'pages': 1, 'origins': 285962}
swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='pypi' and l.instance_name='pypi';
              now              | count
-------------------------------+--------
 2021-01-28 17:52:57.325469+00 | 285962
(1 row)

npm run scheduled, run in progress:

swhworker@worker0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add list-npm-full
Created 1 tasks

Task 558604
  Next run: Jan 30 (2021-01-30T16:57:12.857527+00:00)
  Interval: 7 days, 0:00:00
  Type: list-npm-full
  Policy: recurring
  Args:
  Keyword args:

In progress listing:

swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='npm' and l.instance_name='npm';
              now              | count
-------------------------------+-------
 2021-01-28 18:00:17.052938+00 | 45995
(1 row)

npm listing done, so status ok as well:

Jan 28 19:23:18 worker0 python3[35758]: [2021-01-28 19:23:18,509: INFO/ForkPoolWorker-4] Task swh.lister.npm.tasks.NpmListerTask[2adad3a2-b054-4152-a31c-e10ac55589f4] succeeded in 5220.606264057002s: {'pages': 1507, 'origins': 1505605}

And, listed_origins in scheduler:

swh-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id and l.name='npm' and l.instance_name='npm';
              now              |  count
-------------------------------+---------
 2021-01-29 08:09:58.457384+00 | 1505605
(1 row)

Current status of all listings:

swh-scheduler=> select now(), l.name, l.instance_name, count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id group by (l.name, l.instance_name);
             now              |    name     | instance_name |  count
------------------------------+-------------+---------------+---------
 2021-01-29 08:10:59.82346+00 | bitbucket   | bitbucket     |   11000
 2021-01-29 08:10:59.82346+00 | cgit        | eclipse       |     900
 2021-01-29 08:10:59.82346+00 | cgit        | git-kernel    |    1070
 2021-01-29 08:10:59.82346+00 | CRAN        | cran          |   17038
 2021-01-29 08:10:59.82346+00 | debian      | Debian        |   34845
 2021-01-29 08:10:59.82346+00 | gitea       | try-gitea     |    6932
 2021-01-29 08:10:59.82346+00 | github      | github        |    5000
 2021-01-29 08:10:59.82346+00 | gitlab      | inria         |    2879
 2021-01-29 08:10:59.82346+00 | launchpad   | launchpad     |   20782
 2021-01-29 08:10:59.82346+00 | npm         | npm           | 1505605
 2021-01-29 08:10:59.82346+00 | phabricator | swh           |     180
 2021-01-29 08:10:59.82346+00 | pypi        | pypi          |  285962
(12 rows)
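
As a rough comparison of the runs, the durations and origin counts from the "succeeded in" log lines quoted in this task can be turned into origins per second. This is only a sketch: the `throughput` helper is hypothetical, the lister/instance labels are made up for display, and only the runs whose log lines appear above are included.

```python
# (label, origins, seconds) taken from the "succeeded in ..." log lines above.
runs = [
    ("gitlab/inria", 2874, 203.48368402300002),
    ("phabricator/swh", 180, 1.285880941999494),
    ("cgit/git-kernel", 1070, 306.8694303520024),
    ("debian/Debian", 34845, 151.94653273699805),
    ("pypi/pypi", 285962, 224.08578772700275),
    ("npm/npm", 1505605, 5220.606264057002),
]


def throughput(origins, seconds):
    """Listed origins per second of task run time."""
    return origins / seconds


# Sort from slowest to fastest run.
ranked = sorted(runs, key=lambda run: throughput(run[1], run[2]))
for name, origins, seconds in ranked:
    print(f"{name:<16} {throughput(origins, seconds):10.1f} origins/s")
```

These figures mix very different listing styles (pypi returns everything in one page, npm needs 1507 pages), so they are indicative only.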
ardumont updated the task description.
ardumont moved this task from in-progress to deployed/landed on the System administration board.
ardumont claimed this task.
ardumont moved this task from deployed/landed to done on the System administration board.