Deploy sourceforge lister on staging
Closed, Migrated

Event Timeline

ardumont triaged this task as High priority. May 6 2021, 3:10 PM
ardumont created this task.

Not that it has helped yet, but I reproduced the issue in Jenkins *locally*.
Before that, other issues with our moving cogs (swh.core, etc.) prevented it
(other unrelated failures arose).

ardumont changed the task status from Open to Work in Progress. May 7 2021, 12:07 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

Installing the latest package on a worker:

swh lister run --help | grep sourceforge
  -l, --lister [bitbucket|cgit|cran|debian|gitea|github|gitlab|gnu|launchpad|npm|packagist|phabricator|pypi|sourceforge]

As expected, it's there ;)

Update the scheduler backend with the new task type:

# apt update; apt install -y python3-swh.lister
...
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task-type register
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.archive
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.cran
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.debian
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.deposit
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.nixguix
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.npm
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin loader.pypi
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.bitbucket
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.cgit
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.cran
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.debian
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.github
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitlab
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gnu
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.launchpad
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.npm
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.packagist
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.phabricator
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.pypi
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.sourceforge
INFO:swh.scheduler.cli.task_type:Create task type list-sourceforge-full in scheduler

(I did saatchi/prod as well.)

Check everything is fine (it is):

psql service=admin-staging-swh-scheduler
psql (12.6)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

swh-scheduler=> \conninfo
You are connected to database "swh-scheduler" as user "swh-scheduler" on host "db1.internal.staging.swh.network" (address "192.168.130.11") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
swh-scheduler=> \x
Expanded display is on.
swh-scheduler=> select * from task_type where type like 'list-source%';
-[ RECORD 1 ]----+---------------------------------------------------
type             | list-sourceforge-full
description      | Full update of a SourceForge instance
backend_name     | swh.lister.sourceforge.tasks.FullSourceForgeLister
default_interval | 90 days
min_interval     | 90 days
max_interval     | 90 days
backoff_factor   | 1
max_queue_length |
num_retries      |
retry_delay      |

-- set gentler defaults
swh-scheduler=> update task_type set max_queue_length=10, min_interval='30 days', max_interval='30 days', num_retries=3 where type='list-sourceforge-full';
UPDATE 1
swh-scheduler=> select * from task_type where type like 'list-source%';
-[ RECORD 1 ]----+---------------------------------------------------
type             | list-sourceforge-full
description      | Full update of a SourceForge instance
backend_name     | swh.lister.sourceforge.tasks.FullSourceForgeLister
default_interval | 90 days
min_interval     | 30 days
max_interval     | 30 days
backoff_factor   | 1
max_queue_length | 10
num_retries      | 3
retry_delay      |

(Note: we may want to adapt those defaults in the lister repository, in the register function.)

Schedule the new listing task:

probably want

swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task add list-sourceforge-full
Created 1 tasks

Task 22782916
  Next run: today (2021-05-07T14:01:29.354886+00:00)
  Interval: 90 days, 0:00:00
  Type: list-sourceforge-full
  Policy: recurring
  Args:
  Keyword args:

Scheduler runner picked it up:

May 07 14:01:31 scheduler0 swh[824184]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks list-sourceforge-full

A worker picked the task up, and it failed:

May 07 14:01:32 worker2 python3[218671]: [2021-05-07 14:01:32,495: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[1eb27c36-2f58-4a33-8c9d-10b15b98a294]
May 07 14:01:32 worker2 python3[218680]: [2021-05-07 14:01:32,541: ERROR/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[1eb27c36-2f58-4a33-8c9d-10b15b98a294] raised unexpected: TypeError("__init__() got an unexpected keyword argument 'credentials'")
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 385, in trace_task
                                             R = retval = fun(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 55, in __call__
                                             result = super().__call__(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 650, in __protected_call__
                                             return self.run(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/sentry_sdk/integrations/celery.py", line 161, in _inner
                                             reraise(*exc_info)
                                           File "/usr/lib/python3/dist-packages/sentry_sdk/_compat.py", line 57, in reraise
                                             raise value
                                           File "/usr/lib/python3/dist-packages/sentry_sdk/integrations/celery.py", line 156, in _inner
                                             return f(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/swh/lister/sourceforge/tasks.py", line 15, in list_sourceforge_full
                                             return SourceForgeLister.from_configfile().run().dict()
                                           File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 268, in from_configfile
                                             return cls.from_config(**config)
                                           File "/usr/lib/python3/dist-packages/swh/lister/pattern.py", line 255, in from_config
                                             return cls(scheduler=scheduler_instance, **config)
                                         TypeError: __init__() got an unexpected keyword argument 'credentials'
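
The traceback shows `from_config` forwarding every remaining configuration key to the lister constructor, so a `credentials` entry in the configuration fails as soon as `__init__` does not declare it. Below is a minimal sketch of that failure mode and of one possible fix. The classes `BaseLister`, `BrokenLister` and `FixedLister` are hypothetical stand-ins for illustration, not the actual swh.lister code.

```python
from typing import Any, Dict, Optional


class BaseLister:
    """Hypothetical stand-in for a lister base class built from a config dict."""

    @classmethod
    def from_config(cls, scheduler: Dict[str, Any], **config: Any) -> "BaseLister":
        # As in the traceback: every remaining config key becomes a keyword argument.
        return cls(scheduler=scheduler, **config)


class BrokenLister(BaseLister):
    """Constructor that does not know about 'credentials'."""

    def __init__(self, scheduler: Dict[str, Any]) -> None:
        self.scheduler = scheduler


class FixedLister(BaseLister):
    """Constructor that accepts (and may ignore) optional credentials."""

    def __init__(
        self,
        scheduler: Dict[str, Any],
        credentials: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.scheduler = scheduler
        self.credentials = credentials or {}


if __name__ == "__main__":
    # Assumed shape of a worker configuration; the keys are illustrative only.
    config = {"scheduler": {"cls": "memory"}, "credentials": {}}
    try:
        BrokenLister.from_config(**config)
    except TypeError as exc:
        print(f"broken: {exc}")
    print("fixed:", FixedLister.from_config(**config).credentials)
```

Declaring the parameter explicitly, rather than swallowing arbitrary `**kwargs`, keeps typos in the configuration file visible as errors.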

I'll adapt (but afk).

There is something else i need to update there anyway, the incremental task.

> I'll adapt (but afk).

Fixed.

> There is something else i need to update there anyway, the incremental task.

Done as well.

Deployment in progress.

Package built, deployment done.

Added the incremental sourceforge task as well (staging).

INFO:swh.scheduler.cli.task_type:Create task type list-sourceforge-incremental in scheduler

Rescheduled the full listing task, which the scheduler runner picked up:

May 07 15:35:02 scheduler0 swh[824184]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks list-sourceforge-full

It's now running:

May 07 15:31:58 worker0 python3[230921]: [2021-05-07 15:31:58,779: INFO/MainProcess] lister@worker0.internal.staging.swh.network ready.
May 07 15:35:02 worker0 python3[230921]: [2021-05-07 15:35:02,091: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[ec00c8cd-ff5b-47df-adfd-a8c1884b9831]
May 07 15:35:06 worker0 python3[230930]: [2021-05-07 15:35:06,698: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/wiki' does not have any tools
May 07 15:35:07 worker0 python3[230930]: [2021-05-07 15:35:07,402: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/blog' does not have any tools

And we can see the new lister appear in the scheduler backend:

swh-scheduler=> \conninfo
You are connected to database "swh-scheduler" as user "swh-scheduler" on host "db1.internal.staging.swh.network" (address "192.168.130.11") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

swh-scheduler=> select * from listers where name ='sourceforge';
                  id                  |    name     | instance_name |            created            | current_state |            updated
--------------------------------------+-------------+---------------+-------------------------------+---------------+-------------------------------
 4b19e941-5e25-4cb0-b55d-ae421d983e2f | sourceforge | main          | 2021-05-07 15:35:02.157958+00 | {}            | 2021-05-07 15:35:02.157958+00
(1 row)

It broke with the following error; Sentry should have more details [1]:

May 07 15:57:03 worker0 python3[230930]: [2021-05-07 15:57:03,547: ERROR/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[ec00c8cd-ff5b-47df-adfd-a8c1884b9831] raised unexpected: HTTPError('404 Client Error: Not Found for url: https://sourceforge.net/rest/p/fci-cu-library2/b396')

[1] https://sentry.softwareheritage.org/share/issue/06c779e53f7a47c582d8e551662fb65f/

1.3.1 [1] packaged and deployed on staging worker

Rescheduled the task there:

May 19 09:41:15 worker2 python3[1285423]: [2021-05-19 09:41:15,280: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[0d30a736-4f1d-491b-b703-13118a33b7fb]

[1] With the fixes from Alphare

It no longer stops on an unexpected 404 ;)

May 19 09:58:54 worker2 python3[1285433]: [2021-05-19 09:58:54,339: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/fci-cu-library2/b396
May 19 09:59:52 worker2 python3[1285433]: [2021-05-19 09:59:52,095: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/manijshrestha/salesathi
May 19 09:59:58 worker2 python3[1285433]: [2021-05-19 09:59:58,657: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/mp-wrapper
May 19 10:00:11 worker2 python3[1285433]: [2021-05-19 10:00:11,044: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/pasoiu
May 19 10:00:15 worker2 python3[1285433]: [2021-05-19 10:00:15,673: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/sga-ds
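
The behaviour visible in these logs (warn about the failing URL, then keep listing) can be sketched as follows. This is only an illustration under assumptions: `HTTPStatusError`, `list_projects` and the `fetch` callable are hypothetical names, and the actual fix in the lister may be structured differently.

```python
import logging
from typing import Callable, Dict, Iterable, Iterator

logger = logging.getLogger(__name__)


class HTTPStatusError(Exception):
    """Hypothetical error carrying the status code of a failed request."""

    def __init__(self, url: str, status_code: int) -> None:
        super().__init__(f"{status_code} error for url: {url}")
        self.url = url
        self.status_code = status_code


def list_projects(
    urls: Iterable[str], fetch: Callable[[str], Dict]
) -> Iterator[Dict]:
    """Yield fetched project data, skipping URLs that answer with an HTTP error."""
    for url in urls:
        try:
            yield fetch(url)
        except HTTPStatusError as exc:
            # Log and move on instead of aborting the whole listing.
            logger.warning(
                "Unexpected HTTP status code %s for URL %s", exc.status_code, url
            )
```

Because only the hypothetical `HTTPStatusError` is caught, other failures (for example a dropped connection) still propagate to the caller.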

Still running:

May 19 09:41:15 worker2 python3[1285423]: [2021-05-19 09:41:15,280: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[0d30a736-4f1d-491b-b703-13118a33b7fb]
May 19 09:41:20 worker2 python3[1285433]: [2021-05-19 09:41:20,376: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/wiki' does not have any tools
May 19 09:41:20 worker2 python3[1285433]: [2021-05-19 09:41:20,921: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/blog' does not have any tools
May 19 09:58:54 worker2 python3[1285433]: [2021-05-19 09:58:54,339: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/fci-cu-library2/b396
May 19 09:59:52 worker2 python3[1285433]: [2021-05-19 09:59:52,095: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/manijshrestha/salesathi
May 19 09:59:58 worker2 python3[1285433]: [2021-05-19 09:59:58,657: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/mp-wrapper
May 19 10:00:11 worker2 python3[1285433]: [2021-05-19 10:00:11,044: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/pasoiu
May 19 10:00:15 worker2 python3[1285433]: [2021-05-19 10:00:15,673: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/sga-ds
May 19 10:09:20 worker2 python3[1285433]: [2021-05-19 10:09:20,460: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/chaoticmoon/home' does not have any tools
May 19 10:14:29 worker2 python3[1285433]: [2021-05-19 10:14:29,447: WARNING/ForkPoolWorker-4] Project URL 'https://sourceforge.net/motorola/' does not match expected pattern
May 19 10:14:29 worker2 python3[1285433]: [2021-05-19 10:14:29,649: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/motorola/wiki' does not have any tools
May 19 10:14:30 worker2 python3[1285433]: [2021-05-19 10:14:30,136: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/motorola/discussion' does not have any tools
May 19 10:14:30 worker2 python3[1285433]: [2021-05-19 10:14:30,542: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/motorola/news' does not have any tools
May 19 10:50:00 worker2 python3[1285423]: [2021-05-19 10:50:00,600: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.IncrementalGitLabLister[0225ac44-3b3b-4653-af3f-110f547123d6]
May 19 10:50:00 worker2 python3[1285430]: [2021-05-19 10:50:00,767: INFO/ForkPoolWorker-1] Task swh.lister.gitlab.tasks.IncrementalGitLabLister[0225ac44-3b3b-4653-af3f-110f547123d6] succeeded in 0.14286251738667488s: {'pages': 1, 'origins': 0}
May 19 11:32:49 worker2 python3[1285433]: [2021-05-19 11:32:49,647: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 404 for URL https://sourceforge.net/rest/p/intel-sas
May 19 13:10:23 worker2 python3[1285433]: [2021-05-19 13:10:23,221: WARNING/ForkPoolWorker-4] Project URL 'https://sourceforge.net/mirror/' does not match expected pattern
May 19 15:27:45 worker2 python3[1285423]: [2021-05-19 15:27:45,729: INFO/MainProcess] Received task: swh.lister.debian.tasks.DebianListerTask[edd4782b-22e0-4db7-9fd1-c3c1b65c8930]
May 19 15:28:35 worker2 python3[1285430]: [2021-05-19 15:28:35,076: INFO/ForkPoolWorker-1] Task swh.lister.debian.tasks.DebianListerTask[edd4782b-22e0-4db7-9fd1-c3c1b65c8930] succeeded in 49.340988324955106s: {'pages': 9, 'origins': 11}
May 19 17:54:29 worker2 python3[1285423]: [2021-05-19 17:54:29,454: INFO/MainProcess] Received task: swh.lister.pypi.tasks.PyPIListerTask[beb8cebc-955c-47d6-9c6a-daa34cebe759]
May 19 18:04:04 worker2 python3[1285430]: [2021-05-19 18:04:04,660: INFO/ForkPoolWorker-1] Task swh.lister.pypi.tasks.PyPIListerTask[beb8cebc-955c-47d6-9c6a-daa34cebe759] succeeded in 575.1989205237478s: {'pages': 1, 'origins': 305318}
May 19 22:32:26 worker2 python3[1285433]: [2021-05-19 22:32:26,578: WARNING/ForkPoolWorker-4] Project URL 'https://sourceforge.net/arris/' does not match expected pattern
May 19 22:32:26 worker2 python3[1285433]: [2021-05-19 22:32:26,769: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/arris/wiki' does not have any tools
May 19 22:32:27 worker2 python3[1285433]: [2021-05-19 22:32:27,044: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/arris/discussion' does not have any tools
May 19 22:32:27 worker2 python3[1285433]: [2021-05-19 22:32:27,896: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/arris/news' does not have any tools
May 20 01:45:47 worker2 python3[1285433]: [2021-05-20 01:45:47,985: WARNING/ForkPoolWorker-4] Unexpected HTTP status code 500 for URL https://sourceforge.net/rest/p/sightexaminer

Note: other listings are running alongside, hence the "noise" in the output.

It finally stopped, albeit poorly: the remote end closed the connection [1].

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

[1] https://sentry.softwareheritage.org/share/issue/881876c208e24e89a0eb753410c41f72/

Sorry for the delayed response. I'm assuming we'd like it better if the lister continued anyway in case of a "fatal" connection error, with maybe some sort of retry?

> Sorry for the delayed response.

Don't worry about it, it's fine.

> I'm assuming we'd like it better if the lister continued anyway in case of a "fatal"
> connection error, with maybe some sort of retry?

Yes, that'd be neat. Also, if it were implemented as a decorator, like the other
retry we have [1], we could share it with other listers as well (I don't recall us
having this already, but I guess the same failure could happen with other listers).

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/utils.py
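
A decorator along the lines discussed here could look like the sketch below. It uses only the standard library and is not the existing helper from swh/lister/utils.py; the name `retry_on` and its parameters are assumptions made for illustration. Note that with `requests`, one would pass `requests.exceptions.ConnectionError` explicitly, since it is not a subclass of the built-in `ConnectionError`.

```python
import functools
import time
from typing import Any, Callable, Tuple, Type


def retry_on(
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], Any] = time.sleep,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Retry the decorated function on the given exceptions (hypothetical helper).

    The wait doubles after each failed attempt; the last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    sleep(delay * 2 ** (attempt - 1))  # exponential backoff

        return wrapper

    return decorator
```

The `sleep` parameter exists only so that the wait can be replaced in tests; a shared helper would more likely hard-code the waiting strategy or delegate it to a retry library.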

New lister deployed:

ii  python3-swh.lister 1.3.2-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)

Task scheduled, let's see:

May 26 10:56:42 worker2 python3[1850675]: [2021-05-26 10:56:42,737: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[38d47353-1721-4fd6-adac-b96d494efca0]

It went through \o/:

INFO/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[38d47353-1721-4fd6-adac-b96d494efca0] succeeded in 89751.71981433034s: {'pages': 258764, 'origins': 338175}

Installed and triggered a run for the incremental task on staging:

May 27 15:10:48 worker0 python3[2047265]: [2021-05-27 15:10:48,563: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.IncrementalSourceForgeLister[15a585c3-73ab-4522-b799-f7f768c430a6]

Everything went fine as well:

May 27 19:25:29 worker0 python3[2047275]: [2021-05-27 19:25:29,379: INFO/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.IncrementalSourceForgeLister[15a585c3-73ab-4522-b799-f7f768c430a6] succeeded in 15280.69617000036s: {'pages': 818, 'origins': 1408}
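
To put the two "succeeded in ..." figures in perspective, the durations can be converted to a readable form with a few lines of Python. The numbers are copied from the log lines above; the helper name `summarize` is just for illustration.

```python
from datetime import timedelta
from typing import Dict, Tuple

# Figures copied from the two "succeeded in ..." log lines.
runs = {
    "full": {"seconds": 89751.71981433034, "pages": 258764, "origins": 338175},
    "incremental": {"seconds": 15280.69617000036, "pages": 818, "origins": 1408},
}


def summarize(run: Dict[str, float]) -> Tuple[str, float]:
    """Return the duration (truncated to whole seconds) and the origins per second."""
    duration = timedelta(seconds=int(run["seconds"]))
    return str(duration), run["origins"] / run["seconds"]


for name, run in runs.items():
    duration, rate = summarize(run)
    print(f"{name}: {duration}, {rate:.2f} origins/s")
```

The full listing took slightly more than a day, and the incremental one a little over four hours.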

I guess it's all fine now.
What remains is to deploy this in production at some point.

Roh, you know what I meant, forge: not claim the task, resolve it... (anyway, closing)