
Improve PyPI lister to pull last update information when running incrementally
Closed, Migrated

Description

Even though the XML-RPC API for PyPI is "on the way out", it's still the recommended way of subscribing to changes for packages.

Following the instructions at https://warehouse.pypa.io/api-reference/feeds.html, it should be possible for the PyPI lister to populate a "last update" field for most listed origins. This will help us to schedule the origin visits more effectively, and will reduce the loader thrashing on origins that haven't been updated since the last visit.

From a quick test, it looks like the "Project and release activity details" feed can go back multiple years without any issue, allowing us to backfill the data for all known origins, before adding the incremental behavior to the lister.
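
As a rough illustration of the idea above, here is a minimal sketch (not the actual swh.lister code) that folds changelog events into a "last update" timestamp per origin. It assumes the events come from the XML-RPC `changelog_since_serial` call and have the shape `(name, version, timestamp, action, serial)` with a Unix timestamp; the helper names `fetch_changelog` and `last_updates` are hypothetical, and project names are used as-is without normalization.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Tuple
import xmlrpc.client

# Assumed event shape: (name, version, timestamp, action, serial)
ChangelogEvent = Tuple[str, Optional[str], int, str, int]


def fetch_changelog(since_serial: int) -> list:
    """Fetch events newer than since_serial from PyPI (requires network access)."""
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    return client.changelog_since_serial(since_serial)


def last_updates(events: Iterable[ChangelogEvent]) -> Dict[str, datetime]:
    """Map each origin URL to the timestamp of its most recent event."""
    updates: Dict[str, datetime] = {}
    for name, _version, timestamp, _action, _serial in events:
        # Origin URLs follow the pattern seen in the scheduler database.
        url = f"https://pypi.org/project/{name}/"
        seen = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        if url not in updates or seen > updates[url]:
            updates[url] = seen
    return updates
```

Keeping the network call separate from the folding step makes the latter easy to test with canned events.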

Event Timeline

olasd triaged this task as Normal priority. Jun 21 2021, 2:48 PM
olasd created this task.

Deployed in staging and triggered a run:

Jul 09 11:25:17 worker2 python3[1529532]: [2021-07-09 11:25:17,925: INFO/MainProcess] Received task: swh.lister.pypi.tasks.PyPIListerTask[355a33ea-0f9f-41f5-ad6e-c6c4caddd6c5]
Jul 09 12:01:06 worker2 python3[1529542]: [2021-07-09 12:01:06,062: INFO/ForkPoolWorker-4] Task swh.lister.pypi.tasks.PyPIListerTask[355a33ea-0f9f-41f5-ad6e-c6c4caddd6c5] succeeded in 2148.111491953954s: {'pages': 210, 'origins': 1519887}

There are now only 8 origins without last_update in staging:

$ psql service=staging-swh-scheduler
14:01:49 swh-scheduler@db1:5432=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='pypi' and last_update is null;
+-------+
| count |
+-------+
|     8 |
+-------+
(1 row)

Time: 270.575 ms

[1]

14:49:52 swh-scheduler@db1:5432=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null order by url;
+-------------------------------------------------+
|                       url                       |
+-------------------------------------------------+
| https://pypi.org/project/f-luhn/                |
| https://pypi.org/project/int-hash-int-hash-lib/ |
| https://pypi.org/project/linkedin-user-scraper/ |
| https://pypi.org/project/lm-decoder/            |
| https://pypi.org/project/lyra2rec0ban-hash/     |
| https://pypi.org/project/micro-api-ext/         |
| https://pypi.org/project/pokemon-yeet/          |
| https://pypi.org/project/rasa-print/            |
+-------------------------------------------------+
(8 rows)

> From a quick test, it looks like the "Project and release activity details" feed can go back multiple years without any issue, allowing us to backfill the data for all known origins, before adding the incremental behavior to the lister.

The new implementation actually deals with the backfilling.

Deployed in production as well and triggered a run:

14:46:29 softwareheritage-scheduler@belvedere:5432=> select now(), count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null;
+------------------------------+-------+
|             now              | count |
+------------------------------+-------+
| 2021-07-09 12:46:48.01577+00 |     8 |
+------------------------------+-------+
(1 row)

Time: 8643.946 ms (00:08.644)
14:47:30 softwareheritage-scheduler@belvedere:5432=> select * from listers where name='pypi';
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  | name | instance_name |            created            |       current_state       |            updated            |
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
| 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi | pypi          | 2021-04-30 11:14:03.440526+00 | {"last_serial": 10864686} | 2021-07-09 12:46:07.863475+00 |
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
(1 row)

Time: 9.949 ms
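
The `current_state` column above holds a `last_serial` value. A minimal sketch of how an incremental run might advance it, assuming each changelog event carries its serial as the fifth field; `advance_state` is a hypothetical helper, not the lister's real API:

```python
from typing import Dict, Iterable, Sequence


def advance_state(state: Dict[str, int], events: Iterable[Sequence]) -> Dict[str, int]:
    """Return a new state whose last_serial covers every event seen so far."""
    last_serial = state.get("last_serial", 0)
    for event in events:
        # Assumed layout: (name, version, timestamp, action, serial)
        last_serial = max(last_serial, event[4])
    return {**state, "last_serial": last_serial}
```

The next run would then only need to ask for events newer than the stored serial, and a run that sees no events leaves the state unchanged.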

This now shows 8 origins without any last_update, which is marginal enough not to bother with [1].
Given that we started with 316958 origins without any last_update and are now down to 8, I'd say that's a good enough win.

[1]

14:50:02 softwareheritage-scheduler@belvedere:5432=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null order by url;
+-------------------------------------------------+
|                       url                       |
+-------------------------------------------------+
| https://pypi.org/project/f-luhn/                |
| https://pypi.org/project/int-hash-int-hash-lib/ |
| https://pypi.org/project/linkedin-user-scraper/ |
| https://pypi.org/project/lm-decoder/            |
| https://pypi.org/project/lyra2rec0ban-hash/     |
| https://pypi.org/project/micro-api-ext/         |
| https://pypi.org/project/pokemon-yeet/          |
| https://pypi.org/project/rasa-print/            |
+-------------------------------------------------+
(8 rows)

Time: 7172.285 ms (00:07.172)

(These are the same origins as on the staging infra.)

ardumont claimed this task.

Deployed and running, so closing.