Page MenuHomeSoftware Heritage

Make PyPI lister incremental and complete in regards to last_update
ClosedPublic

Authored by ardumont on Jul 7 2021, 4:20 PM.

Details

Summary

This rewrote the current implementation to actually use pypi's xml-rpc api which allows
to be incremental. It also allows to fetch the last release date per package. This last
part actually make it possible to update the "last_update" entry in the ListedOrigin
model.

Related to T3399

Test Plan

tox

actually run it within docker:

$ swh-doco exec swh-lister swh lister run --lister pypi
+ cd /home/tony/work/inria/repo/swh/swh-environment/docker
+ docker-compose -f docker-compose.yml -f docker-compose.override.yml exec swh-lister swh lister run --lister pypi
WARNING:swh.lister.pypi.lister:Retrying swh.lister.pypi.lister.PyPILister._changelog_since_serial in 1.0 seconds as it raised Fault: <Fault -32500: 'HTTPTooManyRequests: The action could not be performed because there were too many requests by the client. Limit may reset in 1 seconds.'>.
...

Within the db scheduler:

$ psql service=swh-scheduler-dev
11:08:51 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            | current_state |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}            | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {}            | 2021-07-09 08:34:56.26739+00  |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
(2 rows)

Time: 0.578 ms
11:10:27 swh-scheduler@localhost:5433=# select * from listed_origins;
...
+-[ RECORD 12678 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id              | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e                                                                                 |
| url                    | https://pypi.org/project/message/                                                                                    |
| visit_type             | pypi                                                                                                                 |
| extra_loader_arguments | {}                                                                                                                   |
| enabled                | t                                                                                                                    |
| first_seen             | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_seen              | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_update            | 2011-01-13 11:43:17+00                                                                                               |
+-[ RECORD 12679 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id              | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e                                                                                 |
| url                    | https://pypi.org/project/pytest-pep8/                                                                                |
| visit_type             | pypi                                                                                                                 |
| extra_loader_arguments | {}                                                                                                                   |
| enabled                | t                                                                                                                    |
| first_seen             | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_seen              | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_update            | 2010-12-06 15:38:50+00                                                                                               |
+------------------------+----------------------------------------------------------------------------------------------------------------------+

The run is actually done now and all last_update is filled in:

11:39:30 swh-scheduler@localhost:5433=# select count(*) from listed_origins ;
+--------+
| count  |
+--------+
| 390619 |
+--------+
(1 row)

Time: 42.307 ms
11:39:41 swh-scheduler@localhost:5433=# select count(*) from listed_origins where last_update is null;
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

11:41:03 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count  |
+--------+
| 390619 |
+--------+
(1 row)

Time: 3489.885 ms (00:03.490)  -- ^ just in case

The state got updated:

11:45:35 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            |       current_state       |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}                        | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {"last_serial": 10863341} | 2021-07-09 09:38:01.877813+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)

Time: 1.554 ms

Running another run, the lister is indeed incremental (the state of the lister got updated and 1 new origin got found):

11:45:43 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            |       current_state       |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}                        | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {"last_serial": 10863632} | 2021-07-09 09:46:48.406973+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)

Time: 2.386 ms
11:46:52 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count  |
+--------+
| 390620 |
+--------+
(1 row)

Time: 3504.825 ms (00:03.505)

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22514
Build 35081: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35080: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ardumont published this revision for review.Jul 7 2021, 4:29 PM

Build is green

Patch application report for D5977 (id=21538)

Rebasing onto bf7d44db3c...

Current branch diff-target is up to date.
Changes applied before test
commit 4ca1a05461390bb489b7077945cf9491012fff68
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Add incremental PyPI lister which fetches last_update per origin
    
    This introduces a new incremental behavior for the pypi lister to diminish further
    existing origins without any last update.
    
    With the incremental parameter set to False (the default for retrocompatibility), the
    existing listing happens as before, this uses the simple api lists everything that's
    exposed in the returned result. No last_update per origin is extracted from that
    endpoint.
    
    With the incremental parameter set to True, the listing actually only fetches the last
    changes since that last visit. This also uses the XML-RPC API exposed to retrieve only
    the last update for each of the origins listed. The state of the lister is then updated
    to the current visit date.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/321/ for more details.

Ensure the state is updated if relevant (current visit most recent than the last one)

swh/lister/pypi/tests/test_lister.py
151

mypy did not want it so fine it complained it wanted Tuple[datetime, List[object], Any] which is not as informative as ^.

Build is green

Patch application report for D5977 (id=21539)

Rebasing onto bf7d44db3c...

Current branch diff-target is up to date.
Changes applied before test
commit f2d183198e3ffad0d65e5510d1eb2fdda59c03c8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Add incremental PyPI lister which fetches last_update per origin
    
    This introduces a new incremental behavior for the pypi lister to diminish further
    existing origins without any last update.
    
    With the incremental parameter set to False (the default for retrocompatibility), the
    existing listing happens as before, this uses the simple api lists everything that's
    exposed in the returned result. No last_update per origin is extracted from that
    endpoint.
    
    With the incremental parameter set to True, the listing actually only fetches the last
    changes since that last visit. This also uses the XML-RPC API exposed to retrieve only
    the last update for each of the origins listed. The state of the lister is then updated
    to the current visit date.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/322/ for more details.

Nice, that lister should perform better once that feature deployed to production. I added a first batch of inline comments.

swh/lister/pypi/lister.py
33

since the pypi instance was visited

84

You could use a dataclass PackageUpdate instead of a tuple here for better readability.

85

Execute the listing of the last updates since last_visit_timestamp

92

The last visit timestamp

102

Is is not client.changelog instead here ?

Thanks!

swh/lister/pypi/lister.py
102

totally, here falls apart the mocking part ;)
nice catch!

swh/lister/pypi/tests/test_lister.py
148

you were right earlier, see this ^

ardumont added inline comments.
swh/lister/pypi/lister.py
84

totally, thx ;)

ardumont marked an inline comment as done.

Adapt according to judicious remarks (thanks ;)

Build is green

Patch application report for D5977 (id=21543)

Rebasing onto bf7d44db3c...

Current branch diff-target is up to date.
Changes applied before test
commit 3405226d147131bb07776e137f20e9624bd0e90b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Add incremental PyPI lister which fetches last_update per origin
    
    This introduces a new incremental behavior for the pypi lister to diminish further
    existing origins without any last update.
    
    With the incremental parameter set to False (the default for retrocompatibility), the
    existing listing happens as before, this uses the simple api lists everything that's
    exposed in the returned result. No last_update per origin is extracted from that
    endpoint.
    
    With the incremental parameter set to True, the listing actually only fetches the last
    changes since that last visit. This also uses the XML-RPC API exposed to retrieve only
    the last update for each of the origins listed. The state of the lister is then updated
    to the current visit date.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/323/ for more details.

Nice, that lister should perform better once that feature deployed to production.

Yes, hopefully ;)

It's also to decrease the origins without any last update in the scheduler. Today, pypi
origins are actually the ones with the most no_last_update entry [1]. Once, that's
dealt with, there is very few origin without it. That will definitely help the scheduler
to have a more appropriate behavior.

[1] see [2] for the actual computation.

| id                                   | name          | instance_name                | visit_type |     total | not_scheduled | no_snapshot | no_last_update |
|--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------|
| 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi          | pypi                         | pypi       |    316958 |          1748 |       10030 |         316958 |
...

[2]

10:01:00 softwareheritage-scheduler@belvedere:5432=> select id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type;
(46 rows)

From my quick testing, I have noticed that the changelog methods only returns a limited number of results, so we would need to iterate calls.

I think the properly paginated way of doing with the current PyPI XMLRPC api is to call:

  • changelog_last_serial() to get the highest serial, to be used as a termination condition. Currently returns 2168587
  • changelog_since_serial(<current_serial>) in a loop until the last serial returned is higher than the one set as termination condition. Looks like this returns 50k results per call.

(this will make us miss the last few updates that happened since the lister started, but this is probably marginal).

Then I assume the state kept across listings can just be the serial of the last changelog entry processed by the lister. This also avoids conversions between ints and datetimes in the state/db methods, as well as in the _last_updates_since method.

Finally, I think it would be nicer if you could mock the behavior/methods/return values of the ServerProxy class directly instead of hiding that in the _last_updates_since method.

Looking at the size of the changelog (2-ish million entries for 50k-ish pages means 50-ish requests), I /think/ the lister could always be running in incremental mode, rather than having to maintain two modes in the long run.

swh/lister/pypi/tests/test_lister.py
131

mock_pypi_xmlrpc?

In D5977#153851, @olasd wrote:

Looking at the size of the changelog (2-ish million entries for 50k-ish pages means 50-ish requests), I /think/ the lister could always be running in incremental mode, rather than having to maintain two modes in the long run.

And by this I mean that doing a full lister run (from an empty state) using the new incremental mode should still be pretty fast, and will provide more useful data than the other """api""" ever gave us.

Thanks.

From my quick testing, I have noticed that the changelog methods only returns a limited number of results, so we would need to iterate calls.

I think the properly paginated way of doing with the current PyPI XMLRPC api is to call:

  • changelog_last_serial() to get the highest serial, to be used as a termination condition. Currently returns 2168587
  • changelog_since_serial(<current_serial>) in a loop until the last serial returned is higher than the one set as termination condition. Looks like this returns 50k results per call.

(this will make us miss the last few updates that happened since the lister started, but this is probably marginal).

possible yes and like you i assume it's marginal enough to not pose that much problem.

Then I assume the state kept across listings can just be the serial of the last
changelog entry processed by the lister. This also avoids conversions between ints and
datetimes in the state/db methods, as well as in the _last_updates_since method.

Ok, so that means we can start from scratch, code-wise, with that lister, right?

Finally, I think it would be nicer if you could mock the behavior/methods/return
values of the ServerProxy class directly instead of hiding that in the
_last_updates_since method.

Yes, but as mentioned in the docstring of that method, standard mocking would not work.
I guess, if that's actually http underneath it all, I could try the similar approach as
the other tests with requests_mock but i'm not so sure about that.

Looking at the size of the changelog (2-ish million entries for 50k-ish pages means
50-ish requests), I /think/ the lister could always be running in incremental mode,
rather than having to maintain two modes in the long run.

Yes, i guess so.

The implementation in this diff kept both the implementation so it gave us the time to
backfill the existing values as mentioned in the associated task.

swh/lister/pypi/lister.py
86

@olasd to explicit ^ when mocking the ServerProxy, here is the issue with whatever methods i'd like to mock ("the *annoying* implementation detail)

AttributeError: <class 'xmlrpc.client.ServerProxy'> does not have the attribute 'last_serial'
AttributeError: <class 'xmlrpc.client.ServerProxy'> does not have the attribute 'changelog_since_serial'

Adapt according to review:

  • This rewrote completely the lister and stops using the simple api [1]
  • Uses the client.changelog_last_serial and client.changelog_since_serial endpoint

[1] This api is actually incomplete in regards to last update and does not really allow to be incremental

ardumont retitled this revision from Add incremental PyPI lister which fetches last_update per origin to Make PyPI lister incremental and complete in regards to last_update.
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D5977 (id=21565)

Rebasing onto fe01d08cd9...

First, rewinding head to replay your work on top of it...
Applying: Make PyPI lister incremental and complete in regards to last_update
Changes applied before test
commit b2011a8cb8c58a544ec9afc90ea1b445b4e8317c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/324/ for more details.

Build is green

Patch application report for D5977 (id=21566)

Rebasing onto fe01d08cd9...

Current branch diff-target is up to date.
Changes applied before test
commit 31e3c77409a162e367f25e51f6dcf86833ab8226
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/325/ for more details.

Build is green

Patch application report for D5977 (id=21567)

Rebasing onto 698be475e9...

Current branch diff-target is up to date.
Changes applied before test
commit 77f7da32e06361e8a4c860ad8c884582e5804796
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/326/ for more details.

Running through docker, i actually need to change a few things:

  • throttling needs to change (D5983)
  • actual logic of sending all origins in one go won't be ok for the first run (no flush prior to actually finish the run...) so i'll need to rework this (in-progress) [1]

[1] That should even simplify and avoid one indirection.

  • Handle throttling appropriately (adapt throttling decorators [1])
  • Rework logic to still deal with pages of results so the write in dbs occur incrementally

[1] Still need to update the test for it though

Build is green

Patch application report for D5977 (id=21569)

Rebasing onto 698be475e9...

Current branch diff-target is up to date.
Changes applied before test
commit f4bc205216c88fa8517609a3ce7a02b60477da16
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/328/ for more details.

ardumont edited the test plan for this revision. (Show Details)

Looks good, thanks a lot!

I made a few suggestions for (what I think are) clearer variable names, but the logic seems sound.

How much did you get throttled by the upstream API server in practice?

How does the result compare with the origins we've noticed in prod?

swh/lister/pypi/lister.py
66

maybe self.last_processed_serial?

86

My suggestion would be to substitute xmlrpc.client.ServerProxy() with a (stubbed) class implementing just these two methods with hardcoded return values. I don't know how Mock() or MagicMock() behaves on classes which have dynamically generated attributes/methods, like the ones xmlrpc.client generates.

132–134

I'm a bit confused by these three variables. It seems that last_serial is never reused, so here's my proposal!

136–158
175–181
This revision is now accepted and ready to land.Jul 9 2021, 11:57 AM

Looks good, thanks a lot!

yeah, team work!

I made a few suggestions for (what I think are) clearer variable names, but the logic seems sound.

ack, will check.

How much did you get throttled by the upstream API server in practice?

I see only the one warning demonstrated in the test plan.

How does the result compare with the origins we've noticed in prod?

prod is late in regards to this run but the order of magnitude is the same:
prod: 317924 , docker: 390620

Thanks for the review.

How does the result compare with the origins we've noticed in prod?

prod is late in regards to this run but the order of magnitude is the same:
prod: 317924 , docker: 390620

Cool, it seems that the changelog is exhaustive then, which is great news. For the record, as I mentioned on IRC, I expect that this is caused not by our prod lister being late, but by the changelog seeing all the packages ever created, even if they've been subsequently deleted. The lister we have in prod only lists all the /currently/ available packages.

In a further refinement, we should look into setting enable = false for origins for which the last action in the changelog is a removal of the package. For now this will just trigger the loader on a bunch of removed packages, which will end up in a not_found state.

ardumont marked 2 inline comments as done.
ardumont edited the test plan for this revision. (Show Details)

Attend to the major part of the review (thx)

swh/lister/pypi/lister.py
66

+1

86

I actually don't know how to do what you suggest, i'll check.
Thanks.

86

reading it more slowly, i think i got it.

132–134

thx, +1 again

Build is green

Patch application report for D5977 (id=21571)

Rebasing onto 698be475e9...

Current branch diff-target is up to date.
Changes applied before test
commit f8e5146c4d834c8e08a0ed991a900e4eba2c7fbb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/329/ for more details.

  • Adapt according to last suggestion about stub
  • and from the irc discussion, adding a sleep call to avoid the first throttling due to the initialization of the xmlrpc client [1]

[1] with this the run in docker no longer shows the first warning log about throttling \o/

Build is green

Patch application report for D5977 (id=21576)

Rebasing onto 698be475e9...

Current branch diff-target is up to date.
Changes applied before test
commit df46b220984d7187f0ef193f006d17cfd04cc41a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 7 16:14:09 2021 +0200

    Make PyPI lister incremental and complete in regards to last_update
    
    This rewrote the current implementation to actually use pypi's xml-rpc api which allows
    to be incremental. It also allows to fetch the last release date per package. This last
    part actually make it possible to update the "last_update" entry in the ListedOrigin
    model.
    
    Related to T3399

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/330/ for more details.