Page MenuHomeSoftware Heritage

Reimplement PyPI lister using new Lister API
ClosedPublic

Authored by tenma on Jan 14 2021, 6:56 PM.

Details

Summary

This is a straight port of the old lister.
The lister has only full listing capability.
It scrapes pypi.org list of packages.
Rate-limiting was not encountered but is handled generically.

Related to T2956

Test Plan

tox

Diff Detail

Repository
rDLS Listers
Branch
pypi
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18383
Build 28405: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28404: arc lint + arc unit

Event Timeline

tenma edited the test plan for this revision. (Show Details)
tenma added a project: Lister.

Build has FAILED

Patch application report for D4867 (id=17237)

Could not rebase; Attempt merge onto c782275296...

Updating c782275..a51be23
Fast-forward
 requirements-test.txt                              |   1 +
 swh/lister/bitbucket/__init__.py                   |   9 +-
 swh/lister/bitbucket/lister.py                     | 269 +++++--
 swh/lister/bitbucket/models.py                     |  16 -
 swh/lister/bitbucket/tasks.py                      |  68 +-
 swh/lister/bitbucket/tests/conftest.py             |   2 +-
 .../tests/data/bb_api_repositories_page1.json      | 124 ++++
 .../tests/data/bb_api_repositories_page2.json      | 123 ++++
 ...ies,after=1970-01-01T00:00:00+00:00,pagelen=100 | 806 ---------------------
 .../https_api.bitbucket.org/empty_response.json    |   4 -
 .../data/https_api.bitbucket.org/response.json     |   1 -
 swh/lister/bitbucket/tests/test_lister.py          | 290 +++++---
 swh/lister/bitbucket/tests/test_tasks.py           |  81 +--
 swh/lister/pypi/__init__.py                        |   3 +-
 swh/lister/pypi/lister.py                          | 106 ++-
 swh/lister/pypi/tasks.py                           |  10 +-
 swh/lister/pypi/tests/test_lister.py               |  53 ++
 swh/lister/pypi/tests/test_tasks.py                |   5 +-
 18 files changed, 845 insertions(+), 1126 deletions(-)
 delete mode 100644 swh/lister/bitbucket/models.py
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/2.0_repositories,after=1970-01-01T00:00:00+00:00,pagelen=100
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/empty_response.json
 delete mode 120000 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/response.json
Changes applied before test
commit a51be23a3617d95a20c937e4ebd8d18bf3716861
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

commit 4dd90ca2f489f406ef924daad33832a38fef96b1
Author: tenma <tenma+swh@mailbox.org>
Date:   Wed Jan 13 15:44:07 2021 +0100

    [WIP] Reimplement Bitbucket lister using new Lister API
    
    The new lister has incremental and full listing capability.
    It can request the Bitbucket API in anonymous and HTTP basic authentication
    modes. Rate-limiting is not aggressive and is handled.
    Listing mode, credentials and pagination parameters can be updated after
    creation.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/96/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/96/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 14 2021, 6:59 PM
Harbormaster failed remote builds in B18380: Diff 17237!

Removed commented out code

Build has FAILED

Patch application report for D4867 (id=17239)

Could not rebase; Attempt merge onto c782275296...

Updating c782275..480caad
Fast-forward
 requirements-test.txt                              |   1 +
 swh/lister/bitbucket/__init__.py                   |   9 +-
 swh/lister/bitbucket/lister.py                     | 269 +++++--
 swh/lister/bitbucket/models.py                     |  16 -
 swh/lister/bitbucket/tasks.py                      |  68 +-
 swh/lister/bitbucket/tests/conftest.py             |   2 +-
 .../tests/data/bb_api_repositories_page1.json      | 124 ++++
 .../tests/data/bb_api_repositories_page2.json      | 123 ++++
 ...ies,after=1970-01-01T00:00:00+00:00,pagelen=100 | 806 ---------------------
 .../https_api.bitbucket.org/empty_response.json    |   4 -
 .../data/https_api.bitbucket.org/response.json     |   1 -
 swh/lister/bitbucket/tests/test_lister.py          | 290 +++++---
 swh/lister/bitbucket/tests/test_tasks.py           |  81 +--
 swh/lister/pypi/__init__.py                        |   3 +-
 swh/lister/pypi/lister.py                          | 131 ++--
 swh/lister/pypi/tasks.py                           |  10 +-
 swh/lister/pypi/tests/test_lister.py               |  53 ++
 swh/lister/pypi/tests/test_tasks.py                |   5 +-
 18 files changed, 827 insertions(+), 1169 deletions(-)
 delete mode 100644 swh/lister/bitbucket/models.py
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/2.0_repositories,after=1970-01-01T00:00:00+00:00,pagelen=100
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/empty_response.json
 delete mode 120000 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/response.json
Changes applied before test
commit 480caadc455e282441ff56b81bb931e96fb35149
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

commit 4dd90ca2f489f406ef924daad33832a38fef96b1
Author: tenma <tenma+swh@mailbox.org>
Date:   Wed Jan 13 15:44:07 2021 +0100

    [WIP] Reimplement Bitbucket lister using new Lister API
    
    The new lister has incremental and full listing capability.
    It can request the Bitbucket API in anonymous and HTTP basic authentication
    modes. Rate-limiting is not aggressive and is handled.
    Listing mode, credentials and pagination parameters can be updated after
    creation.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/97/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/97/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 14 2021, 7:06 PM
Harbormaster failed remote builds in B18381: Diff 17239!
tenma edited the test plan for this revision. (Show Details)

Fix offending test, thanks anlambert

Build is green

Patch application report for D4867 (id=17240)

Could not rebase; Attempt merge onto c782275296...

Updating c782275..083e858
Fast-forward
 requirements-test.txt                              |   1 +
 swh/lister/bitbucket/__init__.py                   |   9 +-
 swh/lister/bitbucket/lister.py                     | 269 +++++--
 swh/lister/bitbucket/models.py                     |  16 -
 swh/lister/bitbucket/tasks.py                      |  68 +-
 swh/lister/bitbucket/tests/conftest.py             |   2 +-
 .../tests/data/bb_api_repositories_page1.json      | 124 ++++
 .../tests/data/bb_api_repositories_page2.json      | 123 ++++
 ...ies,after=1970-01-01T00:00:00+00:00,pagelen=100 | 806 ---------------------
 .../https_api.bitbucket.org/empty_response.json    |   4 -
 .../data/https_api.bitbucket.org/response.json     |   1 -
 swh/lister/bitbucket/tests/test_lister.py          | 290 +++++---
 swh/lister/bitbucket/tests/test_tasks.py           |  81 +--
 swh/lister/pypi/__init__.py                        |   3 +-
 swh/lister/pypi/lister.py                          | 131 ++--
 swh/lister/pypi/tasks.py                           |  10 +-
 swh/lister/pypi/tests/test_lister.py               |  53 ++
 swh/lister/pypi/tests/test_tasks.py                |   9 +-
 18 files changed, 829 insertions(+), 1171 deletions(-)
 delete mode 100644 swh/lister/bitbucket/models.py
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json
 create mode 100644 swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/2.0_repositories,after=1970-01-01T00:00:00+00:00,pagelen=100
 delete mode 100644 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/empty_response.json
 delete mode 120000 swh/lister/bitbucket/tests/data/https_api.bitbucket.org/response.json
Changes applied before test
commit 083e8585a50e3d2ece89e6f001befa0561d6312f
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

commit 4dd90ca2f489f406ef924daad33832a38fef96b1
Author: tenma <tenma+swh@mailbox.org>
Date:   Wed Jan 13 15:44:07 2021 +0100

    [WIP] Reimplement Bitbucket lister using new Lister API
    
    The new lister has incremental and full listing capability.
    It can request the Bitbucket API in anonymous and HTTP basic authentication
    modes. Rate-limiting is not aggressive and is handled.
    Listing mode, credentials and pagination parameters can be updated after
    creation.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/98/ for more details.

tenma requested review of this revision.Jan 14 2021, 7:36 PM

Build is green

Patch application report for D4867 (id=17241)

Rebasing onto c782275296...

Current branch diff-target is up to date.
Changes applied before test
commit 9de12fb5d2fef03177be38180a8cdd851a79a6a5
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/99/ for more details.

Use requests.Response.raise_for_status to handle errors generically

Build is green

Patch application report for D4867 (id=17245)

Rebasing onto c782275296...

Current branch diff-target is up to date.
Changes applied before test
commit 46a216ee820f87e88c7104b8c8e991f02813e6f8
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/100/ for more details.

tenma retitled this revision from [WIP] Reimplement PyPI lister using new Lister API to Reimplement PyPI lister using new Lister API.Jan 15 2021, 10:12 AM

Thanks!

I have made a few comments inline. All in all, as this lister makes a single request, I think we can drop the whole ratelimit handling logic which will make it much, much simpler.

You'll also need to remove the models.py file which should not be used any longer, as well as clean up the commented / stringified old code.

swh/lister/pypi/lister.py
21–29

This lister makes a single request, so I don't think the retry/rate limit logic is really needed. We can just restart the lister if it fails.

82

This shuffle isn't needed anymore: we can just insert all origins in order.

swh/lister/pypi/tasks.py
8–12

I guess this comment can go away.

swh/lister/pypi/tests/test_lister.py
19–41

Please remove this as it's not being used anymore.

48–52

I guess this would deserve a comment about what's in the mocked response, because that feels a bit magic.

Please also check the contents of the listed origins (that is, that the origin urls are what you'd expect).

55

If the ratelimit logic goes away, then this test can go away too (as well as the pages_content fixture).

olasd requested changes to this revision.Jan 15 2021, 11:16 AM
This revision now requires changes to proceed.Jan 15 2021, 11:16 AM

In short, let's KISS!

swh/lister/pypi/lister.py
21–29

Thanks, was not sure about that, so I made the conservative choice of doing because it was implicitly handled before and may happen in production.

Will remove.

  • remove useless code as per feeback
  • improve tests

Build is green

Patch application report for D4867 (id=17290)

Rebasing onto c782275296...

Current branch diff-target is up to date.
Changes applied before test
commit c087c93e8e2a81d93b846c50dff3d293e54ac713
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/108/ for more details.

swh/lister/pypi/lister.py
126

available on PyPI JSON API

swh/lister/pypi/tasks.py
17

Full listing of the PyPI registry

swh/lister/pypi/tests/test_lister.py
41

s/origins url/origin URLs/

64

you can remove that line

67

same here

swh/lister/pypi/tests/test_tasks.py
30

Use assert_called_once instead.

Looks fine to me, thanks. Please apply @anlambert's remaining comments before landing! :)

swh/lister/pypi/tests/test_tasks.py
30

assert_called_once() doesn't check what arguments have been used, which assert_called_once_with() does, so I think using it is fine?

This revision is now accepted and ready to land.Jan 18 2021, 5:14 PM
swh/lister/pypi/tests/test_tasks.py
30

Right, I should have proofread unittest doc, invalid comment here.

tenma marked 8 inline comments as done.

Clean code and fix page type

Build is green

Patch application report for D4867 (id=17331)

Rebasing onto 9fd91f007d...

Current branch diff-target is up to date.
Changes applied before test
commit a0ef8dfd795e96958a07847a6bd101f8dac1d9f8
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Jan 14 18:47:26 2021 +0100

    [WIP] Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/122/ for more details.

Remove "WIP" from the commit message
Remove the lister model

Build is green

Patch application report for D4867 (id=17400)

Rebasing onto 565e7423e3...

Current branch diff-target is up to date.
Changes applied before test
commit 62c825b8cb80aa66b4a04af88635a662c11b6fb2
Author: tenma <tenma+swh@mailbox.org>
Date:   Wed Jan 20 15:43:07 2021 +0100

    Reimplement PyPI lister using new Lister API
    
    The new lister has only full listing capability.
    It scrapes pypi.org list of packages.
    Rate-limiting was not encountered but is handled generically.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/123/ for more details.

This revision was automatically updated to reflect the committed changes.