Page MenuHomeSoftware Heritage

Reimplement the GitHub lister using the new pattern class
ClosedPublic

Authored by olasd on Jul 16 2020, 12:28 PM.

Details

Summary

This also replaces the test data with some manually generated answers, which
allows us to test a few more cases for instantiating the lister.

Depends on D3526.

Test Plan

tox with rewritten tests

Diff Detail

Event Timeline

Build is green

Patch application report for D3527 (id=12457)

Could not rebase; Attempt merge onto 5f1fbbe8a4...

Updating 5f1fbbe..17a011a
Fast-forward
 MANIFEST.in                                        |    3 +-
 conftest.py                                        |   19 +
 requirements-swh.txt                               |    2 +-
 swh/lister/__init__.py                             |   18 +-
 swh/lister/core/tests/conftest.py                  |    4 +-
 swh/lister/github/__init__.py                      |    3 +-
 swh/lister/github/lister.py                        |  231 +-
 swh/lister/github/models.py                        |   17 -
 swh/lister/github/tasks.py                         |   34 +-
 .../data/https_api.github.com/empty_response.json  |    1 -
 .../data/https_api.github.com/first_response.json  |    1 -
 .../data/https_api.github.com/next_response.json   | 6702 -------------------
 .../data/https_api.github.com/repositories,since=0 | 6706 --------------------
 swh/lister/github/tests/test_lister.py             |  243 +-
 swh/lister/github/tests/test_tasks.py              |   55 +-
 swh/lister/pattern.py                              |  236 +
 swh/lister/tests/test_cli.py                       |    9 +-
 swh/lister/tests/test_pattern.py                   |   93 +
 18 files changed, 757 insertions(+), 13620 deletions(-)
 delete mode 100644 swh/lister/github/models.py
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/empty_response.json
 delete mode 120000 swh/lister/github/tests/data/https_api.github.com/first_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/next_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/repositories,since=0
 create mode 100644 swh/lister/pattern.py
 create mode 100644 swh/lister/tests/test_pattern.py
Changes applied before test
commit 17a011ab2f9dddd82ed40d28335e4f655d10c426
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 12:12:31 2020 +0200

    Reimplement the GitHub lister using the new pattern class
    
    This also replaces the test data with some manually generated answers, which
    allows us to test a few more cases for instantiating the lister.

commit 91ccd2bd25682e8b1d2f6e0e5ea706fddd1f5abe
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 11:59:08 2020 +0200

    Hook up listers implemented with the new pattern to the CLI
    
    We stop depending on the ListerBase implementation. The main hoop we're jumping
    through is the config override mechanism in swh.lister.get_lister, as it's
    really specifc to the ListerBase `override_config` argument, which is dropped in
    pattern.Lister (in favor of overriding a `load_config` method).
    
    This generic configuration override mechanism will probably be completely
    superseded when all lister instances are reimplemented. We'll see.

commit 4d9db6e09f500285f2f1a484aeb6c11be674394c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jul 6 10:27:57 2020 +0200

    Intoduce a simpler base pattern for lister implementations.
    
    This new pattern uses the lister support features introduced in swh.scheduler to
    replace the database management done in previous iterations of the listers.

commit 211f4610df2006e939bb3f6e8f2cb7b87151dfba
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 11:45:58 2020 +0200

    Move get_scheduler monkeypatching into an explicit pytest fixture
    
    This allows us to actually run the lister instantiation code instead of relying
    on the underlying structure of the lister object. In turn, this allows future
    listers to use the scheduler right in their __init__.

commit d0c1df65f1f4b05cd9e766ee3916d25182ecdb25
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 10:27:54 2020 +0200

    Only include relevant data files in MANIFEST.in
    
    The previous include would pull all .mypy_cache directories too, which are quite
    large as they include the SQLAlchemy stubs.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/24/ for more details.

nice refactoring work !

swh/lister/github/lister.py
101

(very minor point) the per_page argument can be made a class attribute here

it is also referenced from the matching test class, but I'm not sure it needs to be kept in sync with the value here or not

Nice.

I have yet to read the production code part again (they say third is the charm ;)
But that looks good already.

Thanks.

(I did not finish reading the tests though, reaching the end of something is hard today for my part ¯\_(ツ)_/¯ ;)

Really cool work, thanks ;)

Simpler, concerns are separated, tested ;)

Exciting times ;)

We can comprehend what happens from the main lister code.

(I don't think that was the case before except in some more recent listers
which diverged voluntarily from the base code like cran or cgit i think ;)

I have a few remarks and questions above but nothing blocking ;)

swh/lister/github/lister.py
61

why the [:]?

(checking the top-level with a bare list, i don't see a difference)

73

Maybe explicit the current refresh policy chosen in the docstring.
(as model for other contributions)

Look, ma, explicit docstring, no need to read the code (<- most probably a lie heh ;)
84

Can we please have tests on that part?

(I recall i had changes to do there in the past, and with no test, it was
annoying ¯\_(ツ)_/¯)

swh/lister/github/tests/test_lister.py
45

unfinished docstring ;)

86

exists

150

lol, never seen that
(granted i must not read enough python ;)

Here is my test ;)

In [10]: 10 + True
Out[10]: 11

In [11]: 10 + False
Out[11]: 10
180

I don't understand the test (because i don't understand what state we start from).

This revision is now accepted and ready to land.Jul 17 2020, 9:49 AM
olasd marked 2 inline comments as done.

Much more test coverage around the ratelimit behavior, which means more code changes

olasd requested review of this revision.Jul 28 2020, 6:54 PM
olasd marked 4 inline comments as done.
olasd marked 2 inline comments as done.

Build is green

Patch application report for D3527 (id=12787)

Could not rebase; Attempt merge onto 211f4610df...

Updating 211f461..313f8ba
Fast-forward
 conftest.py                                        |    7 +-
 requirements-swh.txt                               |    2 +-
 swh/lister/__init__.py                             |   18 +-
 swh/lister/github/__init__.py                      |    3 +-
 swh/lister/github/lister.py                        |  300 +-
 swh/lister/github/models.py                        |   17 -
 swh/lister/github/tasks.py                         |   34 +-
 .../data/https_api.github.com/empty_response.json  |    1 -
 .../data/https_api.github.com/first_response.json  |    1 -
 .../data/https_api.github.com/next_response.json   | 6702 -------------------
 .../data/https_api.github.com/repositories,since=0 | 6706 --------------------
 swh/lister/github/tests/test_lister.py             |  485 +-
 swh/lister/github/tests/test_tasks.py              |   55 +-
 swh/lister/pattern.py                              |  239 +
 swh/lister/tests/test_cli.py                       |    7 +-
 swh/lister/tests/test_pattern.py                   |   93 +
 16 files changed, 1060 insertions(+), 13610 deletions(-)
 delete mode 100644 swh/lister/github/models.py
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/empty_response.json
 delete mode 120000 swh/lister/github/tests/data/https_api.github.com/first_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/next_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/repositories,since=0
 create mode 100644 swh/lister/pattern.py
 create mode 100644 swh/lister/tests/test_pattern.py
Changes applied before test
commit 313f8ba8477fc2a187dc190cacdfa9d200063c69
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jul 28 18:47:40 2020 +0200

    Reimplement the GitHub lister using the new pattern class
    
    This also replaces the test data with some manually generated answers, which
    allows us to test a few more cases for instantiating the lister.
    
    Widely expand test coverage to test behavior on rate-limited requests

commit 964e09764524ba4f4dfafb15ecbf72eaa927cfdb
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 11:59:08 2020 +0200

    Hook up listers implemented with the new pattern to the CLI
    
    We stop depending on the ListerBase implementation. The main hoop we're jumping
    through is the config override mechanism in swh.lister.get_lister, as it's
    really specifc to the ListerBase `override_config` argument, which is dropped in
    pattern.Lister (in favor of overriding a `load_config` method).
    
    This generic configuration override mechanism will probably be completely
    superseded when all lister instances are reimplemented. We'll see.

commit 7ab4f69692af27b47a41566fc6db7aeeba4dda66
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jul 6 10:27:57 2020 +0200

    Introduce a simpler base pattern for lister implementations.
    
    This new pattern uses the lister support features introduced in swh.scheduler to
    replace the database management done in previous iterations of the listers.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/27/ for more details.

douardda added inline comments.
swh/lister/github/lister.py
58

am I missing something or could this just be self.first_id = first_id ?

180

For some reason, github states that

Note: It's important to form calls with Link header values instead of constructing your own URLs.

which is not what we are doing here.
Not sure if it deserves a comment or not.

(see https://developer.github.com/v3/#pagination )

vlorentz added inline comments.
swh/lister/github/lister.py
180

I think we should use the Link directly; because 1. they ask us to 2. they might change their scheme without notice, and we would need to "reverse" it again.

swh/lister/github/lister.py
180

well, if we want to stick to this statement, then we cannot really implement the relisting behavior (with given limits), but only the (incremental) full listing (using this next link as stored state).

That's debatable, along with a few questions, like do we really need the relisting mode? if so, why not make it persistent (which would imply having a "lister run" identifier stored since there could be several listing concurrently in process for a given lister instance).

I'd say let's go for it as @olasd wrote for now...

ardumont added inline comments.
swh/lister/github/lister.py
58

looks that way yeah ;)

This revision is now accepted and ready to land.Aug 24 2020, 12:26 PM
olasd marked 2 inline comments as done.

Apply a bunch of comments and clarify some of the code (hopefully?)
Depends on D4700

olasd marked 3 inline comments as done.Dec 9 2020, 6:19 PM
olasd added inline comments.
swh/lister/github/lister.py
180

I've switched the code to use the url in the next link header directly. (We still parse it to recover the repo id).

Build is green

Patch application report for D3527 (id=16670)

Could not rebase; Attempt merge onto d2f4781669...

Updating d2f4781..2c7edd5
Fast-forward
 swh/lister/__init__.py                             |    9 +-
 swh/lister/github/__init__.py                      |    3 +-
 swh/lister/github/lister.py                        |  324 +-
 swh/lister/github/models.py                        |   17 -
 swh/lister/github/tasks.py                         |   34 +-
 .../data/https_api.github.com/empty_response.json  |    1 -
 .../data/https_api.github.com/first_response.json  |    1 -
 .../data/https_api.github.com/next_response.json   | 6702 -------------------
 .../data/https_api.github.com/repositories,since=0 | 6706 --------------------
 swh/lister/github/tests/test_lister.py             |  466 +-
 swh/lister/github/tests/test_tasks.py              |   62 +-
 swh/lister/pattern.py                              |  260 +
 swh/lister/tests/test_cli.py                       |   18 +-
 swh/lister/tests/test_pattern.py                   |  113 +
 14 files changed, 1105 insertions(+), 13611 deletions(-)
 delete mode 100644 swh/lister/github/models.py
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/empty_response.json
 delete mode 120000 swh/lister/github/tests/data/https_api.github.com/first_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/next_response.json
 delete mode 100644 swh/lister/github/tests/data/https_api.github.com/repositories,since=0
 create mode 100644 swh/lister/pattern.py
 create mode 100644 swh/lister/tests/test_pattern.py
Changes applied before test
commit 2c7edd50e8f72b6f6f28e200aeb18a9fdfa42eed
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Dec 9 18:15:28 2020 +0100

    Reimplement the GitHub lister using the new pattern class
    
    This replaces the test data with some manually generated answers, which allows
    us to test a few more cases for instantiating the lister.
    
    This also expands test coverage to test behavior on rate-limited requests.

commit 4dab6476a0916337a38d3311b2e3b908d19f438f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Nov 25 20:40:40 2020 +0100

    Add a helper to instantiate a new-style lister from a config file
    
    This helper will be used in the task entry points.

commit 45761dcb5f782ee3045184f4dbc208ce08a4fe25
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Jul 16 11:59:08 2020 +0200

    Hook up listers implemented with the new pattern to the CLI
    
    We stop depending on the ListerBase implementation. The main hoop we're jumping
    through is the config override mechanism in swh.lister.get_lister, as it's
    really specifc to the ListerBase `override_config` argument, which is dropped in
    pattern.Lister (in favor of explicit arguments at lister instantiation).
    
    We implement a small shim in swh.lister.pattern.Lister to give
    backwards-compatibility for the new pattern to get_lister.
    
    This generic configuration override mechanism will probably be completely
    removed when the configuration mechanism is reworked. We'll see.

commit 011928d262546c3e2af139efea854f6ea80a09e0
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jul 6 10:27:57 2020 +0200

    Introduce a simpler base pattern for lister implementations.
    
    This new pattern uses the lister support features introduced in swh.scheduler to
    replace the database management done in previous iterations of the listers.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/73/ for more details.