Page MenuHomeSoftware Heritage

refactor github lister into something more generic
ClosedPublic

Authored by fiendish on Feb 18 2017, 6:12 PM.

Details

Summary

Code from the old github lister has been moved, genericized, and extended into lister and model base classes.
Hopefully it will make contribution of new listers easier.

There are also some preliminary unit tests included which stub out scheduler and database accesses.

This is currently in DLSGH by early suggestion. DLSGH will probably be renamed to DLS<something_else_more_generic>.

Test Plan

make test
add more tests
rinse repeat

currently does not test integration with scheduler or storage

Diff Detail

Repository
rDLS Listers
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

longer tests, more refactoring

Awesome. Seems good to me.

olasd will also take a stab at it.

Also, tests pass once dependencies are installed, packages i added:

  • python3-requests-mock
  • python3-sqlalchemy
  • python3-xmltodict
  • python3-arrow

Maybe, adding the pip names in requirements.txt would be good to.

Cheers,

rebase on origin/master, more tests + bug fixes

olasd requested changes to this revision.Feb 20 2017, 5:36 PM
olasd added a reviewer: olasd.
olasd added a subscriber: olasd.

Hey,

A lot of comments inline, I only reviewed the base class for now.

Please axe, with prejudice, all the boilerplate around db_session/requests_session as it should always be available as an instance attribute (those are leftovers of the day the lister was functional rather than object-oriented).

if not db_session:
    db_session = self.db_session
if not db_session:
    with session_scope(self.mk_session) as db_session:

I don't think there's any major objections, it's mostly some style/further cleanup stuff.

Once those have been taken care of I'll look at the rest of the diff!

swh/lister/core/lister.py
66 ↗(On Diff #512)

Should probably be PagingLister or a PaginatedLister (a lister that flips through pages, or a lister for a paginated resource).

We need a high level description of how the lister works and what the process is in the toplevel docstring for this base class, so we know what methods we need to override in subclasses (abstractmethods give a good hint, but aren't enough without a look through the algorithm).

104–124 ↗(On Diff #512)

Those two methods should be merged into a single method returning a swh.storage-compatible origin dict, so it's easier to extend in the future.

126–141 ↗(On Diff #512)

Clever hack although it makes pyflakes yell at the ALLCAPS attribute names. Add "# noqa: N802" to silence this for now, but this might be worth a refactor in swh.core.config.

137 ↗(On Diff #512)

Should be renamed cache_responses (it's not used in production so that's a free rename).

143–149 ↗(On Diff #512)

The lowercase attributes shouldn't be class attributes, but rather be instantiated in __init__.

163–167 ↗(On Diff #512)

It's counterintuitive to be overriding ALLCAPS class attributes in the constructor.

213–234 ↗(On Diff #512)

This function can return a "no retry" return value even when the server wants you to retry ("Retry-After" in the past or "just now"). In that case api_request will fall through and we'll try to parse the body of the 429 error.

Either we need to set a minimum retry cooloff, or distinguish between "request ok" and "retry immediately".

387–388 ↗(On Diff #512)

Even though it's not a frequent operation (once every few months), I'd really like to avoid doing a count(*) on a table of substantial size (as PostgreSQL will make that a sequential scan of the full table).

Maybe split that out into a separate overridable "get_repo_count" method so we can avoid it if possible, e.g. in the case of GitHub with its sequential indices.

448–460 ↗(On Diff #512)

Should be rendered useless by the repo_type/repo_url refactoring.

462–480 ↗(On Diff #512)

Should take the plain origin dict as argument.

502–503 ↗(On Diff #512)

Do we want to keep that around?

This revision now requires changes to proceed.Feb 20 2017, 5:36 PM
swh/lister/core/lister.py
213–234 ↗(On Diff #512)

I figure that getting a negative or 0 delay time will be due to signal latency. By the time you hear the message about when to retry, it's already ok to retry right away. But I can throw in a minimum time. What should it be?

448–460 ↗(On Diff #512)

Can you clarify this annotation?

462–480 ↗(On Diff #512)

Can you clarify this annotation?

502–503 ↗(On Diff #512)

Can you clarify this annotation?

Clarified the annotations inline.

swh/lister/core/lister.py
213–234 ↗(On Diff #512)

I don't think we need a minimal delay if the server tells us to retry right away. We just need to make sure the caller interprets that as a "retry needed now" answer rather than a "go ahead and process" as it seems to be now

448–460 ↗(On Diff #512)

If the repo_type/repo_url abstract methods are replaced by one that just returns the swh.storage-compatible origin dict, then this origin_dict method is not needed anymore.

462–480 ↗(On Diff #512)

If the repo_type/repo_url abstract methods are replaced by one that just returns the swh.storage-compatible origin dict, then this task_dict method should be fed that dict directly.

502–503 ↗(On Diff #512)

I don't think the within_bounds debug statement is needed?

Good job!

I mostly nitpicked in my comments, adding comments about where extra documentation would be nice to have plus a meta design point about how much we want to rely on the existence of an HTTP-based API for listing.

swh/lister/core/abstractattribute.py
7

Can we haz something more meaningful as doc here, e.g., a description of what abstract attributes are usually used for in the context of SWH listers.

swh/lister/core/lister.py
39 ↗(On Diff #518)

A brief description of what a "lister" in SWH context would be nice here. It's something that we are going to need to write anyhow as conceptual basis for presenting the API people will need to implement to realize a new lister. I guess it could go here. One short paragraph might be enough.

Additionally though, we need here developer-oriented documentation explaining what subclass implementors should implement to realize the various functionalities of a lister (e.g., full v. incremental listing).

170–204 ↗(On Diff #518)

So, it bothers me a little bit that this HTTP-specific part is in the base lister. Conceptually, to me, the lister should be agnostic w.r.t. the protocol used to list remote forges. We're already taking a big step here, so I don't think this should be blocking anything, but how critical is having this code here? And what would be the alternatives?

For instance, would moving this (plus everything HTTP specific) to a new HttpAPISwhLister sub-class, that would be uses as base for implementing all listers of forges that do use an HTTP-based API make sense? Or would it be overkill?

swh/lister/core/tasks.py
10

This module looks like a good place where to briefly describe the different listing "discplines" we support, one per task.

swh/lister/github/models.py
1

This makes me realize that the copyright years throughout your refactoring should be the range 2015-2017 in most cases (except for code you wrote from scratch).

fiendish edited edge metadata.
fiendish marked 19 inline comments as done.

shot at making the lister base and intermediate classes agnostic to the transport layer

what does this button do?

LGTM, i just followed up to a couple of minor pending issues — some are new, some follow-up to previous comments.

swh/lister/bitbucket/tests/api_response.txt
1 ↗(On Diff #550)

Can we pretty-print this, e.g. using json-glib-format from json-glib-tools, and possibly rename the file to a .json extension?
Same for api_empty_response.txt (previous hunk)

swh/lister/core/indexing_lister.py
16–18

minor numbering issue here :)

swh/lister/core/tasks.py
10

This is still pending — as shown by the TODO marked right below this hunk — right?

Do you prefer to merge leaving that pending or what?

swh/lister/github/tests/api_empty_response.txt
1 ↗(On Diff #550)

as above, please rename this to .json

swh/lister/github/tests/api_response.txt
1 ↗(On Diff #550)

ditto, please rename to .json, but otherwise this one is already pretty-printed, yay! :)

swh/lister/bitbucket/tests/api_response.txt
1 ↗(On Diff #550)

Into reformatting, but if I name it .json it will also need another variable pointing at the filename instead of using the same one everywhere.

swh/lister/core/tasks.py
10

I don't _want_ to leave it pending. I just don't have a good description of tasks to put in there. :)

Formatted the test responses and added commentary on the life goals of tasks.

zack added a reviewer: zack.
olasd edited edge metadata.
This revision is now accepted and ready to land.Mar 6 2017, 1:22 PM
In D165#3728, @fiendish wrote:

does this work?

@fiendish it does, but shouldn't be needed. When you push the accepted revision via git, differential should close the corresponding diff automatically.

In D165#3734, @zack wrote:
In D165#3728, @fiendish wrote:

does this work?

@fiendish it does, but shouldn't be needed. When you push the accepted revision via git, differential should close the corresponding diff automatically.

Only if you mention the diff number in the commit message (or use arc land) which wasn't done here.

zack mentioned this in Unknown Object (Maniphest Task).Mar 13 2017, 2:55 PM