Page MenuHomeSoftware Heritage

Bootstrap swh.lister.gitlab
ClosedPublic

Authored by ardumont on Jun 28 2018, 4:42 PM.

Details

Summary

It's wip as there remains stuff to add:

  • tests
  • checking how to filter appropriately (if we need it, i don't seem to find duplicates)
  • ensure rate limit policy is ok (doubt about the returned code)
  • remove what's not needed in the tasks module

But some review would be welcome already ;)

Local runs are fine:

  • origins are created in local swh db (good ones with actual https which loads through git loader)
  • local gitlab cache model is also populated (don't seem to find duplicates either)
  • tasks are scheduled
Test Plan

None yet, will be added

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

This first draft looks really good, thanks!

I'm afraid I have *ideas* about how this lister should look, and therefore here's a bunch of comments.

One cosmetic issue first to get it out of the way:

  • All class names should use GitLab (mixed case) instead of Gitlab, for consistency with the GitHub and BitBucket-related classes as well as the upstream capitalization.

There's one core functionality issue that needs to be fixed: right now gitlab.com is hardcoded :-) To fix that, I see a few things we need to do:

  • add an "instance" column to the model that would contain the instance URL/name
  • replace the model primary key with either a compound "instance + uid" primary key, or with a bespoke identifier
  • make sure credentials that are stored in the yml config are "indexed" by instance.
  • make sure the task classes take the instance URL as argument in their run_task method. To do so, I think we could make the ListerTaskBase class pass arguments to the new_lister function.

One more architectural question I have is whether it makes sense to force the use of the "indexing" lister, when the GitLab pagination is entirely different than either GitHub or BitBucket: using a sequential page number instead of a repository identifier (or a creation date) will break lots of assumptions of the SWHIndexingLister.run method. Going back to SWHListerHttpTransport + SWHListerBase instead may be less brittle. This would also allow use of the full functionality of the GitLab API, notably for parallelization and incremental listings.

A few GitLab API functionality comments:

  • the default sort order is unstable (I think it sorts by last activity descending); you'll want to add more arguments to the project list URL: order_by=id&sort=asc&per_page=100&simple=true
  • full listing could be parallelized even with an empty database, by using the X-Total-Pages header in the first reply
  • incremental listing could be done by scrolling the pages backwards (start at the X-Total-Pages and go down until we find a repository id we already have), or just by reversing the sort order (sort=desc instead of sort=asc)

Naturally, i forgot to submit my initial remarks... submitting.

README.md
51

Here, i'd like to discuss to move bin/ghlister (and the other files bound to the github lister) within the swh.lister.github module.
And for this use the more generic swh.lister.cli introduced in this diff.

swh/lister/gitlab/tasks.py
23 ↗(On Diff #1121)

Here, i need to dig further about opening only what's needed.
I used only RangeGitlabLister so far.

This first draft looks really good, thanks!

cool.

I'm afraid I have *ideas* about how this lister should look, and therefore here's a bunch of comments.

Ideas are cool!

One cosmetic issue first to get it out of the way:
All class names should use GitLab (mixed case) instead of Gitlab, for consistency with the GitHub and BitBucket-related classes as well as the upstream capitalization.

Ok I was wondering about the camel case use. I'll align then.

There's one core functionality issue that needs to be fixed: right now gitlab.com is hardcoded :-) To fix that, I see a few things we need to do:

Yes, i forgot to mention that was something to upgrade as well.
I have a local todo list and it's on it ;)

  • add an "instance" column to the model that would contain the instance URL/name
  • replace the model primary key with either a compound "instance + uid" primary key, or with a bespoke identifier
  • make sure credentials that are stored in the yml config are "indexed" by instance.
  • make sure the task classes take the instance URL as argument in their run_task method. To do so, I think we could make the ListerTaskBase class pass arguments to the new_lister function.

Sounds good!

One more architectural question I have is whether it makes sense to force the use of the "indexing" lister,

Well, it does not!

That's the main reason why i pushed it as wip, to discuss this.

when the GitLab pagination is entirely different than either GitHub or BitBucket: using a sequential page number instead of a repository identifier (or a creation date) will break lots of assumptions of the SWHIndexingLister.run method. Going back to SWHListerHttpTransport + SWHListerBase instead may be less brittle.

Yes.

Question is whether i implement it directly in the gitlab lister?

This would also allow use of the full functionality of the GitLab API, notably for parallelization and incremental listings.

ok, I would have to dig further for that part though.

A few GitLab API functionality comments:
the default sort order is unstable (I think it sorts by last activity descending);

on current api version (v4), i recall order by created_at desc. [1]

you'll want to add more arguments to the project list URL: order_by=id&sort=asc&per_page=100&simple=true

yes, i need to check if simple is enough though (from memory it's id, url, name, and path, might be enough :)

full listing could be parallelized even with an empty database, by using the X-Total-Pages header in the first reply
incremental listing could be done by scrolling the pages backwards (start at the X-Total-Pages and go down until we find a repository id we already have), or just by reversing the sort order (sort=desc instead of sort=asc)

Thanks for the heads up.

I'm afraid I have *ideas* about how this lister should look

Like i said, ideas are cool ;)
Thanks.

[1] https://gitlab.com/help/api/projects.md#list-all-projects

Cheers,

This revision was not accepted when it landed; it landed in state Needs Review.Jul 3 2018, 12:23 PM
This revision was automatically updated to reflect the committed changes.

Something went wrong here.

I separated the gitlab lister code from the amended documentation code in order to clarify the diff.
I pushed the master branch (only holding the commits regarding the documentation patches) to origin/master.
And that closed the diff (even though none of those commits holds any commit action in their message).

To explicit this, I did not want to close the diff.
And I do not see any way to open it again.

I will open a new diff centered solely on the gitlab lister code.