Page MenuHomeSoftware Heritage

Bootstrap swh.lister.gitlab
ClosedPublic

Authored by ardumont on Jun 28 2018, 4:42 PM.

Details

Summary

It's wip as there remains stuff to add:

  • tests
  • checking how to filter appropriately (if we need it, i don't seem to find duplicates)
  • ensure rate limit policy is ok (doubt about the returned code)
  • remove what's not needed in the tasks module

But some review would be welcome already ;)

Local runs are fine:

  • origins are created in local swh db (good ones with actual https which loads through git loader)
  • local gitlab cache model is also populated (don't seem to find duplicates either)
  • tasks are scheduled
Test Plan

None yet, will be added

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1264
Build 1608: arc lint + arc unit

Event Timeline

This first draft looks really good, thanks!

I'm afraid I have *ideas* about how this lister should look, and therefore here's a bunch of comments.

One cosmetic issue first to get it out of the way:

  • All class names should use GitLab (mixed case) instead of Gitlab, for consistency with the GitHub and BitBucket-related classes as well as the upstream capitalization.

There's one core functionality issue that needs to be fixed: right now gitlab.com is hardcoded :-) To fix that, I see a few things we need to do:

  • add an "instance" column to the model that would contain the instance URL/name
  • replace the model primary key with either a compound "instance + uid" primary key, or with a bespoke identifier
  • make sure credentials that are stored in the yml config are "indexed" by instance.
  • make sure the task classes take the instance URL as argument in their run_task method. To do so, I think we could make the ListerTaskBase class pass arguments to the new_lister function.

One more architectural question I have is whether it makes sense to force the use of the "indexing" lister, when the GitLab pagination is entirely different than either GitHub or BitBucket: using a sequential page number instead of a repository identifier (or a creation date) will break lots of assumptions of the SWHIndexingLister.run method. Going back to SWHListerHttpTransport + SWHListerBase instead may be less brittle. This would also allow use of the full functionality of the GitLab API, notably for parallelization and incremental listings.

A few GitLab API functionality comments:

  • the default sort order is unstable (I think it sorts by last activity descending); you'll want to add more arguments to the project list URL: order_by=id&sort=asc&per_page=100&simple=true
  • full listing could be parallelized even with an empty database, by using the X-Total-Pages header in the first reply
  • incremental listing could be done by scrolling the pages backwards (start at the X-Total-Pages and go down until we find a repository id we already have), or just by reversing the sort order (sort=desc instead of sort=asc)

Naturally, i forgot to submit my initial remarks... submitting.

README.md
51

Here, i'd like to discuss to move bin/ghlister (and the other files bound to the github lister) within the swh.lister.github module.
And for this use the more generic swh.lister.cli introduced in this diff.

swh/lister/gitlab/tasks.py
23

Here, i need to dig further about opening only what's needed.
I used only RangeGitlabLister so far.

This first draft looks really good, thanks!

cool.

I'm afraid I have *ideas* about how this lister should look, and therefore here's a bunch of comments.

Ideas are cool!

One cosmetic issue first to get it out of the way:
All class names should use GitLab (mixed case) instead of Gitlab, for consistency with the GitHub and BitBucket-related classes as well as the upstream capitalization.

Ok I was wondering about the camel case use. I'll align then.

There's one core functionality issue that needs to be fixed: right now gitlab.com is hardcoded :-) To fix that, I see a few things we need to do:

Yes, i forgot to mention that was something to upgrade as well.
I have a local todo list and it's on it ;)

  • add an "instance" column to the model that would contain the instance URL/name
  • replace the model primary key with either a compound "instance + uid" primary key, or with a bespoke identifier
  • make sure credentials that are stored in the yml config are "indexed" by instance.
  • make sure the task classes take the instance URL as argument in their run_task method. To do so, I think we could make the ListerTaskBase class pass arguments to the new_lister function.

Sounds good!

One more architectural question I have is whether it makes sense to force the use of the "indexing" lister,

Well, it does not!

That's the main reason why i pushed it as wip, to discuss this.

when the GitLab pagination is entirely different than either GitHub or BitBucket: using a sequential page number instead of a repository identifier (or a creation date) will break lots of assumptions of the SWHIndexingLister.run method. Going back to SWHListerHttpTransport + SWHListerBase instead may be less brittle.

Yes.

Question is whether i implement it directly in the gitlab lister?

This would also allow use of the full functionality of the GitLab API, notably for parallelization and incremental listings.

ok, I would have to dig further for that part though.

A few GitLab API functionality comments:
the default sort order is unstable (I think it sorts by last activity descending);

on current api version (v4), i recall order by created_at desc. [1]

you'll want to add more arguments to the project list URL: order_by=id&sort=asc&per_page=100&simple=true

yes, i need to check if simple is enough though (from memory it's id, url, name, and path, might be enough :)

full listing could be parallelized even with an empty database, by using the X-Total-Pages header in the first reply
incremental listing could be done by scrolling the pages backwards (start at the X-Total-Pages and go down until we find a repository id we already have), or just by reversing the sort order (sort=desc instead of sort=asc)

Thanks for the heads up.

I'm afraid I have *ideas* about how this lister should look

Like i said, ideas are cool ;)
Thanks.

[1] https://gitlab.com/help/api/projects.md#list-all-projects

Cheers,

This revision was not accepted when it landed; it landed in state Needs Review.Jul 3 2018, 12:23 PM
This revision was automatically updated to reflect the committed changes.

Something went wrong here.

I separated the gitlab lister code from the amended documentation code in order to clarify the diff.
I pushed the master branch (only holding the commits regarding the documentation patches) to origin/master.
And that closed the diff (even though none of those commits holds any commit action in their message).

To explicit this, I did not want to close the diff.
And I do not see any way to open it again.

I will open a new diff centered solely on the gitlab lister code.