Page MenuHomeSoftware Heritage

swh.lister.core: Make gitlab lister a paging lister instance
AbandonedPublic

Authored by ardumont on Jul 6 2018, 4:18 PM.

Details

Reviewers
olasd
Group Reviewers
Reviewers

Diff Detail

Repository
rDLS Listers
Branch
lister-gitlab
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1318
Build 1662: arc lint + arc unit

Event Timeline

olasd requested changes to this revision.Jul 6 2018, 7:13 PM
olasd added a subscriber: olasd.

I don't think "paging" is the right verb to describe the lister's action here. PageByPageLister or ByPageNumberLister would be more descriptive.

The implementation should remove references to indexes, but rather use pages or page_numbers.

Rather than using the linked list pattern, which works but doesn't help if we want to use parallelism, I think this lister base class should just do a for loop on a range of pages, or a while loop incrementing the page counter.

The run method would therefore take a minimum and maximum page number, and iterate through this page range.

To allow for parallelism, we need to get the number of pages at the beginning. Thankfully GitLab provides this with a header in the requests.

In my mind, the methods a PageByPageLister would need are:

  • getting the data for a given page number (ingest_data(page_number), which itself calls safely_issue_request(page_number) which is implemented in the HTTP transport is fine for this purpose)
  • getting the number of pages for the given request (this needs a new get_max_page_number_from_response(response) method that the HTTP transport can implement by looking at the 'X-Total-Pages' response header.

The run loop would then just increment the page number until it reaches the max_page_number. The grouped database flush logic can stay.

When that's done, the incremental lister can be implemented fairly easily by adding a hook to add an arbitrary stop condition for the run method: just walk the pages backwards until you find a repo whose id you've already seen before.

This revision now requires changes to proceed.Jul 6 2018, 7:13 PM

I don't think "paging" is the right verb to describe the lister's action here. PageByPageLister or ByPageNumberLister would be more descriptive.

Yes, sounds better. Going with 1.

The implementation should remove references to indexes, but rather use pages or page_numbers.

Ok.

Rather than using the linked list pattern, which works but doesn't help if we want to use parallelism, I think this lister base class should just do a for loop on a range of pages, or a while loop incrementing the page counter.

The run method would therefore take a minimum and maximum page number, and iterate through this page range.

To allow for parallelism, we need to get the number of pages at the beginning. Thankfully GitLab provides this with a header in the requests.

Yes, that's sensibly (except for the linked list thing) what i have in my latest commits (never got a chance yet to finish it since friday...).
As it's still wip, i did not push it.

It misses the actual implementation on the hook that says, 'dejavu, stop!'

In my mind, the methods a PageByPageLister would need are:

  • getting the data for a given page number (ingest_data(page_number), which itself calls safely_issue_request(page_number) which is implemented in the HTTP transport is fine for this purpose)
  • getting the number of pages for the given request (this needs a new get_max_page_number_from_response(response) method that the HTTP transport can implement by looking at the 'X-Total-Pages' response header.

ok, i have this already, will need to rework it a little, mostly the name are not that good (surprisingly!).

The run loop would then just increment the page number until it reaches the max_page_number. The grouped database flush logic can stay.

When that's done, the incremental lister can be implemented fairly easily by adding a hook to add an arbitrary stop condition for the run method: just walk the pages backwards until you find a repo whose id you've already seen before.

Ok, will see how i can adapt and finalize my actual wip.

Thanks.

  • swh.lister.paging_lister: Improve lister's base class name
  • swh.lister: Do not hardcode the index notion into parameter names

Fix according to latest changes

Fix conflict resolution misses (again)

  • swh.lister.gitlab: Improve headers extraction
  • swh.lister.gitlab.tasks: Remove spurious comma
  • swh.lister.core.paging_lister: Adding comments
  • swh.lister.gitlab: Add the presence check for incremental lister