Diff Detail
- Repository
- rDLS Listers
- Branch
- lister-gitlab
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 1315 Build 1659: arc lint + arc unit
Event Timeline
I don't think "paging" is the right verb to describe the lister's action here. PageByPageLister or ByPageNumberLister would be more descriptive.
The implementation should remove references to indexes, but rather use pages or page_numbers.
Rather than using the linked list pattern, which works but doesn't help if we want to use parallelism, I think this lister base class should just do a for loop on a range of pages, or a while loop incrementing the page counter.
The run method would therefore take a minimum and maximum page number, and iterate through this page range.
To allow for parallelism, we need to get the number of pages at the beginning. Thankfully GitLab provides this with a header in the requests.
In my mind, the methods a PageByPageLister would need are:
- getting the data for a given page number (ingest_data(page_number), which itself calls safely_issue_request(page_number) which is implemented in the HTTP transport is fine for this purpose)
- getting the number of pages for the given request (this needs a new get_max_page_number_from_response(response) method that the HTTP transport can implement by looking at the 'X-Total-Pages' response header.
The run loop would then just increment the page number until it reaches the max_page_number. The grouped database flush logic can stay.
When that's done, the incremental lister can be implemented fairly easily by adding a hook to add an arbitrary stop condition for the run method: just walk the pages backwards until you find a repo whose id you've already seen before.
I don't think "paging" is the right verb to describe the lister's action here. PageByPageLister or ByPageNumberLister would be more descriptive.
Yes, sounds better. Going with 1.
The implementation should remove references to indexes, but rather use pages or page_numbers.
Ok.
Rather than using the linked list pattern, which works but doesn't help if we want to use parallelism, I think this lister base class should just do a for loop on a range of pages, or a while loop incrementing the page counter.
The run method would therefore take a minimum and maximum page number, and iterate through this page range.
To allow for parallelism, we need to get the number of pages at the beginning. Thankfully GitLab provides this with a header in the requests.
Yes, that's sensibly (except for the linked list thing) what i have in my latest commits (never got a chance yet to finish it since friday...).
As it's still wip, i did not push it.
It misses the actual implementation on the hook that says, 'dejavu, stop!'
In my mind, the methods a PageByPageLister would need are:
- getting the data for a given page number (ingest_data(page_number), which itself calls safely_issue_request(page_number) which is implemented in the HTTP transport is fine for this purpose)
- getting the number of pages for the given request (this needs a new get_max_page_number_from_response(response) method that the HTTP transport can implement by looking at the 'X-Total-Pages' response header.
ok, i have this already, will need to rework it a little, mostly the name are not that good (surprisingly!).
The run loop would then just increment the page number until it reaches the max_page_number. The grouped database flush logic can stay.
When that's done, the incremental lister can be implemented fairly easily by adding a hook to add an arbitrary stop condition for the run method: just walk the pages backwards until you find a repo whose id you've already seen before.
Ok, will see how i can adapt and finalize my actual wip.
Thanks.
- swh.lister.paging_lister: Improve lister's base class name
- swh.lister: Do not hardcode the index notion into parameter names
- swh.lister.gitlab: Improve headers extraction
- swh.lister.gitlab.tasks: Remove spurious comma
- swh.lister.core.paging_lister: Adding comments