
gitlab: Improve incremental listing
Needs ReviewPublic

Authored by anlambert on Aug 12 2022, 12:43 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

Incremental listing of a GitLab instance now lists repositories
modified since the last listing date; previously, only repositories created
since the last listing date were listed.

We still benefit from GitLab keyset pagination with that extra filtering,
so it seems this type of query is well indexed in the GitLab database.

This should help reduce the lag between archived GitLab repositories
and their upstream states.

HTTP queries are fast whether or not the last-modified date
filtering is used; see the values of the x-runtime response headers below,
obtained when simulating lister execution.

Without date filtering:

12:38 $ curl -I "https://gitlab.com/api/v4/projects?pagination=keyset&per_page=100&order_by=id&sort=asc"
HTTP/2 200 
date: Fri, 12 Aug 2022 10:39:06 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=8764&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: 68d9634e7ee22acf1a318d4fb6a6c28d
x-runtime: 0.760139
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-11-lb-gprd
gitlab-sv: api-gke-us-east1-d
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=mDBrt3ZDsjndbPa8Bhiv8PVRw8r6MUPtYaPR7akoCCCrJAow8dkS1qY2pI6XrqEP735SD20wv0Tr5xS8pwhwSc%2BgyjtSFmpGS0BnHoG4cKdpbMTFJ09ZwW5nd4wxTxCLAcvjd04vens%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739895c9bbab3fef-CDG

12:39 $ curl -I "https://gitlab.com/api/v4/projects?id_after=8764&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false"
HTTP/2 200 
date: Fri, 12 Aug 2022 10:39:47 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=12981&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: ecca8e5fa70c6015fa58eeac586d8c8d
x-runtime: 0.682019
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-30-lb-gprd
gitlab-sv: api-gke-us-east1-b
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=hWgB3Whrhqyy2uMarZNUOvhhIh3A0g6%2FDLJANfXsfDBDDXvjMNmXyOieUcUN6QBB9tQRLQ9B4EyGDhjSbTmk%2B77lQ95gTyjOMR%2BcJBb%2BXQUQEXTv7MNkp9WrvjRfrS2tGCmca%2BuDuPg%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739896cc9a81ee54-CDG

With date filtering:

12:39 $ curl -I "https://gitlab.com/api/v4/projects?pagination=keyset&per_page=100&order_by=id&sort=asc&last_activity_after=2022-08-11T17:29:23.175009+00:00"
HTTP/2 200 
date: Fri, 12 Aug 2022 10:40:39 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=1351755&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: eb3a4364ed41af3fe24643329b473c8a
x-runtime: 0.896180
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-07-lb-gprd
gitlab-sv: api-gke-us-east1-c
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xWbehcX1dR6CsqrRRv931NAU66%2BkdkJvoxOQjjDiJT3EJulZ2qPP%2FzrBASAyCR2Wia%2F%2FAcFqemKfsnSSp2hNYXhQn9YdJgpGYVaigGCtjX1ELb6h0o%2BOLnmxq%2FpIiTimv759gKKbtL4%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 7398981159d6cd9b-CDG

12:41 $ curl -I "https://gitlab.com/api/v4/projects?id_after=1351755&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false"
HTTP/2 200 
date: Fri, 12 Aug 2022 10:41:34 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=2678032&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: 5782f1bf693eb20e0e3751af586a439c
x-runtime: 0.875874
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-08-lb-gprd
gitlab-sv: api-gke-us-east1-d
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=2WtkDeUBmQUsgfYwGNmR23MoCFr5hBvzqgEiKS1D%2FatUNjvF%2FfzTo0Ct6t4j16PWb4PlFdM0fWoJq95XlpusCZzjkg7%2Fsi63Jg4k%2BNKIfzSxsVGk5DYtC1yamwlefKI9HLB4N2Rxpeo%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739899680e8132b8-CDG
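
The requests above can be reproduced programmatically. The following is only a minimal sketch, not the lister's actual code: the helper names `first_page_url` and `next_page_url` are hypothetical, and the query parameters are the ones visible in the curl commands above. It builds the first page URL (with or without the date filter) and extracts the next page URL from a `link` response header.

```python
import re
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

# Assumption: same API root as in the curl commands above.
API_ROOT = "https://gitlab.com/api/v4"


def first_page_url(last_listing: Optional[datetime] = None) -> str:
    """Build the URL of the first page of a keyset-paginated listing.

    When last_listing is given, only projects with activity after that
    date are requested (last_activity_after parameter).
    """
    params = {
        "pagination": "keyset",
        "per_page": 100,
        "order_by": "id",
        "sort": "asc",
    }
    if last_listing is not None:
        # urlencode percent-encodes ':' and '+' in the ISO 8601 timestamp.
        params["last_activity_after"] = last_listing.isoformat()
    return f"{API_ROOT}/projects?{urlencode(params)}"


def next_page_url(link_header: str) -> Optional[str]:
    """Return the rel="next" URL of a link header, or None if absent."""
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return match.group(1) if match else None


if __name__ == "__main__":
    since = datetime(2022, 8, 11, 17, 29, 23, 175009, tzinfo=timezone.utc)
    print(first_page_url())       # full listing
    print(first_page_url(since))  # incremental listing
```

A client would request the first URL, read the `link` header of each response, and keep following the `rel="next"` URL until `next_page_url` returns `None`.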

Related to P1420

Diff Detail

Repository
rDLS Listers
Branch
gitlab-incremental-lister-improvement
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30771
Build 48116: Phabricator diff pipeline on jenkins
Build 48115: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8240 (id=29719)

Rebasing onto cee6bcb514...

Current branch diff-target is up to date.
Changes applied before test
commit 0705dd1603d38eace0335e1fce765d6cac7c4990
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Aug 12 12:16:23 2022 +0200

    gitlab: Improve incremental listing
    
    Incremental listing of a GitLab instance will now list repositories
    modified since last listing date, previously only repositories created
    since last listing date were listed.
    
    We still benefit from GitLab keyset pagination with that extra filtering
    so it seems those type of queries are well indexed in GitLab database.
    
    This should help reducing the lag between archived GitLab repositories
    and their upstream states.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/583/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/583/console

Harbormaster returned this revision to the author for changes because remote builds failed.
Aug 12 2022, 12:48 PM
Harbormaster failed remote builds in B30771: Diff 29719!

Build is green

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/584/ for more details.

Thanks, this would be a welcome change.

However, this:

We still benefit from GitLab keyset pagination with that extra filtering
so it seems those type of queries are well indexed in GitLab database.

is just not possible with what I can see of the upstream PostgreSQL design.

What happens is either:

  • for each page fetched, upstream's database server goes through rows by increasing id (starting from the "keyset pagination start id") and only sends us the rows with a recent last_activity_at. In that case, the number of rows scanned by the upstream database server over the run of the lister is one for every known repo: we go through the whole list of repos.
  • or, for each page fetched, upstream's database server fetches all rows with a recent last_activity_at, sorts them all by id, and returns the 100 rows after the "keyset pagination id". In that case, the number of rows scanned per page is proportional to the number of repos with recent activity, and so is the number of pages, so the total over the run of the lister is proportional to the square of the number of repos with recent activity.

Which of these queries is used will depend on what the query planner thinks either index (on id or on last_activity_at) is worth. This seems like a good area for creating some poorly controlled load on the server side (and an eventual emergency throttling of our requests), so I'd really like us to confirm with upstream that this combined filtering + keyset pagination is intended behavior before we commit to it.
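
To make the two cost estimates concrete, here is a toy model of both strategies. It is only a sketch under simplifying assumptions: it does not reflect GitLab's actual query plans, the function names are hypothetical, and "rows scanned" is counted in the crudest possible way (one unit per row the server has to look at).

```python
from typing import Sequence


def rows_scanned_by_id(recent: Sequence[bool], page_size: int) -> int:
    """Strategy 1: walk rows by increasing id from the cursor, filter on activity.

    recent[i] is True when the repo with id i has recent activity.
    """
    scanned = 0
    cursor = 0  # first row not yet examined
    while cursor < len(recent):
        matched = 0
        # One page: scan forward until page_size matching rows are found.
        while cursor < len(recent) and matched < page_size:
            scanned += 1
            matched += recent[cursor]
            cursor += 1
    return scanned


def rows_scanned_by_activity(recent: Sequence[bool], page_size: int) -> int:
    """Strategy 2: per page, read all recent rows, sort by id, slice after cursor."""
    recent_ids = [i for i, flag in enumerate(recent) if flag]
    scanned = 0
    cursor = -1  # id of the last row returned so far
    while True:
        scanned += len(recent_ids)  # every page re-reads all recent rows
        page = [i for i in sorted(recent_ids) if i > cursor][:page_size]
        if not page:
            break
        cursor = page[-1]
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
    return scanned


if __name__ == "__main__":
    # 1000 repos, one in ten with recent activity, pages of 30 results.
    flags = [i % 10 == 0 for i in range(1000)]
    print(rows_scanned_by_id(flags, 30))        # one row per known repo
    print(rows_scanned_by_activity(flags, 30))  # recent repos x number of pages
```

In this model the first strategy always costs the total number of repos, whatever the amount of recent activity, while the second costs the number of recent repos multiplied by the number of pages, which grows quadratically with the number of recent repos.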

Which of these queries is used will depend on what the query planner thinks either index (on id or on last_activity_at) is worth. This seems like a good area for creating some poorly controlled load on the server side (and an eventual emergency throttling of our requests), so I'd really like us to confirm with upstream that this combined filtering + keyset pagination is intended behavior before we commit to it.

I found a GitLab epic about the performance of the projects endpoint, but it does not answer the question of the performance impact of filtering on last_activity_at.
However, the epic is quite active, so we may get our answers by monitoring it.