After showcasing the first results of my work on T3127 to display the distribution of origins per forge, @rdicosmo noticed that the number of gitlab.com origins is wrong (T3127#67579).
Indeed, the number of listed gitlab.com origins in the scheduler database is around 200k, but if we compare that number to the count of gitlab.com origins in the storage database:
softwareheritage=> select count(*) from origin where url like 'https://gitlab.com/%';
  count
---------
 1023499
(1 row)
we are clearly missing a lot of origins after the listing process.
So I played around with the gitlab lister in a docker environment to see how many origins it lists for gitlab.com.
It turns out that it encounters a lot of HTTP errors, which make the listing process stop early.
During my tests, the lister encountered HTTP 502, 503 and 520 errors:
swh-lister_1 | [2021-07-22 14:09:27,575: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 503 Server Error: Service Unavailable for url: https://gitlab.com/api/v4/projects?id_after=835919&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=20&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.
swh-lister_1 | [2021-07-22 18:17:30,225: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 502 Server Error: Bad Gateway for url: https://gitlab.com/api/v4/projects?id_after=172057&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.
swh-lister_1 | [2021-07-22 21:07:28,639: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 520 Server Error: for url: https://gitlab.com/api/v4/projects?id_after=3963192&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.
Those correspond to temporary server failures and can be mitigated by adapting the retry policy of the lister.
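As an illustration of such a retry policy, here is a minimal standard-library sketch. It is not the actual swh-lister code, which relies on tenacity. The `fetch_with_retry` function, the injected `fetch` callable and the choice of exponential backoff are all assumptions made for this example; only the three status codes come from the logs above.

```python
import time
from typing import Callable, Tuple

# Status codes seen in the logs above; treating all of them as transient
# is an assumption of this sketch.
TRANSIENT_STATUS_CODES = frozenset({502, 503, 520})


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), retrying with exponential backoff on transient statuses.

    Returns the last (status, body) pair. The caller decides what to do with
    a non-transient error, or with a transient one that persisted.
    """
    status, body = fetch(url)
    for attempt in range(1, max_attempts):
        if status not in TRANSIENT_STATUS_CODES:
            break
        # Wait 1s, 2s, 4s, ... between attempts (hypothetical policy).
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch(url)
    return status, body
```

Passing `fetch` and `sleep` as parameters keeps the sketch testable without network access or real waiting.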
The lister also encounters HTTP 500 errors:
swh-lister_1 | [2021-07-22 19:17:26,468: WARNING/ForkPoolWorker-1] Unexpected HTTP status code 500 on https://gitlab.com/api/v4/projects?id_after=2113709&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false: b'{"message":"500 Internal Server Error"}'
swh-lister_1 | [2021-07-23 03:14:57,730: WARNING/ForkPoolWorker-1] Unexpected HTTP status code 500 on https://gitlab.com/api/v4/projects?id_after=8354264&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false: b'{"message":"500 Internal Server Error"}'
Those errors come from the GitLab side, and the only way to mitigate them is to skip the failing URLs and move on to the next page.
It turns out that other GitLab API users encounter the same kind of errors, as there is an issue on the subject in the GitLab bug tracker.
Those 500 errors have also been encountered in production, which is why the gitlab listing is incomplete.
I quickly hacked on the swh-lister code to implement the mitigations and ran the gitlab lister once again in my docker environment.
diff --git a/swh/lister/gitlab/lister.py b/swh/lister/gitlab/lister.py
index 7adf73b..8f56515 100644
--- a/swh/lister/gitlab/lister.py
+++ b/swh/lister/gitlab/lister.py
@@ -17,7 +17,7 @@ from tenacity.before_sleep import before_sleep_log
 
 from swh.lister import USER_AGENT
 from swh.lister.pattern import CredentialsType, Lister
-from swh.lister.utils import retry_attempt, throttling_retry
+from swh.lister.utils import retry_attempt, throttling_retry, is_retryable_exception
 from swh.scheduler.model import ListedOrigin
 
 logger = logging.getLogger(__name__)
@@ -56,7 +56,7 @@ def _if_rate_limited(retry_state) -> bool:
             isinstance(exc, HTTPError)
             and exc.response.status_code == codes.forbidden
             and int(exc.response.headers.get("RateLimit-Remaining", "0")) == 0
-        )
+        ) or is_retryable_exception(exc)
     return False
 
 
@@ -117,6 +117,8 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
             {"Accept": "application/json", "User-Agent": USER_AGENT}
         )
 
+        self.origins = set()
+
         if len(self.credentials) > 0:
             cred = random.choice(self.credentials)
             logger.info(
@@ -145,7 +147,18 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
                 response.url,
                 response.content,
             )
-        response.raise_for_status()
+        if response.status_code == 500:
+            id_after = _parse_id_after(url)
+            while True:
+                next_id_after = id_after + 100
+                url = url.replace(f"id_after={id_after}", f"id_after={next_id_after}")
+                response = self.session.get(url)
+                if response.status_code == 200:
+                    break
+                else:
+                    id_after = next_id_after
+        else:
+            response.raise_for_status()
         repositories: Tuple[Repository, ...] = tuple(response.json())
         if hasattr(response, "links") and response.links.get("next"):
             next_page = response.links["next"]["url"]
@@ -159,6 +172,8 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
             "pagination": "keyset",
             "order_by": "id",
             "sort": "asc",
+            "per_page": "100",
+            "simple": "true",
         }
         if id_after is not None:
             parameters["id_after"] = str(id_after)
@@ -182,12 +197,14 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
         repositories = page_result.repositories if page_result.repositories else []
         for repo in repositories:
+            self.origins.add(repo["http_url_to_repo"])
             yield ListedOrigin(
                 lister_id=self.lister_obj.id,
                 url=repo["http_url_to_repo"],
                 visit_type="git",
                 last_update=iso8601.parse_date(repo["last_activity_at"]),
             )
+        logger.debug("%s origins listed", len(self.origins))
 
     def commit_page(self, page_result: PageResult) -> None:
         """Update currently stored state using the latest listed "next" page if relevant.
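The diff above calls a `_parse_id_after` helper that it does not show. The sketch below is one hypothetical way to write it, along with the skip logic, using only the standard library. The function names, the injected `fetch` callable and the `max_skips` bound are assumptions for illustration; the hack in the diff loops without a bound and rewrites the URL with `str.replace` instead.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def parse_id_after(url: str) -> Optional[int]:
    """Return the id_after query parameter of url as an int, or None if absent."""
    values = parse_qs(urlparse(url).query).get("id_after")
    return int(values[0]) if values else None


def with_id_after(url: str, id_after: int) -> str:
    """Return url with its id_after query parameter set to id_after."""
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["id_after"] = [str(id_after)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def skip_failing_pages(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    step: int = 100,
    max_skips: int = 50,
) -> Tuple[str, int, bytes]:
    """Advance id_after by step while the server answers 500.

    Gives up after max_skips skips (a hypothetical safety bound) and returns
    the last URL tried with its (status, body).
    """
    status, body = fetch(url)
    skips = 0
    while status == 500 and skips < max_skips:
        id_after = parse_id_after(url) or 0
        url = with_id_after(url, id_after + step)
        status, body = fetch(url)
        skips += 1
    return url, status, body
```

Note that each skip jumps over a range of `step` project ids, so any projects in that range are not listed. A smaller step would lose fewer origins at the cost of more requests.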
The process has not stopped since yesterday evening and the current number of listed origins is 1173800, so the fixes seem to do the trick.
Let's properly implement and test the fixes then.