
Make GitLab lister more robust to HTTP errors
Closed, Resolved, Public

Description

After showcasing the first results of my work on T3127 to display the distribution of origins per forge, @rdicosmo noticed that the number of gitlab.com origins is wrong (T3127#67579).

Indeed, the number of listed gitlab.com origins in the scheduler database is around 200k, but if we compare that number to the count of gitlab.com origins in the storage database:

softwareheritage=> select count(*) from origin where url like 'https://gitlab.com/%';
  count  
---------
 1023499
(1 row)

we are clearly missing a lot of origins after the listing process.

So I played around with the GitLab lister in a Docker environment to see how many origins it lists for gitlab.com.
It turns out it encountered a lot of HTTP errors that made the listing process stop early.

During my tests, the lister encountered HTTP 502, 503 and 520 errors:

swh-lister_1                     | [2021-07-22 14:09:27,575: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 503 Server Error: Service Unavailable for url: https://gitlab.com/api/v4/projects?id_after=835919&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=20&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.
swh-lister_1                     | [2021-07-22 18:17:30,225: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 502 Server Error: Bad Gateway for url: https://gitlab.com/api/v4/projects?id_after=172057&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.
swh-lister_1                     | [2021-07-22 21:07:28,639: WARNING/ForkPoolWorker-1] Retrying swh.lister.gitlab.lister.GitLabLister.get_page_result in 1.0 seconds as it raised HTTPError: 520 Server Error:  for url: https://gitlab.com/api/v4/projects?id_after=3963192&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false.

Those correspond to temporary server failures and can be mitigated by adapting the retry policy of the lister.
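
To illustrate what adapting the retry policy could look like, here is a minimal standard-library sketch of retrying transient failures with exponential backoff. It is not the swh-lister implementation, which relies on tenacity. The names `HTTPStatusError`, `is_transient` and `call_with_retry` are hypothetical, and treating exactly 502, 503 and 520 as retryable is an assumption drawn from the logs above.

```python
import time
from typing import Callable

# Status codes seen in the logs above that look like transient failures.
# Treating exactly this set as retryable is an assumption of this sketch.
TRANSIENT_STATUS_CODES = {502, 503, 520}


class HTTPStatusError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed request."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_transient(exc: Exception) -> bool:
    """Return True when the error is worth retrying."""
    return (
        isinstance(exc, HTTPStatusError)
        and exc.status_code in TRANSIENT_STATUS_CODES
    )


def call_with_retry(
    fetch: Callable[[], dict],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            # Give up on non-transient errors or once attempts are exhausted.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            # Wait base_delay, then twice that, and so on, before retrying.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so that the backoff can be exercised in tests without actually waiting. A 500 response is deliberately not in the retryable set here, since retrying the same URL does not help in that case.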

The lister also encounters HTTP 500 errors:

swh-lister_1                     | [2021-07-22 19:17:26,468: WARNING/ForkPoolWorker-1] Unexpected HTTP status code 500 on https://gitlab.com/api/v4/projects?id_after=2113709&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false: b'{"message":"500 Internal Server Error"}'
swh-lister_1                     | [2021-07-23 03:14:57,730: WARNING/ForkPoolWorker-1] Unexpected HTTP status code 500 on https://gitlab.com/api/v4/projects?id_after=8354264&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=true&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false: b'{"message":"500 Internal Server Error"}'

Those errors come from the GitLab side, and the only way to mitigate them is to skip the failing URLs and move on to the next page.
It turns out other GitLab API users encounter the same kind of errors, as there is an issue on the subject in the GitLab bug tracker.
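
As a rough sketch of the skipping idea, the keyset cursor `id_after` in the failing URL can be advanced until the API answers with something other than a 500. The helper names `bump_id_after` and `first_working_url`, the step of 100 and the `max_skips` bound are assumptions made for illustration. `get_status` stands in for whatever performs the HTTP request and returns its status code.

```python
from typing import Callable
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def bump_id_after(url: str, step: int) -> str:
    """Return url with its id_after query parameter increased by step."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # A missing id_after is treated as 0 (an assumption of this sketch).
    current = int(query.get("id_after", ["0"])[0])
    query["id_after"] = [str(current + step)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def first_working_url(
    url: str,
    get_status: Callable[[str], int],
    step: int = 100,
    max_skips: int = 50,
) -> str:
    """Advance id_after until get_status(url) is not 500, or give up."""
    for _ in range(max_skips + 1):
        if get_status(url) != 500:
            return url
        # Skip past the failing range and try the next window of ids.
        url = bump_id_after(url, step)
    raise RuntimeError(f"still failing after {max_skips} skips")
```

The trade-off is that any project whose id falls in a skipped range is not listed, so a small step keeps the loss low at the cost of more requests. Bounding the number of skips avoids looping forever if the API stays down.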

Those 500 errors have also been encountered in production, which is why the gitlab.com listing is incomplete.

I quickly hacked on the swh-lister code to implement the mitigations and ran the GitLab lister once again in my Docker environment.

diff --git a/swh/lister/gitlab/lister.py b/swh/lister/gitlab/lister.py
index 7adf73b..8f56515 100644
--- a/swh/lister/gitlab/lister.py
+++ b/swh/lister/gitlab/lister.py
@@ -17,7 +17,7 @@ from tenacity.before_sleep import before_sleep_log
 
 from swh.lister import USER_AGENT
 from swh.lister.pattern import CredentialsType, Lister
-from swh.lister.utils import retry_attempt, throttling_retry
+from swh.lister.utils import retry_attempt, throttling_retry, is_retryable_exception
 from swh.scheduler.model import ListedOrigin
 
 logger = logging.getLogger(__name__)
@@ -56,7 +56,7 @@ def _if_rate_limited(retry_state) -> bool:
             isinstance(exc, HTTPError)
             and exc.response.status_code == codes.forbidden
             and int(exc.response.headers.get("RateLimit-Remaining", "0")) == 0
-        )
+        ) or is_retryable_exception(exc)
     return False
 
 
@@ -117,6 +117,8 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
             {"Accept": "application/json", "User-Agent": USER_AGENT}
         )
 
+        self.origins = set()
+
         if len(self.credentials) > 0:
             cred = random.choice(self.credentials)
             logger.info(
@@ -145,7 +147,18 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
                 response.url,
                 response.content,
             )
-        response.raise_for_status()
+        if response.status_code == 500:
+            id_after = _parse_id_after(url)
+            while True:
+                next_id_after = id_after + 100
+                url = url.replace(f"id_after={id_after}", f"id_after={next_id_after}")
+                response = self.session.get(url)
+                if response.status_code == 200:
+                    break
+                else:
+                    id_after = next_id_after
+        else:
+            response.raise_for_status()
         repositories: Tuple[Repository, ...] = tuple(response.json())
         if hasattr(response, "links") and response.links.get("next"):
             next_page = response.links["next"]["url"]
@@ -159,6 +172,8 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
             "pagination": "keyset",
             "order_by": "id",
             "sort": "asc",
+            "per_page": "100",
+            "simple": "true",
         }
         if id_after is not None:
             parameters["id_after"] = str(id_after)
@@ -182,12 +197,14 @@ class GitLabLister(Lister[GitLabListerState, PageResult]):
 
         repositories = page_result.repositories if page_result.repositories else []
         for repo in repositories:
+            self.origins.add(repo["http_url_to_repo"])
             yield ListedOrigin(
                 lister_id=self.lister_obj.id,
                 url=repo["http_url_to_repo"],
                 visit_type="git",
                 last_update=iso8601.parse_date(repo["last_activity_at"]),
             )
+        logger.debug("%s origins listed", len(self.origins))
 
     def commit_page(self, page_result: PageResult) -> None:
         """Update currently stored state using the latest listed "next" page if relevant.

The process has not stopped since yesterday evening and the current number of listed origins is 1173800, so the fixes seem
to do the trick.

Let's properly implement and test the fixes then.

Event Timeline

anlambert triaged this task as Normal priority. Jul 23 2021, 12:14 PM
anlambert created this task.

As usual, very nice analysis \o/

For the record, my lister is still running, 1320500 gitlab.com origins listed so far.

Great, thanks a lot.

I tagged a v1.5.0 and created a task [1] to follow through the deployment. I'll deploy
the gitlab lister with your fixes next week.

In the meantime, as you tested it on Docker and landed the diff, and it's now released,
this task can be considered done (shippable) and you can close it ;)

[1] Related to T3443

anlambert claimed this task.

Closing this as requested.

I am still running the gitlab.com lister in my docker env, 1454600 origins listed so far.
I will post the final number here when the listing ends.

swh-lister_1                     | [2021-07-24 08:58:25,127: INFO/ForkPoolWorker-1] Task swh.lister.gitlab.tasks.FullGitLabRelister[717c82b2-175a-492c-b701-22b4fd34e5e2] succeeded in 139476.781209187s: {'pages': 27470, 'origins': 2746838}
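
As a quick sanity check on that final log line, the figures can be turned into a duration and a page fill rate. The numbers below are copied from the log; the variable names are only illustrative.

```python
# Figures copied from the task result logged above.
elapsed_seconds = 139476.781209187
pages = 27470
origins = 2746838

hours = elapsed_seconds / 3600
origins_per_page = origins / pages
origins_per_second = origins / elapsed_seconds

print(f"{hours:.1f} hours")
print(f"{origins_per_page:.2f} origins per page")
print(f"{origins_per_second:.1f} origins per second")
```

The run took roughly 38.7 hours and returned just under 100 origins per page, which is consistent with the `per_page` value of 100 set in the diff.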