In case of all retry tryouts failed, finally ignore that page and continue listing the
next pages.
Related to T3948
Differential D7198
launchpad: Ignore erratic page and continue listing ardumont on Feb 17 2022, 6:20 PM. Authored by
Details
In case of all retry tryouts failed, finally ignore that page and continue listing the Related to T3948 tox Applied on staging and triggered the listing. Feb 17 17:40:37 worker1 python3[2168786]: [2022-02-17 17:40:37,571: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.IncrementalGitLabLister[496b82a5-6e7a-4e39-841c-e9158334fc1e] Feb 17 17:40:37 worker1 python3[2168796]: [2022-02-17 17:40:37,796: INFO/ForkPoolWorker-4] Task swh.lister.gitlab.tasks.IncrementalGitLabLister[496b82a5-6e7a-4e39-841c-e9158334fc1e] succeeded in 0.19229740649461746s: {'pages': 1, 'origins': 0} It finished: Feb 17 20:36:17 worker1 python3[2168793]: [2022-02-17 20:36:17,861: INFO/ForkPoolWorker-1] Task swh.lister.launchpad.tasks.FullLaunchpadLister[d8bb1c71-4c2c-406c-9431-f358cdca5ec5] succeeded in 10058.709554322995s: {'pages': 2, 'origins': 216789}
Diff Detail
Event TimelineComment Actions Build is green Patch application report for D7198 (id=26101)Rebasing onto 6a7479553e... Current branch diff-target is up to date. Changes applied before testcommit 3b3f39eb333733dc2f814ed55b9d808bb67dcc01 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Feb 17 18:16:43 2022 +0100 launchpad: Ignore erratic page and continue listing Related to T3948 See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/463/ for more details.
Comment Actions I misunderstood what is the repos object in the code so we did not use the throttling_retry decorator the right way. repos is an instance of lazr.restfulclient.resource import Collection which implements an iterator fetching pages from What we have to retry here is the next operation on the iterator as that call can raise a RestfulError exception. But you are right, if after the max attempts of retry we still get an exception, we have to stop the listing of the current page. Below is the diff about how I manage to implement what it is described above for you to get the idea: diff --git a/swh/lister/launchpad/lister.py b/swh/lister/launchpad/lister.py index 722b299..ccc2378 100644 --- a/swh/lister/launchpad/lister.py +++ b/swh/lister/launchpad/lister.py @@ -11,7 +11,8 @@ from typing import Any, Dict, Iterator, Optional, Tuple import iso8601 from launchpadlib.launchpad import Launchpad from lazr.restfulclient.errors import RestfulError -from lazr.restfulclient.resource import Collection +from lazr.restfulclient.resource import Collection, Resource +from tenacity.before_sleep import before_sleep_log from swh.lister.utils import retry_if_exception, throttling_retry from swh.scheduler.interface import SchedulerInterface @@ -99,10 +100,13 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]): d[attribute_name] = date_last_modified.isoformat() return d - @throttling_retry(retry=retry_if_restful_error) + @throttling_retry( + retry=retry_if_restful_error, + before_sleep=before_sleep_log(logger, logging.WARNING), + ) def _page_request( self, launchpad, vcs_type: str, date_last_modified: Optional[datetime] - ) -> Optional[Collection]: + ) -> Collection: """Querying the page of results for a given vcs_type since the date_last_modified. If some issues occurs, this will deal with the retrying policy. @@ -141,7 +145,13 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]): continue yield vcs_type, result - @throttling_retry(retry=retry_if_restful_error) + @throttling_retry( + retry=retry_if_restful_error, + before_sleep=before_sleep_log(logger, logging.WARNING), + ) + def get_next_repo(self, repos_it: Iterator[Resource]) -> Resource: + return next(repos_it) + def get_origins_from_page(self, page: LaunchpadPageType) -> Iterator[ListedOrigin]: """ Iterate on all git repositories and yield ListedOrigin instances. @@ -149,9 +159,10 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]): assert self.lister_obj.id is not None vcs_type, repos = page - + repos_it = iter(repos) try: - for repo in repos: + while True: + repo = self.get_next_repo(repos_it) origin_url = origin(vcs_type, repo) # filter out origins with invalid URL @@ -175,6 +186,8 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]): url=origin_url, last_update=last_update, ) + except StopIteration: + pass except RestfulError as e: logger.warning("Listing %s origins raised %s", vcs_type, e) diff --git a/swh/lister/launchpad/tests/test_lister.py b/swh/lister/launchpad/tests/test_lister.py index 59fe605..12c3095 100644 --- a/swh/lister/launchpad/tests/test_lister.py +++ b/swh/lister/launchpad/tests/test_lister.py @@ -26,18 +26,23 @@ class _Repo: class _Collection: entries: List[_Repo] = [] - def __init__(self, file): - self.entries = [_Repo(r) for r in file] + def __init__(self, repos): + self.repos = repos + self.it = iter(self.repos) + + def __next__(self): + return next(self.it) def __getitem__(self, key): - return self.entries[key] + return self.repos[key] def __len__(self): - return len(self.entries) + return len(self.repos) def _launchpad_response(datadir, datafile): - return _Collection(json.loads(Path(datadir, datafile).read_text())) + repos = json.loads(Path(datadir, datafile).read_text()) + return _Collection([_Repo(r) for r in repos]) @pytest.fixture @@ -194,7 +199,9 @@ def test_launchpad_incremental_lister( def test_launchpad_lister_invalid_url_filtering( swh_scheduler, mocker, ): - invalid_origin = [_Repo({"git_https_url": "tag:launchpad.net:2008:redacted",})] + invalid_origin = _Collection( + [_Repo({"git_https_url": "tag:launchpad.net:2008:redacted",})] + ) _mock_launchpad(mocker, invalid_origin) lister = LaunchpadLister(scheduler=swh_scheduler) stats = lister.run() @@ -213,7 +220,7 @@ def test_launchpad_lister_duplicated_origin( "date_last_modified": "2021-01-14 21:05:31.231406+00:00", } ) - origins = [origin, origin] + origins = _Collection([origin, origin]) _mock_launchpad(mocker, origins) lister = LaunchpadLister(scheduler=swh_scheduler) stats = lister.run() Comment Actions
yes, totally.
And that part escaped me, indeed, let's try the next origin... Thanks for pointing it Comment Actions Build is green Patch application report for D7198 (id=26102)Rebasing onto 6a7479553e... Current branch diff-target is up to date. Changes applied before testcommit 46401122b9596c96c0d5250458eb0ad4bce76904 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Feb 17 18:16:43 2022 +0100 launchpad: Ignore erratic origin reading failure Related to T3948 See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/464/ for more details. Comment Actions No need for a full blown try: except: block, restrict to where the iteration consumption Comment Actions Build is green Patch application report for D7198 (id=26103)Rebasing onto 6a7479553e... Current branch diff-target is up to date. Changes applied before testcommit d18f7b53a2368ee9ffb5cb86c4b6b87bf8433263 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Feb 17 18:16:43 2022 +0100 launchpad: Ignore erratic origin reading failure Related to T3948 See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/465/ for more details. Comment Actions Build is green Patch application report for D7198 (id=26104)Rebasing onto 6a7479553e... Current branch diff-target is up to date. Changes applied before testcommit f3c6a9ee80660aef19019d95b747a2f5835b3afd Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Feb 17 18:16:43 2022 +0100 launchpad: Ignore erratic origin reading failure Related to T3948 See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/466/ for more details. Comment Actions Revert to initial commit (other suggestions do not work as retry on iterator cannot Comment Actions So after some tests, what I proposed in D7198#187379 simply does not work as iterators cannot be retried. So wrapping the repos iteration in a try/except block is equivalent. Let's use that and see on staging if the incremental lister can continue where it stopped when a RestfulError is raised. Comment Actions Build is green Patch application report for D7198 (id=26105)Rebasing onto 6a7479553e... Current branch diff-target is up to date. Changes applied before testcommit 2568ecc7c2fa1082f7a739c383b4659756d476cc Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Feb 17 18:16:43 2022 +0100 launchpad: Ignore erratic page and continue listing next page The decorator is dropped on `get_origins_from_page` as we cannot retry an iterator consumption anyway. Related to T3948 See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/467/ for more details. |