diff --git a/PKG-INFO b/PKG-INFO index 20bed91..edc8cf0 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,127 +1,127 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 2.6.4 +Version: 2.7.0 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component from the Software Heritage stack aims to produce listings of software origins and their urls hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors. It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. 
Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister name introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. create configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimalistic configuration shared by all listers to add in file `~/.config/swh/listers.yml`: ```lang=yml scheduler: cls: 'remote' args: url: 'http://localhost:5008/' credentials: {} ``` Note: This expects scheduler (5008) service to run locally ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters] ``` Examples: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi ``` Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO index 20bed91..edc8cf0 100644 --- a/swh.lister.egg-info/PKG-INFO +++ b/swh.lister.egg-info/PKG-INFO @@ -1,127 +1,127 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 2.6.4 +Version: 2.7.0 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component from the Software Heritage stack aims to produce listings of software origins and their urls hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors. 
It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister name introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. 
create configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimalistic configuration shared by all listers to add in file `~/.config/swh/listers.yml`: ```lang=yml scheduler: cls: 'remote' args: url: 'http://localhost:5008/' credentials: {} ``` Note: This expects scheduler (5008) service to run locally ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters] ``` Examples: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi ``` Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. 
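Reviewer note: the local deployment steps in the README text above (create `~/.config/swh/`, write `listers.yml`, then call `swh lister run`) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `write_minimal_config` and `lister_command` are hypothetical and not part of `swh.lister`, and the YAML indentation is reconstructed from the README sample. The demo writes to a temporary directory rather than the real `~/.config/swh/`.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

# Minimal configuration taken from the README sample. The indentation is an
# assumption; the sample expects a scheduler service on localhost:5008.
MINIMAL_CONFIG = """\
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
"""


def write_minimal_config(config_dir: Path) -> Path:
    """Create config_dir if needed and write listers.yml (hypothetical helper)."""
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "listers.yml"
    config_path.write_text(MINIMAL_CONFIG)
    return config_path


def lister_command(
    config_path: Path, lister_name: str, params: Optional[List[str]] = None
) -> List[str]:
    """Build the argument list for the documented CLI call (hypothetical helper)."""
    return [
        "swh", "--log-level", "DEBUG",
        "lister", "-C", str(config_path),
        "run", "--lister", lister_name,
        *(params or []),
    ]


if __name__ == "__main__":
    # Use a throwaway directory so the demo never touches a real configuration.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_minimal_config(Path(tmp) / "swh")
        cmd = lister_command(path, "gitlab", ["url=https://salsa.debian.org/api/v4/"])
        print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` once the `swh` CLI is installed and a scheduler is reachable; neither is assumed by the sketch itself.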
diff --git a/swh.lister.egg-info/SOURCES.txt b/swh.lister.egg-info/SOURCES.txt index 433810e..749cc6a 100644 --- a/swh.lister.egg-info/SOURCES.txt +++ b/swh.lister.egg-info/SOURCES.txt @@ -1,248 +1,251 @@ .gitignore .pre-commit-config.yaml ACKNOWLEDGEMENTS CODE_OF_CONDUCT.md CONTRIBUTORS LICENSE MANIFEST.in Makefile README.md conftest.py mypy.ini pyproject.toml pytest.ini requirements-swh.txt requirements-test.txt requirements.txt setup.cfg setup.py tox.ini docs/.gitignore docs/Makefile docs/cli.rst docs/conf.py docs/index.rst docs/new_lister_template.py docs/run_a_new_lister.rst docs/save_forge.rst docs/tutorial.rst docs/_static/.placeholder docs/_templates/.placeholder docs/images/new_base.png docs/images/new_bitbucket_lister.png docs/images/new_github_lister.png docs/images/old_github_lister.png sql/crawler.sql sql/pimp_db.sql swh/__init__.py swh.lister.egg-info/PKG-INFO swh.lister.egg-info/SOURCES.txt swh.lister.egg-info/dependency_links.txt swh.lister.egg-info/entry_points.txt swh.lister.egg-info/requires.txt swh.lister.egg-info/top_level.txt swh/lister/__init__.py swh/lister/cli.py swh/lister/pattern.py swh/lister/py.typed swh/lister/utils.py swh/lister/bitbucket/__init__.py swh/lister/bitbucket/lister.py swh/lister/bitbucket/tasks.py swh/lister/bitbucket/tests/__init__.py swh/lister/bitbucket/tests/test_lister.py swh/lister/bitbucket/tests/test_tasks.py swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json swh/lister/cgit/__init__.py swh/lister/cgit/lister.py swh/lister/cgit/tasks.py swh/lister/cgit/tests/__init__.py swh/lister/cgit/tests/repo_list.txt swh/lister/cgit/tests/test_lister.py swh/lister/cgit/tests/test_tasks.py swh/lister/cgit/tests/data/https_git.baserock.org/cgit swh/lister/cgit/tests/data/https_git.eclipse.org/c swh/lister/cgit/tests/data/https_git.savannah.gnu.org/README swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit 
swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit_elisp-es.git swh/lister/cgit/tests/data/https_git.tizen/README swh/lister/cgit/tests/data/https_git.tizen/cgit swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=100 swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=50 swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Users swh/lister/cgit/tests/data/https_git.tizen/cgit_Lock-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-base swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-mc1n2 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e3250 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e4x12 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_devices_nfc-plugin-nxp swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_intel_mfld_bootstub-mfld-blackbay swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_mtdev swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_opengl-es-virtual-drv swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libdrm swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libnl swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_xorg_driver_xserver-xorg-misc swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-gallery-efl swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-homescreen-efl swh/lister/cgit/tests/data/https_jff.email/cgit swh/lister/cran/__init__.py swh/lister/cran/list_all_packages.R swh/lister/cran/lister.py swh/lister/cran/tasks.py swh/lister/cran/tests/__init__.py swh/lister/cran/tests/test_lister.py swh/lister/cran/tests/test_tasks.py swh/lister/cran/tests/data/list-r-packages.json swh/lister/debian/__init__.py swh/lister/debian/lister.py swh/lister/debian/tasks.py swh/lister/debian/tests/__init__.py 
swh/lister/debian/tests/test_lister.py swh/lister/debian/tests/test_tasks.py swh/lister/debian/tests/data/Sources_bullseye swh/lister/debian/tests/data/Sources_buster swh/lister/debian/tests/data/Sources_stretch swh/lister/gitea/__init__.py swh/lister/gitea/lister.py swh/lister/gitea/tasks.py swh/lister/gitea/tests/__init__.py swh/lister/gitea/tests/test_lister.py swh/lister/gitea/tests/test_tasks.py swh/lister/gitea/tests/data/https_try.gitea.io/repos_page1 swh/lister/gitea/tests/data/https_try.gitea.io/repos_page2 swh/lister/github/__init__.py swh/lister/github/lister.py swh/lister/github/tasks.py swh/lister/github/tests/__init__.py swh/lister/github/tests/test_lister.py swh/lister/github/tests/test_tasks.py swh/lister/gitlab/__init__.py swh/lister/gitlab/lister.py swh/lister/gitlab/tasks.py swh/lister/gitlab/tests/__init__.py swh/lister/gitlab/tests/test_lister.py swh/lister/gitlab/tests/test_tasks.py swh/lister/gitlab/tests/data/https_foss.heptapod.net/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page2.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page3.json swh/lister/gitlab/tests/data/https_gitlab.com/api_response_page1.json swh/lister/gnu/__init__.py swh/lister/gnu/lister.py swh/lister/gnu/tasks.py swh/lister/gnu/tree.py swh/lister/gnu/tests/__init__.py swh/lister/gnu/tests/test_lister.py swh/lister/gnu/tests/test_tasks.py swh/lister/gnu/tests/test_tree.py swh/lister/gnu/tests/data/tree.json swh/lister/gnu/tests/data/tree.min.json swh/lister/gnu/tests/data/https_ftp.gnu.org/tree.json.gz swh/lister/launchpad/__init__.py swh/lister/launchpad/lister.py swh/lister/launchpad/tasks.py swh/lister/launchpad/tests/__init__.py swh/lister/launchpad/tests/conftest.py swh/lister/launchpad/tests/test_lister.py swh/lister/launchpad/tests/test_tasks.py +swh/lister/launchpad/tests/data/launchpad_bzr_response.json 
swh/lister/launchpad/tests/data/launchpad_response1.json swh/lister/launchpad/tests/data/launchpad_response2.json swh/lister/maven/README.md swh/lister/maven/__init__.py swh/lister/maven/lister.py swh/lister/maven/tasks.py swh/lister/maven/tests/__init__.py swh/lister/maven/tests/test_lister.py swh/lister/maven/tests/test_tasks.py swh/lister/maven/tests/data/http_indexes/export.fld swh/lister/maven/tests/data/http_indexes/export_incr.fld swh/lister/maven/tests/data/https_maven.org/arangodb-graphql-1.2.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.0.malformed.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.0.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.1.pom swh/lister/npm/__init__.py swh/lister/npm/lister.py swh/lister/npm/tasks.py swh/lister/npm/tests/test_lister.py swh/lister/npm/tests/test_tasks.py swh/lister/npm/tests/data/npm_full_page1.json swh/lister/npm/tests/data/npm_full_page2.json swh/lister/npm/tests/data/npm_incremental_page1.json swh/lister/npm/tests/data/npm_incremental_page2.json swh/lister/opam/__init__.py swh/lister/opam/lister.py swh/lister/opam/tasks.py swh/lister/opam/tests/__init__.py swh/lister/opam/tests/test_lister.py swh/lister/opam/tests/test_tasks.py swh/lister/opam/tests/data/fake_opam_repo/repo swh/lister/opam/tests/data/fake_opam_repo/version swh/lister/opam/tests/data/fake_opam_repo/packages/agrid/agrid.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.4/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.5/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.6/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.1/opam 
swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/ocb/ocb.0.1/opam swh/lister/packagist/__init__.py swh/lister/packagist/lister.py swh/lister/packagist/tasks.py swh/lister/packagist/tests/__init__.py swh/lister/packagist/tests/test_lister.py swh/lister/packagist/tests/test_tasks.py swh/lister/packagist/tests/data/den1n_contextmenu.json swh/lister/packagist/tests/data/ljjackson_linnworks.json swh/lister/packagist/tests/data/lky_wx_article.json swh/lister/packagist/tests/data/spryker-eco_computop-api.json swh/lister/phabricator/__init__.py swh/lister/phabricator/lister.py swh/lister/phabricator/tasks.py swh/lister/phabricator/tests/__init__.py swh/lister/phabricator/tests/test_lister.py swh/lister/phabricator/tests/test_tasks.py swh/lister/phabricator/tests/data/__init__.py swh/lister/phabricator/tests/data/phabricator_api_repositories_page1.json swh/lister/phabricator/tests/data/phabricator_api_repositories_page2.json swh/lister/pypi/__init__.py swh/lister/pypi/lister.py swh/lister/pypi/tasks.py swh/lister/pypi/tests/__init__.py swh/lister/pypi/tests/test_lister.py swh/lister/pypi/tests/test_tasks.py swh/lister/sourceforge/__init__.py swh/lister/sourceforge/lister.py swh/lister/sourceforge/tasks.py swh/lister/sourceforge/tests/__init__.py swh/lister/sourceforge/tests/test_lister.py swh/lister/sourceforge/tests/test_tasks.py +swh/lister/sourceforge/tests/data/aaron.html +swh/lister/sourceforge/tests/data/aaron.json swh/lister/sourceforge/tests/data/adobexmp.json swh/lister/sourceforge/tests/data/backapps-website.json swh/lister/sourceforge/tests/data/backapps.json swh/lister/sourceforge/tests/data/bzr-repo.json swh/lister/sourceforge/tests/data/main-sitemap.xml swh/lister/sourceforge/tests/data/mojunk.json swh/lister/sourceforge/tests/data/mramm.json swh/lister/sourceforge/tests/data/os3dmodels.json 
swh/lister/sourceforge/tests/data/random-mercurial.json swh/lister/sourceforge/tests/data/subsitemap-0.xml swh/lister/sourceforge/tests/data/subsitemap-1.xml swh/lister/tests/__init__.py swh/lister/tests/test_cli.py swh/lister/tests/test_pattern.py swh/lister/tests/test_utils.py swh/lister/tuleap/__init__.py swh/lister/tuleap/lister.py swh/lister/tuleap/tasks.py swh/lister/tuleap/tests/__init__.py swh/lister/tuleap/tests/test_lister.py swh/lister/tuleap/tests/test_tasks.py swh/lister/tuleap/tests/data/https_tuleap.net/projects swh/lister/tuleap/tests/data/https_tuleap.net/repo_1 swh/lister/tuleap/tests/data/https_tuleap.net/repo_2 swh/lister/tuleap/tests/data/https_tuleap.net/repo_3 \ No newline at end of file diff --git a/swh.lister.egg-info/entry_points.txt b/swh.lister.egg-info/entry_points.txt index 840f4d6..0318375 100644 --- a/swh.lister.egg-info/entry_points.txt +++ b/swh.lister.egg-info/entry_points.txt @@ -1,22 +1,21 @@ +[swh.cli.subcommands] +lister = swh.lister.cli - [swh.cli.subcommands] - lister=swh.lister.cli - [swh.workers] - lister.bitbucket=swh.lister.bitbucket:register - lister.cgit=swh.lister.cgit:register - lister.cran=swh.lister.cran:register - lister.debian=swh.lister.debian:register - lister.gitea=swh.lister.gitea:register - lister.github=swh.lister.github:register - lister.gitlab=swh.lister.gitlab:register - lister.gnu=swh.lister.gnu:register - lister.launchpad=swh.lister.launchpad:register - lister.npm=swh.lister.npm:register - lister.opam=swh.lister.opam:register - lister.packagist=swh.lister.packagist:register - lister.phabricator=swh.lister.phabricator:register - lister.pypi=swh.lister.pypi:register - lister.sourceforge=swh.lister.sourceforge:register - lister.tuleap=swh.lister.tuleap:register - lister.maven=swh.lister.maven:register - \ No newline at end of file +[swh.workers] +lister.bitbucket = swh.lister.bitbucket:register +lister.cgit = swh.lister.cgit:register +lister.cran = swh.lister.cran:register +lister.debian = 
swh.lister.debian:register +lister.gitea = swh.lister.gitea:register +lister.github = swh.lister.github:register +lister.gitlab = swh.lister.gitlab:register +lister.gnu = swh.lister.gnu:register +lister.launchpad = swh.lister.launchpad:register +lister.maven = swh.lister.maven:register +lister.npm = swh.lister.npm:register +lister.opam = swh.lister.opam:register +lister.packagist = swh.lister.packagist:register +lister.phabricator = swh.lister.phabricator:register +lister.pypi = swh.lister.pypi:register +lister.sourceforge = swh.lister.sourceforge:register +lister.tuleap = swh.lister.tuleap:register diff --git a/swh/lister/launchpad/lister.py b/swh/lister/launchpad/lister.py index 381106d..5b60da7 100644 --- a/swh/lister/launchpad/lister.py +++ b/swh/lister/launchpad/lister.py @@ -1,132 +1,202 @@ -# Copyright (C) 2020-2021 The Software Heritage developers +# Copyright (C) 2020-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from dataclasses import dataclass from datetime import datetime import logging -from typing import Any, Dict, Iterator, Optional +from typing import Any, Dict, Iterator, Optional, Tuple import iso8601 from launchpadlib.launchpad import Launchpad +from lazr.restfulclient.errors import RestfulError from lazr.restfulclient.resource import Collection +from swh.lister.utils import retry_if_exception, throttling_retry from swh.scheduler.interface import SchedulerInterface from swh.scheduler.model import ListedOrigin from ..pattern import CredentialsType, Lister logger = logging.getLogger(__name__) -LaunchpadPageType = Iterator[Collection] +VcsType = str +LaunchpadPageType = Tuple[VcsType, Collection] + + +SUPPORTED_VCS_TYPES = ("git", "bzr") @dataclass class LaunchpadListerState: """State of Launchpad lister""" - date_last_modified: Optional[datetime] = None - 
"""modification date of last updated repository since last listing""" + git_date_last_modified: Optional[datetime] = None + """modification date of last updated git repository since last listing""" + bzr_date_last_modified: Optional[datetime] = None + """modification date of last updated bzr repository since last listing""" + + +def origin(vcs_type: str, repo: Any) -> str: + """Determine the origin url out of a repository with a given vcs_type""" + return repo.git_https_url if vcs_type == "git" else repo.web_link + + +def retry_if_restful_error(retry_state): + return retry_if_exception(retry_state, lambda e: isinstance(e, RestfulError)) class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]): """ - List git repositories from Launchpad. + List repositories from Launchpad (git or bzr). Args: scheduler: instance of SchedulerInterface incremental: defines if incremental listing should be used, in that case only modified or new repositories since last incremental listing operation will be returned """ LISTER_NAME = "launchpad" def __init__( self, scheduler: SchedulerInterface, incremental: bool = False, credentials: CredentialsType = None, ): super().__init__( scheduler=scheduler, url="https://launchpad.net/", instance="launchpad", credentials=credentials, ) self.incremental = incremental - self.date_last_modified = None + self.date_last_modified: Dict[str, Optional[datetime]] = { + "git": None, + "bzr": None, + } def state_from_dict(self, d: Dict[str, Any]) -> LaunchpadListerState: - date_last_modified = d.get("date_last_modified") - if date_last_modified is not None: - d["date_last_modified"] = iso8601.parse_date(date_last_modified) + for vcs_type in SUPPORTED_VCS_TYPES: + key = f"{vcs_type}_date_last_modified" + date_last_modified = d.get(key) + if date_last_modified is not None: + d[key] = iso8601.parse_date(date_last_modified) + return LaunchpadListerState(**d) def state_to_dict(self, state: LaunchpadListerState) -> Dict[str, Any]: - d: Dict[str, 
Optional[str]] = {"date_last_modified": None} - date_last_modified = state.date_last_modified - if date_last_modified is not None: - d["date_last_modified"] = date_last_modified.isoformat() + d: Dict[str, Optional[str]] = {} + for vcs_type in SUPPORTED_VCS_TYPES: + attribute_name = f"{vcs_type}_date_last_modified" + d[attribute_name] = None + + if hasattr(state, attribute_name): + date_last_modified = getattr(state, attribute_name) + if date_last_modified is not None: + d[attribute_name] = date_last_modified.isoformat() return d + @throttling_retry(retry=retry_if_restful_error) + def _page_request( + self, launchpad, vcs_type: str, date_last_modified: Optional[datetime] + ) -> Optional[Collection]: + """Querying the page of results for a given vcs_type since the date_last_modified. If + some issues occurs, this will deal with the retrying policy. + + """ + get_vcs_fns = { + "git": launchpad.git_repositories.getRepositories, + "bzr": launchpad.branches.getBranches, + } + + return get_vcs_fns[vcs_type]( + order_by="most neglected first", modified_since_date=date_last_modified, + ) + def get_pages(self) -> Iterator[LaunchpadPageType]: """ - Yields an iterator on all git repositories hosted on Launchpad sorted + Yields an iterator on all git/bzr repositories hosted on Launchpad sorted by last modification date in ascending order. 
""" launchpad = Launchpad.login_anonymously( "softwareheritage", "production", version="devel" ) - date_last_modified = None if self.incremental: - date_last_modified = self.state.date_last_modified - get_repos = launchpad.git_repositories.getRepositories - yield get_repos( - order_by="most neglected first", modified_since_date=date_last_modified - ) + self.date_last_modified = { + "git": self.state.git_date_last_modified, + "bzr": self.state.bzr_date_last_modified, + } + for vcs_type in SUPPORTED_VCS_TYPES: + try: + result = self._page_request( + launchpad, vcs_type, self.date_last_modified[vcs_type] + ) + except RestfulError as e: + logger.warning("Listing %s origins raised %s", vcs_type, e) + result = None + if not result: + continue + yield vcs_type, result + @throttling_retry(retry=retry_if_restful_error) def get_origins_from_page(self, page: LaunchpadPageType) -> Iterator[ListedOrigin]: """ Iterate on all git repositories and yield ListedOrigin instances. """ assert self.lister_obj.id is not None - prev_origin_url = None + vcs_type, repos = page - for repo in page: + for repo in repos: + origin_url = origin(vcs_type, repo) - origin_url = repo.git_https_url - - # filter out origins with invalid URL or origin previously listed - # (last modified repository will be listed twice by launchpadlib) - if not origin_url.startswith("https://") or origin_url == prev_origin_url: + # filter out origins with invalid URL + if not origin_url.startswith("https://"): continue last_update = repo.date_last_modified - self.date_last_modified = last_update - - logger.debug("Found origin %s last updated on %s", origin_url, last_update) + self.date_last_modified[vcs_type] = last_update - prev_origin_url = origin_url + logger.debug( + "Found origin %s with type %s last updated on %s", + origin_url, + vcs_type, + last_update, + ) yield ListedOrigin( lister_id=self.lister_obj.id, - visit_type="git", + visit_type=vcs_type, url=origin_url, last_update=last_update, ) def finalize(self) -> 
None: - if self.date_last_modified is None: + git_date_last_modified = self.date_last_modified["git"] + bzr_date_last_modified = self.date_last_modified["bzr"] + if git_date_last_modified is None and bzr_date_last_modified is None: return if self.incremental and ( - self.state.date_last_modified is None - or self.date_last_modified > self.state.date_last_modified + self.state.git_date_last_modified is None + or ( + git_date_last_modified is not None + and git_date_last_modified > self.state.git_date_last_modified + ) + ): + self.state.git_date_last_modified = git_date_last_modified + + if self.incremental and ( + self.state.bzr_date_last_modified is None + or ( + bzr_date_last_modified is not None + and bzr_date_last_modified > self.state.bzr_date_last_modified + ) ): - self.state.date_last_modified = self.date_last_modified + self.state.bzr_date_last_modified = self.date_last_modified["bzr"] self.updated = True diff --git a/swh/lister/launchpad/tests/data/launchpad_bzr_response.json b/swh/lister/launchpad/tests/data/launchpad_bzr_response.json new file mode 100644 index 0000000..3341c82 --- /dev/null +++ b/swh/lister/launchpad/tests/data/launchpad_bzr_response.json @@ -0,0 +1,126 @@ +[ + { + "self_link": "https://api.launchpad.net/1.0/fourbar", + "web_link": "https://launchpad.net/fourbar", + "resource_type_link": "https://api.launchpad.net/1.0/#project", + "official_answers": true, + "official_blueprints": true, + "official_codehosting": true, + "official_bugs": true, + "information_type": "Public", + "active": true, + "bug_reporting_guidelines": null, + "bug_reported_acknowledgement": null, + "official_bug_tags": [], + "recipes_collection_link": "https://api.launchpad.net/1.0/fourbar/recipes", + "active_milestones_collection_link": "https://api.launchpad.net/1.0/fourbar/active_milestones", + "all_milestones_collection_link": "https://api.launchpad.net/1.0/fourbar/all_milestones", + "bug_supervisor_link": null, + "qualifies_for_free_hosting": true, + 
"reviewer_whiteboard": "tag:launchpad.net:2008:redacted", + "is_permitted": "tag:launchpad.net:2008:redacted", + "project_reviewed": "tag:launchpad.net:2008:redacted", + "license_approved": "tag:launchpad.net:2008:redacted", + "private": false, + "display_name": "fourBar", + "icon_link": "https://api.launchpad.net/1.0/fourbar/icon", + "logo_link": "https://api.launchpad.net/1.0/fourbar/logo", + "name": "fourbar", + "owner_link": "https://api.launchpad.net/1.0/~sorivenul", + "project_group_link": null, + "title": "fourBar", + "registrant_link": "https://api.launchpad.net/1.0/~sorivenul", + "driver_link": null, + "summary": "fourBar is a minimal application launcher for POSIX systems. It launches four commonly used applications (terminal, file browser, editor, and web browser by default). It is written in Python/Tkinter. Documentation on simple customization is included. ", + "description": "If you wish to help with the development of fourBar, download a branch, test, report bugs and propose features. 
There is still work to be done.", + "date_created": "2008-11-03T07:03:00.872230+00:00", + "homepage_url": null, + "wiki_url": null, + "screenshots_url": null, + "download_url": "http://downloads.sourceforge.net/fourbar/fourbar-1.0.0.tar.gz?modtime=1224102066&big_mirror=0", + "programming_language": "Python", + "sourceforge_project": "fourBar", + "freshmeat_project": null, + "brand_link": "https://api.launchpad.net/1.0/fourbar/brand", + "private_bugs": false, + "licenses": [ + "GNU GPL v3" + ], + "license_info": null, + "bug_tracker_link": null, + "date_next_suggest_packaging": null, + "series_collection_link": "https://api.launchpad.net/1.0/fourbar/series", + "development_focus_link": "https://api.launchpad.net/1.0/fourbar/trunk", + "releases_collection_link": "https://api.launchpad.net/1.0/fourbar/releases", + "translation_focus_link": null, + "commercial_subscription_link": null, + "commercial_subscription_is_due": false, + "remote_product": "242408&1119369", + "security_contact": null, + "vcs": "Bazaar", + "http_etag": "\"e3685b989bd2609f9a84bd2d90bef380c6f3c92b-13a47c4e8b4688c8fc042bf7eede3a2f4c14a9d6\"", + "date_last_modified":"2016-05-19T16:05:23.706734+00:00" + }, + { + "self_link": "https://api.launchpad.net/1.0/gekkoware", + "web_link": "https://launchpad.net/gekkoware", + "resource_type_link": "https://api.launchpad.net/1.0/#project", + "official_answers": false, + "official_blueprints": false, + "official_codehosting": false, + "official_bugs": false, + "information_type": "Public", + "active": true, + "bug_reporting_guidelines": null, + "bug_reported_acknowledgement": null, + "official_bug_tags": [], + "recipes_collection_link": "https://api.launchpad.net/1.0/gekkoware/recipes", + "active_milestones_collection_link": "https://api.launchpad.net/1.0/gekkoware/active_milestones", + "all_milestones_collection_link": "https://api.launchpad.net/1.0/gekkoware/all_milestones", + "bug_supervisor_link": null, + "qualifies_for_free_hosting": true, + 
"reviewer_whiteboard": "tag:launchpad.net:2008:redacted", + "is_permitted": "tag:launchpad.net:2008:redacted", + "project_reviewed": "tag:launchpad.net:2008:redacted", + "license_approved": "tag:launchpad.net:2008:redacted", + "private": false, + "display_name": "gekkoware", + "icon_link": "https://api.launchpad.net/1.0/gekkoware/icon", + "logo_link": "https://api.launchpad.net/1.0/gekkoware/logo", + "name": "gekkoware", + "owner_link": "https://api.launchpad.net/1.0/~compermisos", + "project_group_link": null, + "title": "gekkoware", + "registrant_link": "https://api.launchpad.net/1.0/~compermisos", + "driver_link": null, + "summary": "A port of gekko to ubuntu", + "description": null, + "date_created": "2007-10-21T03:02:22.186775+00:00", + "homepage_url": "http://gekkoware.org", + "wiki_url": null, + "screenshots_url": null, + "download_url": null, + "programming_language": "php", + "sourceforge_project": "gekkoware", + "freshmeat_project": null, + "brand_link": "https://api.launchpad.net/1.0/gekkoware/brand", + "private_bugs": false, + "licenses": [ + "GNU GPL v2" + ], + "license_info": null, + "bug_tracker_link": null, + "date_next_suggest_packaging": null, + "series_collection_link": "https://api.launchpad.net/1.0/gekkoware/series", + "development_focus_link": "https://api.launchpad.net/1.0/gekkoware/trunk", + "releases_collection_link": "https://api.launchpad.net/1.0/gekkoware/releases", + "translation_focus_link": null, + "commercial_subscription_link": null, + "commercial_subscription_is_due": false, + "remote_product": "117004&676653", + "security_contact": null, + "vcs": "Bazaar", + "http_etag": "\"b9802efcebb5afdd87c8ee10f8473040340bcead-159127be59c12e7cbb161eee4cae2ade72353c0d\"", + "date_last_modified":"2017-03-15T16:03:22.706432+00:00" + } +] diff --git a/swh/lister/launchpad/tests/test_lister.py b/swh/lister/launchpad/tests/test_lister.py index 836fcec..59fe605 100644 --- a/swh/lister/launchpad/tests/test_lister.py +++ 
b/swh/lister/launchpad/tests/test_lister.py @@ -1,175 +1,256 @@ -# Copyright (C) 2020-2021 The Software Heritage developers +# Copyright (C) 2020-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from datetime import datetime import json from pathlib import Path from typing import List +from lazr.restfulclient.errors import RestfulError import pytest -from ..lister import LaunchpadLister +from ..lister import LaunchpadLister, origin class _Repo: def __init__(self, d: dict): for key in d.keys(): if key == "date_last_modified": setattr(self, key, datetime.fromisoformat(d[key])) else: setattr(self, key, d[key]) class _Collection: entries: List[_Repo] = [] def __init__(self, file): self.entries = [_Repo(r) for r in file] def __getitem__(self, key): return self.entries[key] def __len__(self): return len(self.entries) def _launchpad_response(datadir, datafile): return _Collection(json.loads(Path(datadir, datafile).read_text())) @pytest.fixture def launchpad_response1(datadir): return _launchpad_response(datadir, "launchpad_response1.json") @pytest.fixture def launchpad_response2(datadir): return _launchpad_response(datadir, "launchpad_response2.json") -def _mock_getRepositories(mocker, launchpad_response): +@pytest.fixture +def launchpad_bzr_response(datadir): + return _launchpad_response(datadir, "launchpad_bzr_response.json") + + +def _mock_launchpad(mocker, launchpad_response, launchpad_bzr_response=None): mock_launchpad = mocker.patch("swh.lister.launchpad.lister.Launchpad") mock_getRepositories = mock_launchpad.git_repositories.getRepositories - mock_getRepositories.return_value = launchpad_response + if isinstance(launchpad_response, Exception): + mock_getRepositories.side_effect = launchpad_response + else: + mock_getRepositories.return_value = launchpad_response + mock_getBranches = 
mock_launchpad.branches.getBranches + if launchpad_bzr_response is not None: + if isinstance(launchpad_bzr_response, Exception): + mock_getBranches.side_effect = launchpad_bzr_response + else: + mock_getBranches.return_value = launchpad_bzr_response + else: + mock_getBranches.return_value = [] # empty page mock_launchpad.login_anonymously.return_value = mock_launchpad - return mock_getRepositories + return mock_getRepositories, mock_getBranches -def _check_listed_origins(scheduler_origins, launchpad_response): - for origin in launchpad_response: +def _check_listed_origins(scheduler_origins, launchpad_response, vcs_type="git"): + for repo in launchpad_response: filtered_origins = [ - o for o in scheduler_origins if o.url == origin.git_https_url + o for o in scheduler_origins if o.url == origin(vcs_type, repo) ] assert len(filtered_origins) == 1 - assert filtered_origins[0].last_update == origin.date_last_modified + assert filtered_origins[0].last_update == repo.date_last_modified + assert filtered_origins[0].visit_type == vcs_type def test_lister_from_configfile(swh_scheduler_config, mocker): load_from_envvar = mocker.patch("swh.lister.pattern.load_from_envvar") load_from_envvar.return_value = { "scheduler": {"cls": "local", **swh_scheduler_config}, "credentials": {}, } lister = LaunchpadLister.from_configfile() assert lister.scheduler is not None assert lister.credentials is not None -def test_launchpad_full_lister(swh_scheduler, mocker, launchpad_response1): - mock_getRepositories = _mock_getRepositories(mocker, launchpad_response1) +def test_launchpad_full_lister( + swh_scheduler, mocker, launchpad_response1, launchpad_bzr_response +): + mock_getRepositories, mock_getBranches = _mock_launchpad( + mocker, launchpad_response1, launchpad_bzr_response + ) lister = LaunchpadLister(scheduler=swh_scheduler) stats = lister.run() assert not lister.incremental assert lister.updated - assert stats.pages == 1 - assert stats.origins == len(launchpad_response1) + assert 
stats.pages == 1 + 1, "Expects 1 page for git origins, another for bzr ones" + assert stats.origins == len(launchpad_response1) + len(launchpad_bzr_response) mock_getRepositories.assert_called_once_with( order_by="most neglected first", modified_since_date=None ) + mock_getBranches.assert_called_once_with( + order_by="most neglected first", modified_since_date=None + ) scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results - assert len(scheduler_origins) == len(launchpad_response1) + assert len(scheduler_origins) == len(launchpad_response1) + len( + launchpad_bzr_response + ) _check_listed_origins(scheduler_origins, launchpad_response1) + _check_listed_origins(scheduler_origins, launchpad_bzr_response, vcs_type="bzr") def test_launchpad_incremental_lister( - swh_scheduler, mocker, launchpad_response1, launchpad_response2 + swh_scheduler, + mocker, + launchpad_response1, + launchpad_response2, + launchpad_bzr_response, ): - mock_getRepositories = _mock_getRepositories(mocker, launchpad_response1) + mock_getRepositories, mock_getBranches = _mock_launchpad( + mocker, launchpad_response1, launchpad_bzr_response + ) lister = LaunchpadLister(scheduler=swh_scheduler, incremental=True) stats = lister.run() assert lister.incremental assert lister.updated - assert stats.pages == 1 - assert stats.origins == len(launchpad_response1) + assert stats.pages == 1 + 1, "Expects 1 page for git origins, another for bzr ones" + len_first_runs = len(launchpad_response1) + len(launchpad_bzr_response) + assert stats.origins == len_first_runs mock_getRepositories.assert_called_once_with( order_by="most neglected first", modified_since_date=None ) + mock_getBranches.assert_called_once_with( + order_by="most neglected first", modified_since_date=None + ) lister_state = lister.get_state_from_scheduler() - assert lister_state.date_last_modified == launchpad_response1[-1].date_last_modified + assert ( + lister_state.git_date_last_modified + == 
launchpad_response1[-1].date_last_modified + ) + assert ( + lister_state.bzr_date_last_modified + == launchpad_bzr_response[-1].date_last_modified + ) - mock_getRepositories = _mock_getRepositories(mocker, launchpad_response2) + mock_getRepositories, mock_getBranches = _mock_launchpad( + mocker, launchpad_response2 + ) lister = LaunchpadLister(scheduler=swh_scheduler, incremental=True) stats = lister.run() assert lister.incremental assert lister.updated - assert stats.pages == 1 + assert stats.pages == 1, "Empty bzr page response is ignored" assert stats.origins == len(launchpad_response2) mock_getRepositories.assert_called_once_with( order_by="most neglected first", - modified_since_date=lister_state.date_last_modified, + modified_since_date=lister_state.git_date_last_modified, ) scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results - assert len(scheduler_origins) == len(launchpad_response1) + len(launchpad_response2) + assert len(scheduler_origins) == len_first_runs + len(launchpad_response2) _check_listed_origins(scheduler_origins, launchpad_response1) + _check_listed_origins(scheduler_origins, launchpad_bzr_response, vcs_type="bzr") _check_listed_origins(scheduler_origins, launchpad_response2) def test_launchpad_lister_invalid_url_filtering( swh_scheduler, mocker, ): invalid_origin = [_Repo({"git_https_url": "tag:launchpad.net:2008:redacted",})] - _mock_getRepositories(mocker, invalid_origin) + _mock_launchpad(mocker, invalid_origin) lister = LaunchpadLister(scheduler=swh_scheduler) stats = lister.run() assert not lister.updated - assert stats.pages == 1 + assert stats.pages == 1, "Empty pages are ignored (only 1 git page of results)" assert stats.origins == 0 def test_launchpad_lister_duplicated_origin( swh_scheduler, mocker, ): origin = _Repo( { "git_https_url": "https://git.launchpad.net/test", "date_last_modified": "2021-01-14 21:05:31.231406+00:00", } ) origins = [origin, origin] - _mock_getRepositories(mocker, origins) +
_mock_launchpad(mocker, origins) lister = LaunchpadLister(scheduler=swh_scheduler) stats = lister.run() assert lister.updated - assert stats.pages == 1 + assert stats.pages == 1, "Empty bzr pages are ignored (only 1 git page of results)" assert stats.origins == 1 + + +def test_launchpad_lister_raise_during_listing( + swh_scheduler, mocker, launchpad_response1, launchpad_bzr_response +): + lister = LaunchpadLister(scheduler=swh_scheduler) + # Exponential retries take a long time, so stub time.sleep + mocker.patch.object(lister._page_request.retry, "sleep") + + mock_getRepositories, mock_getBranches = _mock_launchpad( + mocker, + RestfulError("Refuse to list git page"), # breaks git page listing + launchpad_bzr_response, + ) + + stats = lister.run() + + assert lister.updated + assert stats.pages == 1 + assert stats.origins == len(launchpad_bzr_response) + + mock_getRepositories, mock_getBranches = _mock_launchpad( + mocker, + launchpad_response1, + RestfulError("Refuse to list bzr"), # breaks bzr page listing + ) + + lister = LaunchpadLister(scheduler=swh_scheduler) + stats = lister.run() + + assert lister.updated + assert stats.pages == 1 + assert stats.origins == len(launchpad_response1) diff --git a/swh/lister/sourceforge/lister.py b/swh/lister/sourceforge/lister.py index 71ee615..c0153c5 100644 --- a/swh/lister/sourceforge/lister.py +++ b/swh/lister/sourceforge/lister.py @@ -1,389 +1,419 @@ # Copyright (C) 2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from dataclasses import dataclass, field import datetime from enum import Enum import logging import re from typing import Any, Dict, Iterator, List, Optional, Set, Tuple from xml.etree import ElementTree +from bs4 import BeautifulSoup import iso8601 import requests from tenacity.before_sleep import before_sleep_log from
swh.core.api.classes import stream_results from swh.lister.utils import retry_policy_generic, throttling_retry from swh.scheduler.interface import SchedulerInterface from swh.scheduler.model import ListedOrigin from .. import USER_AGENT from ..pattern import CredentialsType, Lister logger = logging.getLogger(__name__) class VcsNames(Enum): """Used to filter SourceForge tool names for valid VCS types""" # CVS projects are read-only CVS = "cvs" GIT = "git" SUBVERSION = "svn" MERCURIAL = "hg" BAZAAR = "bzr" VCS_NAMES = set(v.value for v in VcsNames.__members__.values()) @dataclass class SourceForgeListerEntry: vcs: VcsNames url: str last_modified: datetime.date SubSitemapNameT = str ProjectNameT = str # SourceForge only offers day-level granularity, which is good enough for our purposes LastModifiedT = datetime.date @dataclass class SourceForgeListerState: """Current state of the SourceForge lister in incremental runs """ """If the subsitemap does not exist, we assume a full run of this subsitemap is needed. If the date is the same, we skip the subsitemap, otherwise we request the subsitemap and look up every project's "last modified" date to compare against `ListedOrigins` from the database.""" subsitemap_last_modified: Dict[SubSitemapNameT, LastModifiedT] = field( default_factory=dict ) """Some projects (not the majority, but still meaningful) have no VCS for us to archive. We need to remember a mapping of their API URL to their "last modified" date so we don't keep querying them needlessly every time.""" empty_projects: Dict[str, LastModifiedT] = field(default_factory=dict) SourceForgeListerPage = List[SourceForgeListerEntry] MAIN_SITEMAP_URL = "https://sourceforge.net/allura_sitemap/sitemap.xml" SITEMAP_XML_NAMESPACE = "{http://www.sitemaps.org/schemas/sitemap/0.9}" # API resource endpoint for information about the given project. # # `namespace`: Project namespace. Very often `p`, but can be something else like # `adobe` # `project`: Project name, e.g. `seedai`. 
Can be a subproject, e.g. `backapps/website`. PROJECT_API_URL_FORMAT = "https://sourceforge.net/rest/{namespace}/{project}" # Predictable URL for cloning (in the broad sense) a VCS registered for the project. # # Warning: does not apply to bzr repos, and Mercurial repos are HTTP-only, see use of this # constant below. # # `vcs`: VCS type, one of `VCS_NAMES` # `namespace`: Project namespace. Very often `p`, but can be something else like # `adobe`. # `project`: Project name, e.g. `seedai`. Can be a subproject, e.g. `backapps/website`. # `mount_point`: URL path used by the repo. For example, the Code::Blocks project uses # `git` (https://git.code.sf.net/p/codeblocks/git). CLONE_URL_FORMAT = "https://{vcs}.code.sf.net/{namespace}/{project}/{mount_point}" PROJ_URL_RE = re.compile( r"^https://sourceforge.net/(?P<namespace>[^/]+)/(?P<project>[^/]+)/(?P<rest>.*)?" ) # Mapping of `(namespace, project name)` to `last modified` date. ProjectsLastModifiedCache = Dict[Tuple[str, str], LastModifiedT] class SourceForgeLister(Lister[SourceForgeListerState, SourceForgeListerPage]): """List origins from the "SourceForge" forge. """ # Part of the lister API, that identifies this lister LISTER_NAME = "sourceforge" def __init__( self, scheduler: SchedulerInterface, incremental: bool = False, credentials: Optional[CredentialsType] = None, ): super().__init__( scheduler=scheduler, url="https://sourceforge.net", instance="main", credentials=credentials, ) # Will hold the currently saved "last modified" dates to compare against our # requests.
self._project_last_modified: Optional[ProjectsLastModifiedCache] = None self.session = requests.Session() # Declaring the USER_AGENT is more sysadmin-friendly for the forge we list self.session.headers.update( {"Accept": "application/json", "User-Agent": USER_AGENT} ) self.incremental = incremental def state_from_dict(self, d: Dict[str, Dict[str, Any]]) -> SourceForgeListerState: subsitemaps = { k: datetime.date.fromisoformat(v) for k, v in d.get("subsitemap_last_modified", {}).items() } empty_projects = { k: datetime.date.fromisoformat(v) for k, v in d.get("empty_projects", {}).items() } return SourceForgeListerState( subsitemap_last_modified=subsitemaps, empty_projects=empty_projects ) def state_to_dict(self, state: SourceForgeListerState) -> Dict[str, Any]: return { "subsitemap_last_modified": { k: v.isoformat() for k, v in state.subsitemap_last_modified.items() }, "empty_projects": { k: v.isoformat() for k, v in state.empty_projects.items() }, } def projects_last_modified(self) -> ProjectsLastModifiedCache: if not self.incremental: # No point in loading the previous results if we're doing a full run return {} if self._project_last_modified is not None: return self._project_last_modified # We know there will be at least that many origins stream = stream_results( self.scheduler.get_listed_origins, self.lister_obj.id, limit=300_000 ) listed_origins = dict() # Projects can have slashes in them if they're subprojects, but the # mount point (last component) cannot.
url_match = re.compile( r".*\.code\.sf\.net/(?P<namespace>[^/]+)/(?P<project>.+)/.*" ) bzr_url_match = re.compile( r"http://(?P<project>[^/]+).bzr.sourceforge.net/bzrroot/([^/]+)" ) for origin in stream: url = origin.url match = url_match.match(url) if match is None: # Should be a bzr special endpoint match = bzr_url_match.match(url) assert match is not None matches = match.groupdict() project = matches["project"] namespace = "p" # no special namespacing for bzr projects else: matches = match.groupdict() namespace = matches["namespace"] project = matches["project"] # "Last modified" dates are the same across all VCS (tools, even) # within a project or subproject. An assertion here would be overkill. last_modified = origin.last_update assert last_modified is not None listed_origins[(namespace, project)] = last_modified.date() self._project_last_modified = listed_origins return listed_origins @throttling_retry( retry=retry_policy_generic, before_sleep=before_sleep_log(logger, logging.WARNING), ) def page_request(self, url, params) -> requests.Response: # Log listed URL to ease debugging logger.debug("Fetching URL %s with params %s", url, params) response = self.session.get(url, params=params) if response.status_code != 200: # Log response content to ease debugging logger.warning( "Unexpected HTTP status code %s for URL %s", response.status_code, response.url, ) # The lister must fail on blocking errors response.raise_for_status() return response def get_pages(self) -> Iterator[SourceForgeListerPage]: """ SourceForge has a main XML sitemap that lists its sharded sitemaps for all projects. Each XML sub-sitemap lists project pages, which are not unique per project: a project can have a wiki, a home, a git, an svn, etc. For each unique project, we query an API endpoint that lists (among other things) the tools associated with said project, some of which are the VCS used. Subprojects are considered separate projects.
Lastly we use the information of which VCS are used to build the predictable clone URL for any given VCS. """ sitemap_contents = self.page_request(MAIN_SITEMAP_URL, {}).text tree = ElementTree.fromstring(sitemap_contents) for subsitemap in tree.iterfind(f"{SITEMAP_XML_NAMESPACE}sitemap"): last_modified_el = subsitemap.find(f"{SITEMAP_XML_NAMESPACE}lastmod") assert last_modified_el is not None and last_modified_el.text is not None last_modified = datetime.date.fromisoformat(last_modified_el.text) location = subsitemap.find(f"{SITEMAP_XML_NAMESPACE}loc") assert location is not None and location.text is not None sub_url = location.text if self.incremental: recorded_last_mod = self.state.subsitemap_last_modified.get(sub_url) if recorded_last_mod == last_modified: # The entire subsitemap hasn't changed, so none of its projects # have either, skip it. continue self.state.subsitemap_last_modified[sub_url] = last_modified subsitemap_contents = self.page_request(sub_url, {}).text subtree = ElementTree.fromstring(subsitemap_contents) yield from self._get_pages_from_subsitemap(subtree) def get_origins_from_page( self, page: SourceForgeListerPage ) -> Iterator[ListedOrigin]: assert self.lister_obj.id is not None for hit in page: last_modified: str = str(hit.last_modified) last_update: datetime.datetime = iso8601.parse_date(last_modified) yield ListedOrigin( lister_id=self.lister_obj.id, visit_type=hit.vcs.value, url=hit.url, last_update=last_update, ) def _get_pages_from_subsitemap( self, subtree: ElementTree.Element ) -> Iterator[SourceForgeListerPage]: projects: Set[ProjectNameT] = set() for project_block in subtree.iterfind(f"{SITEMAP_XML_NAMESPACE}url"): last_modified_block = project_block.find(f"{SITEMAP_XML_NAMESPACE}lastmod") assert last_modified_block is not None last_modified = last_modified_block.text location = project_block.find(f"{SITEMAP_XML_NAMESPACE}loc") assert location is not None project_url = location.text assert project_url is not None match = 
PROJ_URL_RE.match(project_url) if match: matches = match.groupdict() namespace = matches["namespace"] if namespace == "projects": # These have a `p`-namespaced counterpart, use that instead continue project = matches["project"] rest = matches["rest"] if rest.count("/") > 1: # This is a subproject. There are no sub-subprojects. subproject_name = rest.rsplit("/", 2)[0] project = f"{project}/{subproject_name}" prev_len = len(projects) projects.add(project) if prev_len == len(projects): # Already seen continue pages = self._get_pages_for_project(namespace, project, last_modified) if pages: yield pages else: logger.debug("Project '%s' does not have any VCS", project) else: # Should almost always match, let's log it # The only ones that don't match are mostly specialized one-off URLs. msg = "Project URL '%s' does not match expected pattern" logger.warning(msg, project_url) def _get_pages_for_project( self, namespace, project, last_modified ) -> SourceForgeListerPage: endpoint = PROJECT_API_URL_FORMAT.format(namespace=namespace, project=project) empty_project_last_modified = self.state.empty_projects.get(endpoint) if empty_project_last_modified is not None: if last_modified == empty_project_last_modified.isoformat(): # Project has not changed, so is still empty, meaning it has # no VCS attached that we can archive.
logger.debug(f"Project {namespace}/{project} is still empty") return [] if self.incremental: expected = self.projects_last_modified().get((namespace, project)) if expected is not None: if expected.isoformat() == last_modified: # Project has not changed logger.debug(f"Project {namespace}/{project} has not changed") return [] else: logger.debug(f"Project {namespace}/{project} was updated") else: msg = "New project during an incremental run: %s/%s" logger.debug(msg, namespace, project) try: res = self.page_request(endpoint, {}).json() except requests.HTTPError: # We've already logged in `page_request` return [] tools = res.get("tools") if tools is None: # This rarely happens, on very old URLs logger.warning("Project '%s' does not have any tools", endpoint) return [] hits = [] for tool in tools: tool_name = tool["name"] if tool_name not in VCS_NAMES: continue + if tool_name == VcsNames.CVS.value: + # CVS projects differ from other VCS ones: they use the rsync + # protocol, a list of modules needs to be fetched from an info page, + # and multiple origin URLs can be produced for the same project.
+ cvs_info_url = f"http://{project}.cvs.sourceforge.net" + try: + response = self.page_request(cvs_info_url, params={}) + except requests.HTTPError: + logger.warning( + "CVS info page could not be fetched, skipping project '%s'", + project, + ) + continue + else: + bs = BeautifulSoup(response.text, features="html.parser") + cvs_base_url = "rsync://a.cvs.sourceforge.net/cvsroot" + for text in [b.text for b in bs.find_all("b")]: + match = re.search(fr".*/cvsroot/{project} co -P (.+)", text) + if match is not None: + module = match.group(1) + url = f"{cvs_base_url}/{project}/{module}" + hits.append( + SourceForgeListerEntry( + vcs=VcsNames(tool_name), + url=url, + last_modified=last_modified, + ) + ) + continue url = CLONE_URL_FORMAT.format( vcs=tool_name, namespace=namespace, project=project, mount_point=tool["mount_point"], ) if tool_name == VcsNames.MERCURIAL.value: # SourceForge does not yet support anonymous HTTPS cloning for Mercurial # See https://sourceforge.net/p/forge/feature-requests/727/ url = url.replace("https://", "http://") if tool_name == VcsNames.BAZAAR.value: # SourceForge has removed support for bzr and only keeps legacy projects # around at a separate (also not https) URL. Bzr projects are very rare # and a lot of them are 404 now. url = f"http://{project}.bzr.sourceforge.net/bzrroot/{project}" entry = SourceForgeListerEntry( vcs=VcsNames(tool_name), url=url, last_modified=last_modified ) hits.append(entry) if not hits: date = datetime.date.fromisoformat(last_modified) self.state.empty_projects[endpoint] = date else: self.state.empty_projects.pop(endpoint, None) return hits diff --git a/swh/lister/sourceforge/tests/data/aaron.html b/swh/lister/sourceforge/tests/data/aaron.html new file mode 100644 index 0000000..5b1c226 --- /dev/null +++ b/swh/lister/sourceforge/tests/data/aaron.html @@ -0,0 +1,23 @@ + + + + + + CVS Info for project aaron + + + + + + +

The aaron project's CVS data is in read-only mode, so the project may have switched over to another source-code-management system. To check, visit the Project Summary Page for aaron and see if the menubar lists a newer code repository, such as SVN or Git. + +

The CVS data can be accessed as follows. +You can run a per-module CVS checkout via pserver protocol: +

  • cvs -z3 -d:pserver:anonymous@a.cvs.sourceforge.net:/cvsroot/aaron co -P aaron
  • +
  • cvs -z3 -d:pserver:anonymous@a.cvs.sourceforge.net:/cvsroot/aaron co -P www
  • +

    You can view a list of files or copy all the CVS repository data via rsync (the 1st command lists the files, the 2nd copies): +

  • rsync -a a.cvs.sourceforge.net::cvsroot/aaron/
  • +
  • rsync -ai a.cvs.sourceforge.net::cvsroot/aaron/ /my/local/dest/dir/
  • + +

    If you are a project admin for aaron, you can request that this page redirect to another repo on your project by submitting a support request. diff --git a/swh/lister/sourceforge/tests/data/aaron.json b/swh/lister/sourceforge/tests/data/aaron.json new file mode 100644 index 0000000..8eea8e9 --- /dev/null +++ b/swh/lister/sourceforge/tests/data/aaron.json @@ -0,0 +1,236 @@ +{ + "shortname": "aaron", + "name": "Aaron: the app, service, and net monitor", + "_id": "5139010d5fcbc97960fd66bb", + "url": "https://sourceforge.net/p/aaron/", + "private": false, + "short_description": "Aaron is an application, service, and network availability monitoring and alert daemon. Notification of unavailable services, networks, etc., levels is sent to the appropriate roles. Aaron is highly customizable enterprise class monitoring software.", + "creation_date": "2001-06-24", + "summary": "", + "external_homepage": "http://aaron.sourceforge.net", + "video_url": "", + "socialnetworks": [], + "status": "active", + "moved_to_url": "", + "preferred_support_tool": "", + "preferred_support_url": "", + "developers": [ + { + "username": "kapelmeister", + "name": "Steve Nickels", + "url": "https://sourceforge.net/u/kapelmeister/" + }, + { + "username": "thetitan", + "name": "Sean Chittenden", + "url": "https://sourceforge.net/u/thetitan/" + }, + { + "username": "stwalker", + "name": "Scott Walker", + "url": "https://sourceforge.net/u/stwalker/" + } + ], + "tools": [ + { + "name": "support", + "mount_point": "support", + "url": "/p/aaron/support/", + "icons": { + "24": "images/sftheme/24x24/blog_24.png", + "32": "images/sftheme/32x32/blog_32.png", + "48": "images/sftheme/48x48/blog_48.png" + }, + "installable": false, + "tool_label": "Support", + "mount_label": "Support" + }, + { + "name": "mailman", + "mount_point": "mailman", + "url": "/p/aaron/mailman/", + "icons": { + "24": "images/forums_24.png", + "32": "images/forums_32.png", + "48": "images/forums_48.png" + }, + "installable": false, 
+ "tool_label": "Mailing Lists", + "mount_label": "Mailing Lists" + }, + { + "name": "reviews", + "mount_point": "reviews", + "url": "/p/aaron/reviews/", + "icons": { + "24": "images/sftheme/24x24/blog_24.png", + "32": "images/sftheme/32x32/blog_32.png", + "48": "images/sftheme/48x48/blog_48.png" + }, + "installable": false, + "tool_label": "Reviews", + "mount_label": "Reviews" + }, + { + "name": "wiki", + "mount_point": "wiki", + "url": "/p/aaron/wiki/", + "icons": { + "24": "images/wiki_24.png", + "32": "images/wiki_32.png", + "48": "images/wiki_48.png" + }, + "installable": true, + "tool_label": "Wiki", + "mount_label": "Wiki" + }, + { + "name": "summary", + "mount_point": "summary", + "url": "/p/aaron/summary/", + "icons": { + "24": "images/sftheme/24x24/blog_24.png", + "32": "images/sftheme/32x32/blog_32.png", + "48": "images/sftheme/48x48/blog_48.png" + }, + "installable": false, + "tool_label": "Summary", + "mount_label": "Summary", + "sourceforge_group_id": 29993 + }, + { + "name": "files-sf", + "mount_point": "files", + "url": "/p/aaron/files/", + "icons": { + "24": "images/downloads_24.png", + "32": "images/downloads_32.png", + "48": "images/downloads_48.png" + }, + "installable": false, + "tool_label": "Files", + "mount_label": "Files" + }, + { + "name": "cvs", + "mount_point": "code", + "url": "/p/aaron/code/", + "icons": { + "24": "images/code_24.png", + "32": "images/code_32.png", + "48": "images/code_48.png" + }, + "installable": false, + "tool_label": "CVS", + "mount_label": "Code" + }, + { + "name": "activity", + "mount_point": "activity", + "url": "/p/aaron/activity/", + "icons": { + "24": "images/admin_24.png", + "32": "images/admin_32.png", + "48": "images/admin_48.png" + }, + "installable": false, + "tool_label": "Tool", + "mount_label": "Activity" + }, + { + "name": "discussion", + "mount_point": "discussion", + "url": "/p/aaron/discussion/", + "icons": { + "24": "images/forums_24.png", + "32": "images/forums_32.png", + "48": 
"images/forums_48.png" + }, + "installable": true, + "tool_label": "Discussion", + "mount_label": "Discussion" + } + ], + "labels": [], + "categories": { + "audience": [ + { + "id": 4, + "shortname": "sysadmins", + "fullname": "System Administrators", + "fullpath": "Intended Audience :: by End-User Class :: System Administrators" + } + ], + "developmentstatus": [ + { + "id": 8, + "shortname": "prealpha", + "fullname": "2 - Pre-Alpha", + "fullpath": "Development Status :: 2 - Pre-Alpha" + }, + { + "id": 7, + "shortname": "planning", + "fullname": "1 - Planning", + "fullpath": "Development Status :: 1 - Planning" + } + ], + "environment": [ + { + "id": 238, + "shortname": "daemon", + "fullname": "Non-interactive (Daemon)", + "fullpath": "User Interface :: Non-interactive (Daemon)" + } + ], + "language": [ + { + "id": 164, + "shortname": "c", + "fullname": "C", + "fullpath": "Programming Language :: C" + }, + { + "id": 293, + "shortname": "ruby", + "fullname": "Ruby", + "fullpath": "Programming Language :: Ruby" + } + ], + "license": [ + { + "id": 296, + "shortname": "apache", + "fullname": "Apache Software License", + "fullpath": "License :: OSI-Approved Open Source :: Apache Software License" + } + ], + "translation": [ + { + "id": 275, + "shortname": "english", + "fullname": "English", + "fullpath": "Translations :: English" + } + ], + "os": [ + { + "id": 235, + "shortname": "independent", + "fullname": "OS Independent (Written in an interpreted language)", + "fullpath": "Operating System :: Grouping and Descriptive Categories :: OS Independent (Written in an interpreted language)" + } + ], + "database": [], + "topic": [ + { + "id": 152, + "shortname": "monitoring", + "fullname": "Monitoring", + "fullpath": "Topic :: System :: Networking :: Monitoring" + } + ] + }, + "icon_url": null, + "screenshots": [] +} \ No newline at end of file diff --git a/swh/lister/sourceforge/tests/data/subsitemap-0.xml b/swh/lister/sourceforge/tests/data/subsitemap-0.xml index 
5f2cba8..451554a 100644 --- a/swh/lister/sourceforge/tests/data/subsitemap-0.xml +++ b/swh/lister/sourceforge/tests/data/subsitemap-0.xml @@ -1,69 +1,84 @@ + + https://sourceforge.net/projects/aaron/files/ + 2013-03-07 + daily + + + https://sourceforge.net/p/aaron/home/ + 2013-03-07 + daily + + + https://sourceforge.net/p/aaron/tickets/ + 2013-03-07 + daily + https://sourceforge.net/projects/os3dmodels/files/ 2017-03-31 daily https://sourceforge.net/p/os3dmodels/home/ 2017-03-31 daily https://sourceforge.net/p/os3dmodels/tickets/ 2017-03-31 daily https://sourceforge.net/p/mramm/home/ 2019-04-04 daily https://sourceforge.net/p/mramm/todo/ 2019-04-04 daily https://sourceforge.net/p/mramm/notes/ 2019-04-04 daily https://sourceforge.net/p/mramm/reviews/ 2019-04-04 daily https://sourceforge.net/p/mramm/discussion/ 2019-04-04 daily https://sourceforge.net/adobe/adobexmp/home/ 2017-10-17 daily https://sourceforge.net/adobe/adobexmp/wiki/ 2017-10-17 daily https://sourceforge.net/adobe/adobexmp/discussion/ 2017-10-17 daily https://sourceforge.net/projects/backapps/files/ 2021-02-11 daily https://sourceforge.net/p/backapps/tickets/ 2021-02-11 daily diff --git a/swh/lister/sourceforge/tests/test_lister.py b/swh/lister/sourceforge/tests/test_lister.py index 9bb9a7c..3dfa595 100644 --- a/swh/lister/sourceforge/tests/test_lister.py +++ b/swh/lister/sourceforge/tests/test_lister.py @@ -1,434 +1,450 @@ # Copyright (C) 2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime import functools import json from pathlib import Path import re from iso8601 import iso8601 import pytest from requests.exceptions import HTTPError from swh.lister import USER_AGENT from swh.lister.sourceforge.lister import ( MAIN_SITEMAP_URL, PROJECT_API_URL_FORMAT, SourceForgeLister, SourceForgeListerState, ) from 
swh.lister.tests.test_utils import assert_sleep_calls from swh.lister.utils import WAIT_EXP_BASE # Mapping of project name to namespace from swh.scheduler.model import ListedOrigin TEST_PROJECTS = { + "aaron": "p", "adobexmp": "adobe", "backapps": "p", "backapps/website": "p", "bzr-repo": "p", "mojunk": "p", "mramm": "p", "os3dmodels": "p", "random-mercurial": "p", } URLS_MATCHER = { PROJECT_API_URL_FORMAT.format(namespace=namespace, project=project): project for project, namespace in TEST_PROJECTS.items() } def get_main_sitemap(datadir): return Path(datadir, "main-sitemap.xml").read_text() def get_subsitemap_0(datadir): return Path(datadir, "subsitemap-0.xml").read_text() def get_subsitemap_1(datadir): return Path(datadir, "subsitemap-1.xml").read_text() def get_project_json(datadir, request, context): url = request.url project = URLS_MATCHER.get(url) assert project is not None, f"Url '{url}' could not be matched" project = project.replace("/", "-") return json.loads(Path(datadir, f"{project}.json").read_text()) +def get_cvs_info_page(datadir): + return Path(datadir, "aaron.html").read_text() + + def _check_request_headers(request): return request.headers.get("User-Agent") == USER_AGENT def _check_listed_origins(lister, swh_scheduler): scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results res = {o.url: (o.visit_type, str(o.last_update.date())) for o in scheduler_origins} assert res == { "https://svn.code.sf.net/p/backapps/website/code": ("svn", "2021-02-11"), "https://git.code.sf.net/p/os3dmodels/git": ("git", "2017-03-31"), "https://svn.code.sf.net/p/os3dmodels/svn": ("svn", "2017-03-31"), "https://git.code.sf.net/p/mramm/files": ("git", "2019-04-04"), "https://git.code.sf.net/p/mramm/git": ("git", "2019-04-04"), "https://svn.code.sf.net/p/mramm/svn": ("svn", "2019-04-04"), "https://git.code.sf.net/p/mojunk/git": ("git", "2017-12-31"), "https://git.code.sf.net/p/mojunk/git2": ("git", "2017-12-31"), 
"https://svn.code.sf.net/p/mojunk/svn": ("svn", "2017-12-31"), "http://hg.code.sf.net/p/random-mercurial/hg": ("hg", "2019-05-02"), "http://bzr-repo.bzr.sourceforge.net/bzrroot/bzr-repo": ("bzr", "2021-01-27"), + "rsync://a.cvs.sourceforge.net/cvsroot/aaron/aaron": ("cvs", "2013-03-07"), + "rsync://a.cvs.sourceforge.net/cvsroot/aaron/www": ("cvs", "2013-03-07"), } def test_sourceforge_lister_full(swh_scheduler, requests_mock, datadir): """ Simulate a full listing of an artificially restricted sourceforge. There are 5 different projects, spread over two sub-sitemaps, a few of which have multiple VCS listed, one has none, one is outside of the standard `/p/` namespace, some with custom mount points. All non-interesting but related entries have been kept. """ lister = SourceForgeLister(scheduler=swh_scheduler) requests_mock.get( MAIN_SITEMAP_URL, text=get_main_sitemap(datadir), additional_matcher=_check_request_headers, ) requests_mock.get( "https://sourceforge.net/allura_sitemap/sitemap-0.xml", text=get_subsitemap_0(datadir), additional_matcher=_check_request_headers, ) requests_mock.get( "https://sourceforge.net/allura_sitemap/sitemap-1.xml", text=get_subsitemap_1(datadir), additional_matcher=_check_request_headers, ) requests_mock.get( re.compile("https://sourceforge.net/rest/.*"), json=functools.partial(get_project_json, datadir), additional_matcher=_check_request_headers, ) + requests_mock.get( + re.compile("http://aaron.cvs.sourceforge.net/"), + text=get_cvs_info_page(datadir), + additional_matcher=_check_request_headers, + ) stats = lister.run() # - os3dmodels (2 repos), # - mramm (3 repos), # - mojunk (3 repos), # - backapps/website (1 repo), # - random-mercurial (1 repo). # - bzr-repo (1 repo). # adobe and backapps itself have no repos. 
-    assert stats.pages == 6
-    assert stats.origins == 11
+    assert stats.pages == 7
+    assert stats.origins == 13

     expected_state = {
         "subsitemap_last_modified": {
             "https://sourceforge.net/allura_sitemap/sitemap-0.xml": "2021-03-18",
             "https://sourceforge.net/allura_sitemap/sitemap-1.xml": "2021-03-18",
         },
         "empty_projects": {
             "https://sourceforge.net/rest/p/backapps": "2021-02-11",
             "https://sourceforge.net/rest/adobe/adobexmp": "2017-10-17",
         },
     }
     assert lister.state_to_dict(lister.state) == expected_state

     _check_listed_origins(lister, swh_scheduler)


 def test_sourceforge_lister_incremental(swh_scheduler, requests_mock, datadir, mocker):
     """
     Simulate an incremental listing of an artificially restricted sourceforge.
     Same dataset as the full run, because it's enough to validate the different cases.
     """
     lister = SourceForgeLister(scheduler=swh_scheduler, incremental=True)

     requests_mock.get(
         MAIN_SITEMAP_URL,
         text=get_main_sitemap(datadir),
         additional_matcher=_check_request_headers,
     )

     def not_called(request, *args, **kwargs):
         raise AssertionError(f"Should not have been called: '{request.url}'")

     requests_mock.get(
         "https://sourceforge.net/allura_sitemap/sitemap-0.xml",
         text=get_subsitemap_0(datadir),
         additional_matcher=_check_request_headers,
     )
     requests_mock.get(
         "https://sourceforge.net/allura_sitemap/sitemap-1.xml",
         text=not_called,
         additional_matcher=_check_request_headers,
     )

     def filtered_get_project_json(request, context):
         # These projects should not be requested again
         assert URLS_MATCHER[request.url] not in {"adobe", "mojunk"}
         return get_project_json(datadir, request, context)

     requests_mock.get(
         re.compile("https://sourceforge.net/rest/.*"),
         json=filtered_get_project_json,
         additional_matcher=_check_request_headers,
     )

+    requests_mock.get(
+        re.compile("http://aaron.cvs.sourceforge.net/"),
+        text=get_cvs_info_page(datadir),
+        additional_matcher=_check_request_headers,
+    )
+
     faked_listed_origins = [
         # mramm: changed
         ListedOrigin(
             lister_id=lister.lister_obj.id,
visit_type="git", url="https://git.code.sf.net/p/mramm/files", last_update=iso8601.parse_date("2019-01-01"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="git", url="https://git.code.sf.net/p/mramm/git", last_update=iso8601.parse_date("2019-01-01"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="svn", url="https://svn.code.sf.net/p/mramm/svn", last_update=iso8601.parse_date("2019-01-01"), ), # stayed the same, even though its subsitemap has changed ListedOrigin( lister_id=lister.lister_obj.id, visit_type="git", url="https://git.code.sf.net/p/os3dmodels/git", last_update=iso8601.parse_date("2017-03-31"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="svn", url="https://svn.code.sf.net/p/os3dmodels/svn", last_update=iso8601.parse_date("2017-03-31"), ), # others: stayed the same, should be skipped ListedOrigin( lister_id=lister.lister_obj.id, visit_type="git", url="https://git.code.sf.net/p/mojunk/git", last_update=iso8601.parse_date("2017-12-31"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="git", url="https://git.code.sf.net/p/mojunk/git2", last_update=iso8601.parse_date("2017-12-31"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="svn", url="https://svn.code.sf.net/p/mojunk/svn", last_update=iso8601.parse_date("2017-12-31"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="svn", url="https://svn.code.sf.net/p/backapps/website/code", last_update=iso8601.parse_date("2021-02-11"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="hg", url="http://hg.code.sf.net/p/random-mercurial/hg", last_update=iso8601.parse_date("2019-05-02"), ), ListedOrigin( lister_id=lister.lister_obj.id, visit_type="bzr", url="http://bzr-repo.bzr.sourceforge.net/bzrroot/bzr-repo", last_update=iso8601.parse_date("2021-01-27"), ), ] swh_scheduler.record_listed_origins(faked_listed_origins) to_date = datetime.date.fromisoformat faked_state = SourceForgeListerState( subsitemap_last_modified={ # changed 
"https://sourceforge.net/allura_sitemap/sitemap-0.xml": to_date( "2021-02-18" ), # stayed the same "https://sourceforge.net/allura_sitemap/sitemap-1.xml": to_date( "2021-03-18" ), }, empty_projects={ "https://sourceforge.net/rest/p/backapps": to_date("2020-02-11"), "https://sourceforge.net/rest/adobe/adobexmp": to_date("2017-10-17"), }, ) lister.state = faked_state stats = lister.run() # - mramm (3 repos), # changed - assert stats.pages == 1 - assert stats.origins == 3 + assert stats.pages == 2 + assert stats.origins == 5 expected_state = { "subsitemap_last_modified": { "https://sourceforge.net/allura_sitemap/sitemap-0.xml": "2021-03-18", "https://sourceforge.net/allura_sitemap/sitemap-1.xml": "2021-03-18", }, "empty_projects": { "https://sourceforge.net/rest/p/backapps": "2021-02-11", # changed "https://sourceforge.net/rest/adobe/adobexmp": "2017-10-17", }, } assert lister.state_to_dict(lister.state) == expected_state # origins have been updated _check_listed_origins(lister, swh_scheduler) def test_sourceforge_lister_retry(swh_scheduler, requests_mock, mocker, datadir): lister = SourceForgeLister(scheduler=swh_scheduler) # Exponential retries take a long time, so stub time.sleep mocked_sleep = mocker.patch.object(lister.page_request.retry, "sleep") requests_mock.get( MAIN_SITEMAP_URL, [ {"status_code": 429}, {"status_code": 429}, {"text": get_main_sitemap(datadir)}, ], additional_matcher=_check_request_headers, ) requests_mock.get( "https://sourceforge.net/allura_sitemap/sitemap-0.xml", [{"status_code": 429}, {"text": get_subsitemap_0(datadir), "status_code": 301}], additional_matcher=_check_request_headers, ) requests_mock.get( "https://sourceforge.net/allura_sitemap/sitemap-1.xml", [{"status_code": 429}, {"text": get_subsitemap_1(datadir)}], additional_matcher=_check_request_headers, ) requests_mock.get( re.compile("https://sourceforge.net/rest/.*"), [{"status_code": 429}, {"json": functools.partial(get_project_json, datadir)}], 
         additional_matcher=_check_request_headers,
     )

+    requests_mock.get(
+        re.compile("http://aaron.cvs.sourceforge.net/"),
+        text=get_cvs_info_page(datadir),
+        additional_matcher=_check_request_headers,
+    )
+
     stats = lister.run()
     # - os3dmodels (2 repos),
     # - mramm (3 repos),
     # - mojunk (3 repos),
     # - backapps/website (1 repo),
     # - random-mercurial (1 repo).
     # - bzr-repo (1 repo).
     # adobe and backapps itself have no repos.
-    assert stats.pages == 6
-    assert stats.origins == 11
+    assert stats.pages == 7
+    assert stats.origins == 13

-    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
-    assert {o.url: o.visit_type for o in scheduler_origins} == {
-        "https://svn.code.sf.net/p/backapps/website/code": "svn",
-        "https://git.code.sf.net/p/os3dmodels/git": "git",
-        "https://svn.code.sf.net/p/os3dmodels/svn": "svn",
-        "https://git.code.sf.net/p/mramm/files": "git",
-        "https://git.code.sf.net/p/mramm/git": "git",
-        "https://svn.code.sf.net/p/mramm/svn": "svn",
-        "https://git.code.sf.net/p/mojunk/git": "git",
-        "https://git.code.sf.net/p/mojunk/git2": "git",
-        "https://svn.code.sf.net/p/mojunk/svn": "svn",
-        "http://hg.code.sf.net/p/random-mercurial/hg": "hg",
-        "http://bzr-repo.bzr.sourceforge.net/bzrroot/bzr-repo": "bzr",
-    }
+    _check_listed_origins(lister, swh_scheduler)

     # Test `time.sleep` is called with exponential retries
     assert_sleep_calls(mocker, mocked_sleep, [1, WAIT_EXP_BASE, 1, 1])


 @pytest.mark.parametrize("status_code", [500, 503, 504, 403, 404])
 def test_sourceforge_lister_http_error(
     swh_scheduler, requests_mock, status_code, mocker
 ):
     lister = SourceForgeLister(scheduler=swh_scheduler)

     # Exponential retries take a long time, so stub time.sleep
     mocked_sleep = mocker.patch.object(lister.page_request.retry, "sleep")

     requests_mock.get(MAIN_SITEMAP_URL, status_code=status_code)

     with pytest.raises(HTTPError):
         lister.run()

     exp_retries = []
     if status_code >= 500:
         exp_retries = [1.0, 10.0, 100.0, 1000.0]

     assert_sleep_calls(mocker,
mocked_sleep, exp_retries)


 @pytest.mark.parametrize("status_code", [500, 503, 504, 403, 404])
 def test_sourceforge_lister_project_error(
     datadir, swh_scheduler, requests_mock, status_code, mocker
 ):
     lister = SourceForgeLister(scheduler=swh_scheduler)

     # Exponential retries take a long time, so stub time.sleep
     mocker.patch.object(lister.page_request.retry, "sleep")

     requests_mock.get(
         MAIN_SITEMAP_URL,
         text=get_main_sitemap(datadir),
         additional_matcher=_check_request_headers,
     )
     requests_mock.get(
         "https://sourceforge.net/allura_sitemap/sitemap-0.xml",
         text=get_subsitemap_0(datadir),
         additional_matcher=_check_request_headers,
     )
     requests_mock.get(
         "https://sourceforge.net/allura_sitemap/sitemap-1.xml",
         text=get_subsitemap_1(datadir),
         additional_matcher=_check_request_headers,
     )
     # Request mocks precedence is LIFO
     requests_mock.get(
         re.compile("https://sourceforge.net/rest/.*"),
         json=functools.partial(get_project_json, datadir),
         additional_matcher=_check_request_headers,
     )

     # Make all `mramm` requests fail
     # `mramm` is in subsitemap 0, which ensures we keep listing after an error.
     requests_mock.get(
         re.compile("https://sourceforge.net/rest/p/mramm"), status_code=status_code
     )

+    # Make request to CVS info page fail
+    requests_mock.get(
+        re.compile("http://aaron.cvs.sourceforge.net/"), status_code=status_code
+    )
+
     stats = lister.run()
     # - os3dmodels (2 repos),
     # - mojunk (3 repos),
     # - backapps/website (1 repo),
     # - random-mercurial (1 repo).
     # - bzr-repo (1 repo).
     # adobe and backapps itself have no repos.
     # Did *not* list mramm
     assert stats.pages == 5
     assert stats.origins == 8

     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
     res = {o.url: (o.visit_type, str(o.last_update.date())) for o in scheduler_origins}
     # Ensure no `mramm` origins are listed, but all others are.
     assert res == {
         "https://svn.code.sf.net/p/backapps/website/code": ("svn", "2021-02-11"),
         "https://git.code.sf.net/p/os3dmodels/git": ("git", "2017-03-31"),
         "https://svn.code.sf.net/p/os3dmodels/svn": ("svn", "2017-03-31"),
         "https://git.code.sf.net/p/mojunk/git": ("git", "2017-12-31"),
         "https://git.code.sf.net/p/mojunk/git2": ("git", "2017-12-31"),
         "https://svn.code.sf.net/p/mojunk/svn": ("svn", "2017-12-31"),
         "http://hg.code.sf.net/p/random-mercurial/hg": ("hg", "2019-05-02"),
         "http://bzr-repo.bzr.sourceforge.net/bzrroot/bzr-repo": ("bzr", "2021-01-27"),
     }
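
The updated counts in these tests can be cross-checked with a small standalone sketch. It is illustrative only and not part of `swh.lister`: `REPOS_PER_PROJECT`, `cvs_origin_url` and `expected_stats` are hypothetical names, the per-project repository counts are copied from the test comments and expected origins, and the rule "one page per project that was successfully listed with at least one repository" is an assumption inferred from the asserted numbers rather than from the lister code.

```python
# Hypothetical helper names; a sanity check of the numbers asserted in the tests.

# Repositories per project, taken from the test comments. "aaron" is the CVS
# project added by this change, with two modules ("aaron" and "www").
REPOS_PER_PROJECT = {
    "os3dmodels": 2,
    "mramm": 3,
    "mojunk": 3,
    "backapps/website": 1,
    "random-mercurial": 1,
    "bzr-repo": 1,
    "aaron": 2,
}


def cvs_origin_url(project: str, module: str) -> str:
    """Build a CVS origin URL in the shape the tests expect (assumed pattern)."""
    return f"rsync://a.cvs.sourceforge.net/cvsroot/{project}/{module}"


def expected_stats(failing=()):
    """Return (pages, origins), assuming one page per successfully listed project."""
    listed = {
        project: count
        for project, count in REPOS_PER_PROJECT.items()
        if project not in failing
    }
    return len(listed), sum(listed.values())


if __name__ == "__main__":
    print(expected_stats())                              # full and retry runs
    print(expected_stats(failing={"mramm", "aaron"}))    # project-error run
    print(cvs_origin_url("aaron", "www"))
```

Under the same assumption, the incremental run touches only `mramm` (3 repos) and `aaron` (2 repos), which agrees with the asserted 2 pages and 5 origins.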