diff --git a/PKG-INFO b/PKG-INFO index 48f85b5..487ccb8 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,125 +1,125 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 4.1.1 +Version: 4.2.0 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component of the Software Heritage stack aims to produce listings of software origins and their URLs hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origin listing behaviors. It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.golang` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` - `swh.lister.gogs` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `golang`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister names introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. create the configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimal configuration shared by all listers, to add to the file `~/.config/swh/listers.yml`:

```yaml
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
```

Note: this expects the scheduler service to be running locally on port 5008. ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters]
```

Examples:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi
```

Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO index 48f85b5..487ccb8 100644 --- a/swh.lister.egg-info/PKG-INFO +++ b/swh.lister.egg-info/PKG-INFO @@ -1,125 +1,125 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 4.1.1 +Version: 4.2.0 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component of the Software Heritage stack aims to produce listings of software origins and their URLs hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origin listing behaviors. It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.golang` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` - `swh.lister.gogs` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `golang`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister names introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. create the configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimal configuration shared by all listers, to add to the file `~/.config/swh/listers.yml`:

```yaml
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
```

Note: this expects the scheduler service to be running locally on port 5008. ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters]
```

Examples:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi
```

Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program.
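A configured lister can also be driven from Python rather than through the `swh` CLI. The following is a minimal sketch, not an official entry point: it assumes the scheduler service from the sample configuration above is reachable, that `swh.scheduler.get_scheduler` accepts the connection url directly, and it reuses the `MavenLister` constructor arguments and `ListerStats` fields that appear in the test code further down in this diff.

```python
from swh.lister.maven.lister import MavenLister
from swh.scheduler import get_scheduler

# Remote scheduler instance, matching the listers.yml sample above.
scheduler = get_scheduler(cls="remote", url="http://localhost:5008/")

lister = MavenLister(
    scheduler=scheduler,
    url="https://repo1.maven.org/maven2/",  # main maven repo url
    index_url="http://indexes/export.fld",  # placeholder index url from the tests
    instance="maven.org",
    incremental=False,
)

# run() pages through the index and records origins with the scheduler;
# it returns a ListerStats with the page and origin counts.
stats = lister.run()
print(f"pages={stats.pages} origins={stats.origins}")
```

This is equivalent to `swh lister run --lister maven url=... index_url=...`, which is what the CLI examples above do for the other listers.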
diff --git a/swh.lister.egg-info/SOURCES.txt b/swh.lister.egg-info/SOURCES.txt index 4414e93..863873c 100644 --- a/swh.lister.egg-info/SOURCES.txt +++ b/swh.lister.egg-info/SOURCES.txt @@ -1,419 +1,422 @@ .git-blame-ignore-revs .gitignore .pre-commit-config.yaml ACKNOWLEDGEMENTS CODE_OF_CONDUCT.md CONTRIBUTORS LICENSE MANIFEST.in Makefile README.md conftest.py mypy.ini pyproject.toml pytest.ini requirements-swh.txt requirements-test.txt requirements.txt setup.cfg setup.py tox.ini docs/.gitignore docs/Makefile docs/cli.rst docs/conf.py docs/index.rst docs/new_lister_template.py docs/run_a_new_lister.rst docs/save_forge.rst docs/tutorial.rst docs/_static/.placeholder docs/_templates/.placeholder docs/images/new_base.png docs/images/new_bitbucket_lister.png docs/images/new_github_lister.png docs/images/old_github_lister.png sql/crawler.sql sql/pimp_db.sql swh/__init__.py swh.lister.egg-info/PKG-INFO swh.lister.egg-info/SOURCES.txt swh.lister.egg-info/dependency_links.txt swh.lister.egg-info/entry_points.txt swh.lister.egg-info/requires.txt swh.lister.egg-info/top_level.txt swh/lister/__init__.py swh/lister/cli.py swh/lister/pattern.py swh/lister/py.typed swh/lister/utils.py swh/lister/arch/__init__.py swh/lister/arch/lister.py swh/lister/arch/tasks.py swh/lister/arch/tests/__init__.py swh/lister/arch/tests/test_lister.py swh/lister/arch/tests/test_tasks.py swh/lister/arch/tests/data/fake_archlinux_archives_init.sh swh/lister/arch/tests/data/https_archive.archlinux.org/packages_d_dialog swh/lister/arch/tests/data/https_archive.archlinux.org/packages_g_gnome-code-assistance swh/lister/arch/tests/data/https_archive.archlinux.org/packages_g_gzip swh/lister/arch/tests/data/https_archive.archlinux.org/packages_l_libasyncns swh/lister/arch/tests/data/https_archive.archlinux.org/packages_m_mercurial swh/lister/arch/tests/data/https_archive.archlinux.org/packages_p_python-hglib swh/lister/arch/tests/data/https_archive.archlinux.org/repos_last_community_os_x86_64_community.files.tar.gz swh/lister/arch/tests/data/https_archive.archlinux.org/repos_last_core_os_x86_64_core.files.tar.gz swh/lister/arch/tests/data/https_archive.archlinux.org/repos_last_extra_os_x86_64_extra.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/aarch64_community_community.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/aarch64_core_core.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/aarch64_extra_extra.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/armv7h_community_community.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/armv7h_core_core.files.tar.gz swh/lister/arch/tests/data/https_uk.mirror.archlinuxarm.org/armv7h_extra_extra.files.tar.gz swh/lister/aur/__init__.py swh/lister/aur/lister.py swh/lister/aur/tasks.py swh/lister/aur/tests/__init__.py swh/lister/aur/tests/test_lister.py swh/lister/aur/tests/test_tasks.py swh/lister/aur/tests/data/fake_aur_packages.sh swh/lister/aur/tests/data/packages-meta-v1.json.gz swh/lister/bitbucket/__init__.py swh/lister/bitbucket/lister.py swh/lister/bitbucket/tasks.py swh/lister/bitbucket/tests/__init__.py swh/lister/bitbucket/tests/test_lister.py swh/lister/bitbucket/tests/test_tasks.py swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json swh/lister/bower/__init__.py swh/lister/bower/lister.py swh/lister/bower/tasks.py swh/lister/bower/tests/__init__.py swh/lister/bower/tests/test_lister.py 
swh/lister/bower/tests/test_tasks.py swh/lister/bower/tests/data/https_registry.bower.io/packages swh/lister/cgit/__init__.py swh/lister/cgit/lister.py swh/lister/cgit/tasks.py swh/lister/cgit/tests/__init__.py swh/lister/cgit/tests/repo_list.txt swh/lister/cgit/tests/test_lister.py swh/lister/cgit/tests/test_tasks.py swh/lister/cgit/tests/data/https_git.acdw.net/README swh/lister/cgit/tests/data/https_git.acdw.net/cgit swh/lister/cgit/tests/data/https_git.acdw.net/foo swh/lister/cgit/tests/data/https_git.acdw.net/foo_summary swh/lister/cgit/tests/data/https_git.acdw.net/sfeed swh/lister/cgit/tests/data/https_git.acdw.net/sfeed_summary swh/lister/cgit/tests/data/https_git.baserock.org/cgit swh/lister/cgit/tests/data/https_git.eclipse.org/c swh/lister/cgit/tests/data/https_git.savannah.gnu.org/README swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit_elisp-es.git swh/lister/cgit/tests/data/https_git.tizen/README swh/lister/cgit/tests/data/https_git.tizen/cgit swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=100 swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=50 swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Users swh/lister/cgit/tests/data/https_git.tizen/cgit_Lock-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-base swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-mc1n2 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e3250 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e4x12 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_devices_nfc-plugin-nxp swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_intel_mfld_bootstub-mfld-blackbay swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_mtdev swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_opengl-es-virtual-drv swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libdrm swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libnl swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_xorg_driver_xserver-xorg-misc swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-gallery-efl swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-homescreen-efl swh/lister/cgit/tests/data/https_jff.email/cgit swh/lister/conda/__init__.py swh/lister/conda/lister.py swh/lister/conda/tasks.py swh/lister/conda/tests/__init__.py swh/lister/conda/tests/test_lister.py swh/lister/conda/tests/test_tasks.py swh/lister/conda/tests/data/https_conda.anaconda.org/conda-forge_linux-64_repodata.json.bz2 swh/lister/conda/tests/data/https_repo.anaconda.com/pkgs_free_linux-64_repodata.json.bz2 swh/lister/conda/tests/data/https_repo.anaconda.com/pkgs_free_osx-64_repodata.json.bz2 swh/lister/conda/tests/data/https_repo.anaconda.com/pkgs_free_win-64_repodata.json.bz2 swh/lister/conda/tests/data/https_repo.anaconda.com/pkgs_main_linux-64_repodata.json.bz2 swh/lister/conda/tests/data/https_repo.anaconda.com/pkgs_pro_linux-64_repodata.json.bz2 swh/lister/cpan/__init__.py swh/lister/cpan/lister.py swh/lister/cpan/tasks.py swh/lister/cpan/tests/__init__.py swh/lister/cpan/tests/test_lister.py swh/lister/cpan/tests/test_tasks.py swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page1 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page2 
swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page3 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page4 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_release__search swh/lister/cran/__init__.py swh/lister/cran/list_all_packages.R swh/lister/cran/lister.py swh/lister/cran/tasks.py swh/lister/cran/tests/__init__.py swh/lister/cran/tests/test_lister.py swh/lister/cran/tests/test_tasks.py swh/lister/cran/tests/data/list-r-packages.json swh/lister/crates/__init__.py swh/lister/crates/lister.py swh/lister/crates/tasks.py swh/lister/crates/tests/__init__.py swh/lister/crates/tests/test_lister.py swh/lister/crates/tests/test_tasks.py swh/lister/crates/tests/data/fake_crates_repository_init.sh swh/lister/crates/tests/data/https_static.crates.io/db-dump.tar.gz swh/lister/crates/tests/data/https_static.crates.io/db-dump.tar.gz_visit1 swh/lister/debian/__init__.py swh/lister/debian/lister.py swh/lister/debian/tasks.py swh/lister/debian/tests/__init__.py swh/lister/debian/tests/test_lister.py swh/lister/debian/tests/test_tasks.py swh/lister/debian/tests/data/Sources_bullseye swh/lister/debian/tests/data/Sources_buster swh/lister/debian/tests/data/Sources_stretch swh/lister/gitea/__init__.py swh/lister/gitea/lister.py swh/lister/gitea/tasks.py swh/lister/gitea/tests/__init__.py swh/lister/gitea/tests/test_lister.py swh/lister/gitea/tests/test_tasks.py swh/lister/gitea/tests/data/https_try.gitea.io/repos_page1 swh/lister/gitea/tests/data/https_try.gitea.io/repos_page2 swh/lister/github/__init__.py swh/lister/github/lister.py swh/lister/github/tasks.py swh/lister/github/utils.py swh/lister/github/tests/__init__.py swh/lister/github/tests/test_lister.py swh/lister/github/tests/test_tasks.py swh/lister/gitlab/__init__.py swh/lister/gitlab/lister.py swh/lister/gitlab/tasks.py swh/lister/gitlab/tests/__init__.py swh/lister/gitlab/tests/test_lister.py swh/lister/gitlab/tests/test_tasks.py swh/lister/gitlab/tests/data/https_foss.heptapod.net/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page2.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page3.json swh/lister/gitlab/tests/data/https_gitlab.com/api_response_page1.json swh/lister/gnu/__init__.py swh/lister/gnu/lister.py swh/lister/gnu/tasks.py swh/lister/gnu/tree.py swh/lister/gnu/tests/__init__.py swh/lister/gnu/tests/test_lister.py swh/lister/gnu/tests/test_tasks.py swh/lister/gnu/tests/test_tree.py swh/lister/gnu/tests/data/tree.json swh/lister/gnu/tests/data/tree.min.json swh/lister/gnu/tests/data/https_ftp.gnu.org/tree.json.gz swh/lister/gogs/__init__.py swh/lister/gogs/lister.py swh/lister/gogs/tasks.py swh/lister/gogs/tests/__init__.py swh/lister/gogs/tests/test_lister.py swh/lister/gogs/tests/test_tasks.py swh/lister/gogs/tests/data/https_try.gogs.io/repos_page1 swh/lister/gogs/tests/data/https_try.gogs.io/repos_page2 swh/lister/gogs/tests/data/https_try.gogs.io/repos_page3 swh/lister/gogs/tests/data/https_try.gogs.io/repos_page4 swh/lister/golang/__init__.py swh/lister/golang/lister.py swh/lister/golang/tasks.py swh/lister/golang/tests/__init__.py swh/lister/golang/tests/test_lister.py swh/lister/golang/tests/test_tasks.py swh/lister/golang/tests/data/page-1.txt swh/lister/golang/tests/data/page-2.txt swh/lister/golang/tests/data/page-3.txt swh/lister/hackage/__init__.py swh/lister/hackage/lister.py swh/lister/hackage/tasks.py 
swh/lister/hackage/tests/__init__.py swh/lister/hackage/tests/test_lister.py swh/lister/hackage/tests/test_tasks.py swh/lister/hackage/tests/data/https_fake49.haskell.org/packages_search_0 swh/lister/hackage/tests/data/https_fake51.haskell.org/packages_search_0 swh/lister/hackage/tests/data/https_fake51.haskell.org/packages_search_1 swh/lister/hackage/tests/data/https_hackage.haskell.org/packages_search_0 swh/lister/hackage/tests/data/https_hackage.haskell.org/packages_search_1 swh/lister/hackage/tests/data/https_hackage.haskell.org/packages_search_2 swh/lister/launchpad/__init__.py swh/lister/launchpad/lister.py swh/lister/launchpad/tasks.py swh/lister/launchpad/tests/__init__.py swh/lister/launchpad/tests/conftest.py swh/lister/launchpad/tests/test_lister.py swh/lister/launchpad/tests/test_tasks.py swh/lister/launchpad/tests/data/launchpad_bzr_response.json swh/lister/launchpad/tests/data/launchpad_response1.json swh/lister/launchpad/tests/data/launchpad_response2.json swh/lister/maven/README.md swh/lister/maven/__init__.py swh/lister/maven/lister.py swh/lister/maven/tasks.py swh/lister/maven/tests/__init__.py swh/lister/maven/tests/test_lister.py swh/lister/maven/tests/test_tasks.py swh/lister/maven/tests/data/citrus-parent-3.0.7.pom +swh/lister/maven/tests/data/sprova4j-0.1.0.invalidurl.pom swh/lister/maven/tests/data/sprova4j-0.1.0.malformed.pom swh/lister/maven/tests/data/http_indexes/export_full.fld swh/lister/maven/tests/data/http_indexes/export_incr_first.fld swh/lister/maven/tests/data/http_indexes/export_null_mtime.fld swh/lister/maven/tests/data/https_api.github.com/repos_aldialimucaj_sprova4j swh/lister/maven/tests/data/https_api.github.com/repos_arangodb-community_arangodb-graphql-java swh/lister/maven/tests/data/https_api.github.com/repos_webx_citrus swh/lister/maven/tests/data/https_repo1.maven.org/maven2_al_aldi_sprova4j_0.1.0_sprova4j-0.1.0.pom swh/lister/maven/tests/data/https_repo1.maven.org/maven2_al_aldi_sprova4j_0.1.1_sprova4j-0.1.1.pom swh/lister/maven/tests/data/https_repo1.maven.org/maven2_com_arangodb_arangodb-graphql_1.2_arangodb-graphql-1.2.pom swh/lister/nixguix/__init__.py swh/lister/nixguix/lister.py swh/lister/nixguix/tasks.py swh/lister/nixguix/tests/__init__.py swh/lister/nixguix/tests/test_lister.py swh/lister/nixguix/tests/test_tasks.py swh/lister/nixguix/tests/data/sources-failure.json swh/lister/nixguix/tests/data/sources-success.json swh/lister/npm/__init__.py swh/lister/npm/lister.py swh/lister/npm/tasks.py swh/lister/npm/tests/test_lister.py swh/lister/npm/tests/test_tasks.py swh/lister/npm/tests/data/npm_full_page1.json swh/lister/npm/tests/data/npm_full_page2.json swh/lister/npm/tests/data/npm_incremental_page1.json swh/lister/npm/tests/data/npm_incremental_page2.json swh/lister/nuget/__init__.py swh/lister/nuget/lister.py swh/lister/nuget/tasks.py swh/lister/nuget/tests/__init__.py swh/lister/nuget/tests/test_lister.py swh/lister/nuget/tests/test_tasks.py swh/lister/nuget/tests/data/https_api.nuget.org/v3-flatcontainer_intersoft.crosslight.logging.entityframework_5.0.5000.1235-experimental_intersoft.crosslight.logging.entityframework.nuspec swh/lister/nuget/tests/data/https_api.nuget.org/v3-flatcontainer_sil.core.desktop_10.0.1-beta0012_sil.core.desktop.nuspec swh/lister/nuget/tests/data/https_api.nuget.org/v3_catalog0_data_2022.09.23.08.07.54_sil.core.desktop.10.0.1-beta0012.json swh/lister/nuget/tests/data/https_api.nuget.org/v3_catalog0_data_2022.09.23.09.10.26_intersoft.crosslight.logging.entityframework.5.0.5000.1235-experimental.json 
swh/lister/nuget/tests/data/https_api.nuget.org/v3_catalog0_index.json swh/lister/nuget/tests/data/https_api.nuget.org/v3_catalog0_page11702.json swh/lister/nuget/tests/data/https_api.nuget.org/v3_catalog0_page16958.json swh/lister/opam/__init__.py swh/lister/opam/lister.py swh/lister/opam/tasks.py swh/lister/opam/tests/__init__.py swh/lister/opam/tests/test_lister.py swh/lister/opam/tests/test_tasks.py swh/lister/opam/tests/data/fake_opam_repo/repo swh/lister/opam/tests/data/fake_opam_repo/version swh/lister/opam/tests/data/fake_opam_repo/packages/agrid/agrid.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.4/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.5/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.6/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/ocb/ocb.0.1/opam swh/lister/packagist/__init__.py swh/lister/packagist/lister.py swh/lister/packagist/tasks.py swh/lister/packagist/tests/__init__.py swh/lister/packagist/tests/test_lister.py swh/lister/packagist/tests/test_tasks.py swh/lister/packagist/tests/data/den1n_contextmenu.json swh/lister/packagist/tests/data/idevlab_essential.json swh/lister/packagist/tests/data/ljjackson_linnworks.json swh/lister/packagist/tests/data/lky_wx_article.json +swh/lister/packagist/tests/data/payrix_payrix-php.json swh/lister/packagist/tests/data/spryker-eco_computop-api.json +swh/lister/packagist/tests/data/with_invalid_url.json swh/lister/packagist/tests/data/ycms_module-main.json swh/lister/packagist/tests/data/https_api.github.com/repos_gitlky_wx_article swh/lister/packagist/tests/data/https_api.github.com/repos_spryker-eco_computop-api swh/lister/packagist/tests/data/https_api.github.com/repos_ycms_module-main swh/lister/phabricator/__init__.py swh/lister/phabricator/lister.py swh/lister/phabricator/tasks.py swh/lister/phabricator/tests/__init__.py swh/lister/phabricator/tests/test_lister.py swh/lister/phabricator/tests/test_tasks.py swh/lister/phabricator/tests/data/__init__.py swh/lister/phabricator/tests/data/phabricator_api_repositories_page1.json swh/lister/phabricator/tests/data/phabricator_api_repositories_page2.json swh/lister/pubdev/__init__.py swh/lister/pubdev/lister.py swh/lister/pubdev/tasks.py swh/lister/pubdev/tests/__init__.py swh/lister/pubdev/tests/test_lister.py swh/lister/pubdev/tests/test_tasks.py swh/lister/pubdev/tests/data/https_pub.dev/api_package-names swh/lister/pubdev/tests/data/https_pub.dev/api_packages_Autolinker swh/lister/pubdev/tests/data/https_pub.dev/api_packages_Babylon swh/lister/puppet/__init__.py swh/lister/puppet/lister.py swh/lister/puppet/tasks.py swh/lister/puppet/tests/__init__.py swh/lister/puppet/tests/test_lister.py swh/lister/puppet/tests/test_tasks.py swh/lister/puppet/tests/data/https_forgeapi.puppet.com/v3_modules,limit=100 swh/lister/puppet/tests/data/https_forgeapi.puppet.com/v3_modules,limit=100,offset=100 swh/lister/pypi/__init__.py swh/lister/pypi/lister.py swh/lister/pypi/tasks.py swh/lister/pypi/tests/__init__.py 
swh/lister/pypi/tests/test_lister.py swh/lister/pypi/tests/test_tasks.py swh/lister/rubygems/__init__.py swh/lister/rubygems/lister.py swh/lister/rubygems/tasks.py swh/lister/rubygems/tests/__init__.py swh/lister/rubygems/tests/test_lister.py swh/lister/rubygems/tests/test_tasks.py swh/lister/rubygems/tests/data/rubygems_dumps.xml swh/lister/rubygems/tests/data/rubygems_pgsql_dump.tar swh/lister/rubygems/tests/data/small_rubygems_dump.sh swh/lister/sourceforge/__init__.py swh/lister/sourceforge/lister.py swh/lister/sourceforge/tasks.py swh/lister/sourceforge/tests/__init__.py swh/lister/sourceforge/tests/test_lister.py swh/lister/sourceforge/tests/test_tasks.py swh/lister/sourceforge/tests/data/aaron.html swh/lister/sourceforge/tests/data/aaron.json swh/lister/sourceforge/tests/data/adobexmp.json swh/lister/sourceforge/tests/data/backapps-website.json swh/lister/sourceforge/tests/data/backapps.json swh/lister/sourceforge/tests/data/main-sitemap.xml swh/lister/sourceforge/tests/data/mojunk.json swh/lister/sourceforge/tests/data/mramm.json swh/lister/sourceforge/tests/data/ocaml-lpd.html swh/lister/sourceforge/tests/data/ocaml-lpd.json swh/lister/sourceforge/tests/data/os3dmodels.json swh/lister/sourceforge/tests/data/random-mercurial.json swh/lister/sourceforge/tests/data/subsitemap-0.xml swh/lister/sourceforge/tests/data/subsitemap-1.xml swh/lister/sourceforge/tests/data/t12eksandbox.html swh/lister/sourceforge/tests/data/t12eksandbox.json swh/lister/tests/__init__.py swh/lister/tests/test_cli.py swh/lister/tests/test_pattern.py swh/lister/tests/test_utils.py swh/lister/tuleap/__init__.py swh/lister/tuleap/lister.py swh/lister/tuleap/tasks.py swh/lister/tuleap/tests/__init__.py swh/lister/tuleap/tests/test_lister.py swh/lister/tuleap/tests/test_tasks.py swh/lister/tuleap/tests/data/https_tuleap.net/projects swh/lister/tuleap/tests/data/https_tuleap.net/repo_1 swh/lister/tuleap/tests/data/https_tuleap.net/repo_2 swh/lister/tuleap/tests/data/https_tuleap.net/repo_3 \ No newline at end of file diff --git a/swh/lister/maven/tests/data/sprova4j-0.1.0.invalidurl.pom b/swh/lister/maven/tests/data/sprova4j-0.1.0.invalidurl.pom new file mode 100644 index 0000000..28284e6 --- /dev/null +++ b/swh/lister/maven/tests/data/sprova4j-0.1.0.invalidurl.pom @@ -0,0 +1,30 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0">
+  <modelVersion>4.0.0</modelVersion>
+  <groupId>al.aldi</groupId>
+  <artifactId>sprova4j</artifactId>
+  <version>0.1.0</version>
+  <name>sprova4j</name>
+  <description>Java client for Sprova Test Management</description>
+  <url>https://github.com/aldialimucaj/sprova4j</url>
+  <inceptionYear>2018</inceptionYear>
+  <licenses>
+    <license>
+      <name>The Apache Software License, Version 2.0</name>
+      <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
+      <distribution>repo</distribution>
+    </license>
+  </licenses>
+  <developers>
+    <developer>
+      <id>aldi</id>
+      <name>Aldi Alimucaj</name>
+      <email>aldi.alimucaj@gmail.com</email>
+    </developer>
+  </developers>
+  <scm>
+    <connection>scm:git@github.com/aldialimucaj/sprova4j.git</connection>
+    <url>git@github.com/aldialimucaj/sprova4j</url>
+  </scm>
+</project>
diff --git a/swh/lister/maven/tests/test_lister.py b/swh/lister/maven/tests/test_lister.py index 9bacd4e..18cde65 100644 --- a/swh/lister/maven/tests/test_lister.py +++ b/swh/lister/maven/tests/test_lister.py @@ -1,348 +1,395 @@ # Copyright (C) 2021-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from pathlib import Path import iso8601 import pytest import requests from swh.lister.maven.lister import MavenLister MVN_URL = "https://repo1.maven.org/maven2/" # main maven repo url INDEX_URL = "http://indexes/export.fld" # index directory url URL_POM_1 = MVN_URL + "al/aldi/sprova4j/0.1.0/sprova4j-0.1.0.pom" URL_POM_2 = MVN_URL + "al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom"
URL_POM_3 = MVN_URL + "com/arangodb/arangodb-graphql/1.2/arangodb-graphql-1.2.pom" USER_REPO0 = "aldialimucaj/sprova4j" GIT_REPO_URL0_HTTPS = f"https://github.com/{USER_REPO0}" GIT_REPO_URL0_API = f"https://api.github.com/repos/{USER_REPO0}" ORIGIN_GIT = GIT_REPO_URL0_HTTPS USER_REPO1 = "ArangoDB-Community/arangodb-graphql-java" GIT_REPO_URL1_HTTPS = f"https://github.com/{USER_REPO1}" GIT_REPO_URL1_GIT = f"git://github.com/{USER_REPO1}.git" GIT_REPO_URL1_API = f"https://api.github.com/repos/{USER_REPO1}" ORIGIN_GIT_INCR = GIT_REPO_URL1_HTTPS USER_REPO2 = "webx/citrus" GIT_REPO_URL2_HTTPS = f"https://github.com/{USER_REPO2}" GIT_REPO_URL2_API = f"https://api.github.com/repos/{USER_REPO2}" ORIGIN_SRC = MVN_URL + "al/aldi/sprova4j" LIST_SRC_DATA = ( { "type": "maven", "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j" + "/0.1.0/sprova4j-0.1.0-sources.jar", "time": "2021-07-12T17:06:59+00:00", "gid": "al.aldi", "aid": "sprova4j", "version": "0.1.0", "base_url": MVN_URL, }, { "type": "maven", "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j" + "/0.1.1/sprova4j-0.1.1-sources.jar", "time": "2021-07-12T17:37:05+00:00", "gid": "al.aldi", "aid": "sprova4j", "version": "0.1.1", "base_url": MVN_URL, }, ) @pytest.fixture def maven_index_full(datadir) -> bytes: return Path(datadir, "http_indexes", "export_full.fld").read_bytes() @pytest.fixture def maven_index_incr_first(datadir) -> bytes: return Path(datadir, "http_indexes", "export_incr_first.fld").read_bytes() @pytest.fixture def maven_index_null_mtime(datadir) -> bytes: return Path(datadir, "http_indexes", "export_null_mtime.fld").read_bytes() @pytest.fixture(autouse=True) def network_requests_mock(requests_mock, requests_mock_datadir, maven_index_full): requests_mock.get(INDEX_URL, content=maven_index_full) @pytest.fixture(autouse=True) def retry_sleep_mock(mocker): mocker.patch.object(MavenLister.http_request.retry, "sleep") def test_maven_full_listing(swh_scheduler): """Covers full listing of multiple pages, checking page results, listed origins, and statelessness.""" # Run the lister. lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=False, ) stats = lister.run() # Start test checks. assert stats.pages == 5 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results origin_urls = [origin.url for origin in scheduler_origins] # 2 git origins + 1 maven origin with 2 releases (one per jar) assert set(origin_urls) == {ORIGIN_GIT, ORIGIN_GIT_INCR, ORIGIN_SRC} assert len(set(origin_urls)) == len(origin_urls) for origin in scheduler_origins: if origin.visit_type == "maven": for src in LIST_SRC_DATA: last_update_src = iso8601.parse_date(src["time"]) assert last_update_src <= origin.last_update assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == -1 assert scheduler_state.last_seen_pom == -1 def test_maven_full_listing_malformed( swh_scheduler, requests_mock, datadir, ): """Covers full listing of multiple pages, checking page results with a malformed scm entry in pom.""" lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=False, ) # Set up test. requests_mock.get( URL_POM_1, content=Path(datadir, "sprova4j-0.1.0.malformed.pom").read_bytes() ) # Then run the lister. stats = lister.run() # Start test checks.
assert stats.pages == 5 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results origin_urls = [origin.url for origin in scheduler_origins] # 2 git origins + 1 maven origin with 2 releases (one per jar) assert set(origin_urls) == {ORIGIN_GIT, ORIGIN_GIT_INCR, ORIGIN_SRC} assert len(origin_urls) == len(set(origin_urls)) for origin in scheduler_origins: if origin.visit_type == "maven": for src in LIST_SRC_DATA: last_update_src = iso8601.parse_date(src["time"]) assert last_update_src <= origin.last_update assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == -1 assert scheduler_state.last_seen_pom == -1 +def test_maven_ignore_invalid_url( + swh_scheduler, + requests_mock, + datadir, +): + """Covers full listing of multiple pages, checking that a pom whose scm + entry contains an invalid url is ignored.""" + + lister = MavenLister( + scheduler=swh_scheduler, + url=MVN_URL, + instance="maven.org", + index_url=INDEX_URL, + incremental=False, + ) + + # Set up test. + requests_mock.get( + URL_POM_1, content=Path(datadir, "sprova4j-0.1.0.invalidurl.pom").read_bytes() + ) + + # Then run the lister. + stats = lister.run() + + # Start test checks. + assert stats.pages == 5 + + scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results + origin_urls = [origin.url for origin in scheduler_origins] + + # 1 git origin (the other is ignored) + 1 maven origin with 2 releases (one per jar) + assert set(origin_urls) == {ORIGIN_GIT_INCR, ORIGIN_SRC} + assert len(origin_urls) == len(set(origin_urls)) + + for origin in scheduler_origins: + if origin.visit_type == "maven": + for src in LIST_SRC_DATA: + last_update_src = iso8601.parse_date(src["time"]) + assert last_update_src <= origin.last_update + assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) + + scheduler_state = lister.get_state_from_scheduler() + assert scheduler_state is not None + assert scheduler_state.last_seen_doc == -1 + assert scheduler_state.last_seen_pom == -1 + + def test_maven_incremental_listing( swh_scheduler, requests_mock, maven_index_full, maven_index_incr_first, ): """Covers full listing of multiple pages, checking page results and listed origins, with a second updated run for statefulness.""" lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=True, ) # Set up test. requests_mock.get(INDEX_URL, content=maven_index_incr_first) # Then run the lister. stats = lister.run() # Start test checks.
assert lister.incremental assert lister.updated assert stats.pages == 2 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results origin_urls = [origin.url for origin in scheduler_origins] # 1 git origin + 1 maven origin with 1 release (one per jar) assert set(origin_urls) == {ORIGIN_GIT, ORIGIN_SRC} assert len(origin_urls) == len(set(origin_urls)) for origin in scheduler_origins: if origin.visit_type == "maven": last_update_src = iso8601.parse_date(LIST_SRC_DATA[0]["time"]) assert last_update_src == origin.last_update assert origin.extra_loader_arguments["artifacts"] == [LIST_SRC_DATA[0]] # Second execution of the lister, incremental mode lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=True, ) scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == 1 assert scheduler_state.last_seen_pom == 1 # Set up test. requests_mock.get(INDEX_URL, content=maven_index_full) # Then run the lister. stats = lister.run() # Start test checks. assert lister.incremental assert lister.updated assert stats.pages == 4 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results origin_urls = [origin.url for origin in scheduler_origins] assert set(origin_urls) == {ORIGIN_SRC, ORIGIN_GIT, ORIGIN_GIT_INCR} assert len(origin_urls) == len(set(origin_urls)) for origin in scheduler_origins: if origin.visit_type == "maven": for src in LIST_SRC_DATA: last_update_src = iso8601.parse_date(src["time"]) assert last_update_src <= origin.last_update assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == 4 assert scheduler_state.last_seen_pom == 4 @pytest.mark.parametrize("http_code", [400, 404, 500, 502]) def test_maven_list_http_error_on_index_read(swh_scheduler, requests_mock, http_code): """should stop listing if the lister fails to retrieve the main index url.""" lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL) requests_mock.get(INDEX_URL, status_code=http_code) with pytest.raises(requests.HTTPError): # listing cannot continue, so stop lister.run() scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 0 @pytest.mark.parametrize("http_code", [400, 404, 500, 502]) def test_maven_list_http_error_artifacts( swh_scheduler, requests_mock, http_code, ): """should continue listing when failing to retrieve artifacts.""" # Test failure of artifact retrieval. requests_mock.get(URL_POM_1, status_code=http_code) lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL) # on artifacts though, errors are caught and listing continues lister.run() # If the maven_index_full step succeeded but not the get_pom step, # then we get only one maven-jar origin and one git origin. scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results origin_urls = [origin.url for origin in scheduler_origins] assert set(origin_urls) == {ORIGIN_SRC, ORIGIN_GIT_INCR} assert len(origin_urls) == len(set(origin_urls)) def test_maven_lister_null_mtime(swh_scheduler, requests_mock, maven_index_null_mtime): requests_mock.get(INDEX_URL, content=maven_index_null_mtime) # Run the lister.
lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=False, ) stats = lister.run() # Start test checks. assert stats.pages == 1 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 1 assert scheduler_origins[0].last_update is None def test_maven_list_pom_bad_encoding(swh_scheduler, requests_mock): """should continue listing when failing to decode pom file.""" # Test failure of pom parsing by re-encoding a UTF-8 pom file to an unexpected encoding requests_mock.get( URL_POM_1, content=requests.get(URL_POM_1).content.decode("utf-8").encode("utf-32"), ) lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL) lister.run() # If the maven_index_full step succeeded but not the pom parsing step, # then we get only one maven-jar origin and one git origin. scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 2 def test_maven_list_pom_multi_byte_encoding(swh_scheduler, requests_mock, datadir): """should parse POM file with multi-byte encoding.""" # replace pom file with a multi-byte encoding one requests_mock.get( URL_POM_1, content=Path(datadir, "citrus-parent-3.0.7.pom").read_bytes() ) lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL) lister.run() scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 3 diff --git a/swh/lister/nixguix/lister.py b/swh/lister/nixguix/lister.py index 9ebe82e..3e410aa 100644 --- a/swh/lister/nixguix/lister.py +++ b/swh/lister/nixguix/lister.py @@ -1,566 +1,566 @@ # Copyright (C) 2020-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """NixGuix lister definition. This lists artifacts out of Guix or Nixpkgs manifests. Artifacts can be of types: - upstream git repository (NixOS/nixpkgs, Guix) - VCS repositories (svn, git, hg, ...) - unique file - unique tarball """ import base64 import binascii from dataclasses import dataclass from enum import Enum import logging from pathlib import Path import random import re from typing import Any, Dict, Iterator, List, Optional, Tuple, Union from urllib.parse import parse_qsl, urlparse import requests from requests.exceptions import ConnectionError, InvalidSchema, SSLError from swh.core.tarball import MIMETYPE_TO_ARCHIVE_FORMAT from swh.lister import TARBALL_EXTENSIONS from swh.lister.pattern import CredentialsType, StatelessLister from swh.scheduler.model import ListedOrigin logger = logging.getLogger(__name__) # By default, ignore binary files and archives containing binaries DEFAULT_EXTENSIONS_TO_IGNORE = [ "AppImage", "bin", "exe", "iso", "linux64", "msi", "png", "dic", "deb", "rpm", ] class ArtifactNatureUndetected(ValueError): """Raised when a remote artifact's nature (tarball, file) cannot be detected.""" pass class ArtifactNatureMistyped(ValueError): """Raised when a remote artifact is neither a tarball nor a file. Errors of this type are probably a misconfiguration in the manifest generation that badly typed a vcs repository.
""" pass class ArtifactWithoutExtension(ValueError): """Raised when an artifact nature cannot be determined by its name.""" pass class ChecksumsComputation(Enum): """The possible artifact types listed out of the manifest.""" STANDARD = "standard" """Standard checksums (e.g. sha1, sha256, ...) on the tarball or file.""" NAR = "nar" """The hash is computed over the NAR archive dump of the output (e.g. uncompressed directory.)""" MAPPING_CHECKSUMS_COMPUTATION = { "flat": ChecksumsComputation.STANDARD, "recursive": ChecksumsComputation.NAR, } """Mapping between the outputHashMode from the manifest and how to compute checksums.""" @dataclass class Artifact: """Metadata information on Remote Artifact with url (tarball or file).""" origin: str """Canonical url retrieve the tarball artifact.""" visit_type: str """Either 'tar' or 'file' """ fallback_urls: List[str] """List of urls to retrieve tarball artifact if canonical url no longer works.""" checksums: Dict[str, str] """Integrity hash converted into a checksum dict.""" checksums_computation: ChecksumsComputation """Checksums computation mode to provide to loaders (e.g. nar, standard, ...)""" @dataclass class VCS: """Metadata information on VCS.""" origin: str """Origin url of the vcs""" type: str """Type of (d)vcs, e.g. svn, git, hg, ...""" ref: Optional[str] = None """Reference either a svn commit id, a git commit, ...""" class ArtifactType(Enum): """The possible artifact types listed out of the manifest.""" ARTIFACT = "artifact" VCS = "vcs" PageResult = Tuple[ArtifactType, Union[Artifact, VCS]] VCS_SUPPORTED = ("git", "svn", "hg") # Rough approximation of what we can find of mimetypes for tarballs "out there" POSSIBLE_TARBALL_MIMETYPES = tuple(MIMETYPE_TO_ARCHIVE_FORMAT.keys()) PATTERN_VERSION = re.compile(r"(v*[0-9]+[.])([0-9]+[.]*)+") def url_endswith( urlparsed, extensions: List[str], raise_when_no_extension: bool = True ) -> bool: """Determine whether urlparsed ends with one of the extensions passed as parameter. This also account for the edge case of a filename with only a version as name (so no extension in the end.) Raises: ArtifactWithoutExtension in case no extension is available and raise_when_no_extension is True (the default) """ paths = [Path(p) for (_, p) in [("_", urlparsed.path)] + parse_qsl(urlparsed.query)] if raise_when_no_extension and not any(path.suffix != "" for path in paths): raise ArtifactWithoutExtension match = any(path.suffix.endswith(tuple(extensions)) for path in paths) if match: return match # Some false negative can happen (e.g. https:///path/0.1.5)), so make sure # to catch those name = Path(urlparsed.path).name if not PATTERN_VERSION.match(name): return match if raise_when_no_extension: raise ArtifactWithoutExtension return False def is_tarball(urls: List[str], request: Optional[Any] = None) -> Tuple[bool, str]: """Determine whether a list of files actually are tarballs or simple files. When this cannot be answered simply out of the url, when request is provided, this executes a HTTP `HEAD` query on the url to determine the information. If request is not provided, this raises an ArtifactNatureUndetected exception. Args: urls: name of the remote files for which the extension needs to be checked. Raises: ArtifactNatureUndetected when the artifact's nature cannot be detected out of its url ArtifactNatureMistyped when the artifact is not a tarball nor a file. It's up to the caller to do what's right with it. Returns: A tuple (bool, url). The boolean represents whether the url is an archive or not. 
The second element is the actual url to use as origin; when the nature cannot be determined from the url alone, it is the url obtained after issuing the fallback HEAD request. """ def _is_tarball(url): """Determine out of an extension whether url is a tarball. Raises: ArtifactWithoutExtension in case no extension is available """ urlparsed = urlparse(url) if urlparsed.scheme not in ("http", "https", "ftp"): raise ArtifactNatureMistyped(f"Mistyped artifact '{url}'") return url_endswith(urlparsed, TARBALL_EXTENSIONS) index = random.randrange(len(urls)) url = urls[index] try: return _is_tarball(url), urls[0] except ArtifactWithoutExtension: if request is None: raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) logger.warning( "Cannot detect extension for <%s>. Fallback to http head query", url, ) try: response = request.head(url) except (InvalidSchema, SSLError, ConnectionError): raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) if not response.ok or response.status_code == 404: raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) location = response.headers.get("Location") if location: # It's not always present logger.debug("Location: %s", location) try: # FIXME: location is also returned as it's considered the true origin, # true enough? return _is_tarball(location), location except ArtifactWithoutExtension: logger.warning( "Still cannot detect extension through location <%s>...", url, ) origin = urls[0] content_type = response.headers.get("Content-Type") if content_type: logger.debug("Content-Type: %s", content_type) if content_type == "application/json": return False, origin return content_type.startswith(POSSIBLE_TARBALL_MIMETYPES), origin content_disposition = response.headers.get("Content-Disposition") if content_disposition: logger.debug("Content-Disposition: %s", content_disposition) if "filename=" in content_disposition: fields = content_disposition.split("; ") for field in fields: if "filename=" in field: _, filename = field.split("filename=") break return ( url_endswith( urlparse(filename), TARBALL_EXTENSIONS, raise_when_no_extension=False, ), origin, ) raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) VCS_KEYS_MAPPING = { "git": { "ref": "git_ref", "url": "git_url", }, "svn": { "ref": "svn_revision", "url": "svn_url", }, "hg": { "ref": "hg_changeset", "url": "hg_url", }, } class NixGuixLister(StatelessLister[PageResult]): """List Guix or Nix sources out of a public json manifest. This lister can output: - unique tarball (.tar.gz, .tbz2, ...) - vcs repositories (e.g. git, hg, svn) - unique file (.lisp, .py, ...) Note that no `last_update` is available in either manifest. For `url` type artifacts, this tries to determine the artifact's nature, tarball or file. It first tries to determine it out of the "url" extension. In case of no extension, it falls back to querying (HEAD) the url to retrieve the origin out of the `Location` response header, and then checks the extension again. Optionally, when the `extensions_to_ignore` parameter is provided, it extends the default extensions to ignore (`DEFAULT_EXTENSIONS_TO_IGNORE`) with those passed. This can be used to drop further binary files detected in the wild.
""" LISTER_NAME = "nixguix" def __init__( self, scheduler, url: str, origin_upstream: str, instance: Optional[str] = None, credentials: Optional[CredentialsType] = None, # canonicalize urls, can be turned off during docker runs canonicalize: bool = True, extensions_to_ignore: List[str] = [], **kwargs: Any, ): super().__init__( scheduler=scheduler, url=url.rstrip("/"), instance=instance, credentials=credentials, with_github_session=canonicalize, ) # either full fqdn NixOS/nixpkgs or guix repository urls # maybe add an assert on those specific urls? self.origin_upstream = origin_upstream self.extensions_to_ignore = DEFAULT_EXTENSIONS_TO_IGNORE + extensions_to_ignore self.session = requests.Session() def build_artifact( self, artifact_url: str, artifact_type: str, artifact_ref: Optional[str] = None ) -> Optional[Tuple[ArtifactType, VCS]]: """Build a canonicalized vcs artifact when possible.""" origin = ( self.github_session.get_canonical_url(artifact_url) if self.github_session else artifact_url ) if not origin: return None return ArtifactType.VCS, VCS( origin=origin, type=artifact_type, ref=artifact_ref ) def get_pages(self) -> Iterator[PageResult]: """Yield one page per "typed" origin referenced in manifest.""" # fetch and parse the manifest... response = self.http_request(self.url) # ... if any raw_data = response.json() yield ArtifactType.VCS, VCS(origin=self.origin_upstream, type="git") # grep '"type"' guix-sources.json | sort | uniq # "type": false <<<<<<<<< noise # "type": "git", # "type": "hg", # "type": "no-origin", <<<<<<<<< noise # "type": "svn", # "type": "url", # grep '"type"' nixpkgs-sources-unstable.json | sort | uniq # "type": "url", sources = raw_data["sources"] random.shuffle(sources) for artifact in sources: artifact_type = artifact["type"] if artifact_type in VCS_SUPPORTED: plain_url = artifact[VCS_KEYS_MAPPING[artifact_type]["url"]] plain_ref = artifact[VCS_KEYS_MAPPING[artifact_type]["ref"]] built_artifact = self.build_artifact( plain_url, artifact_type, plain_ref ) if not built_artifact: continue yield built_artifact elif artifact_type == "url": # It's either a tarball or a file origin_urls = artifact.get("urls") if not origin_urls: # Nothing to fetch logger.warning("Skipping url <%s>: empty artifact", artifact) continue assert origin_urls is not None # Deal with urls with empty scheme (basic fallback to http) urls = [] for url in origin_urls: urlparsed = urlparse(url) - if urlparsed.scheme == "": + if urlparsed.scheme == "" and not re.match(r"^\w+@[^/]+:", url): logger.warning("Missing scheme for <%s>: fallback to http", url) fixed_url = f"http://{url}" else: fixed_url = url urls.append(fixed_url) origin, *fallback_urls = urls if origin.endswith(".git"): built_artifact = self.build_artifact(origin, "git") if not built_artifact: continue yield built_artifact continue outputHash = artifact.get("outputHash") integrity = artifact.get("integrity") if integrity is None and outputHash is None: logger.warning( "Skipping url <%s>: missing integrity and outputHash field", origin, ) continue # Falls back to outputHash field if integrity is missing if integrity is None and outputHash: # We'll deal with outputHash as integrity field integrity = outputHash try: is_tar, origin = is_tarball(urls, self.session) except ArtifactNatureMistyped: logger.warning( "Mistyped url <%s>: trying to deal with it properly", origin ) urlparsed = urlparse(origin) artifact_type = urlparsed.scheme if artifact_type in VCS_SUPPORTED: built_artifact = self.build_artifact(origin, artifact_type) if not 
built_artifact: continue yield built_artifact else: logger.warning( "Skipping url <%s>: undetected remote artifact type", origin ) continue except ArtifactNatureUndetected: logger.warning( "Skipping url <%s>: undetected remote artifact type", origin ) continue # Determine the content checksum stored in the integrity field and # convert into a dict of checksums. This only parses the # `hash-expression` (hash-) as defined in # https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute try: chksum_algo, chksum_b64 = integrity.split("-") checksums: Dict[str, str] = { chksum_algo: base64.decodebytes(chksum_b64.encode()).hex() } except binascii.Error: logger.exception( "Skipping url: <%s>: integrity computation failure for <%s>", url, artifact, ) continue # The 'outputHashMode' attribute determines how the hash is computed. It # must be one of the following two values: # - "flat": (default) The output must be a non-executable regular file. # If it isn’t, the build fails. The hash is simply computed over the # contents of that file (so it’s equal to what Unix commands like # `sha256sum` or `sha1sum` produce). # - "recursive": The hash is computed over the NAR archive dump of the # output (i.e., the result of `nix-store --dump`). In this case, # the output can be anything, including a directory tree. outputHashMode = artifact.get("outputHashMode", "flat") if not is_tar and outputHashMode == "recursive": # T4608: Cannot deal with those properly yet as some can be missing # 'critical' information about how to recompute the hash (e.g. fs # layout, executable bit, ...) logger.warning( "Skipping artifact <%s>: 'file' artifact of type <%s> is" " missing information to properly check its integrity", artifact, artifact_type, ) continue # At this point plenty of heuristics happened and we should have found # the right origin and its nature. # Let's check and filter it out if it is to be ignored (if possible). # Some origin urls may not have extension at this point (e.g # http://git.marmaro.de/?p=mmh;a=snp;h=;sf=tgz), let them through. if url_endswith( urlparse(origin), self.extensions_to_ignore, raise_when_no_extension=False, ): logger.warning( "Skipping artifact <%s>: 'file' artifact of type <%s> is" " ignored due to lister configuration. It should ignore" " origins with extension [%s]", origin, artifact_type, ",".join(self.extensions_to_ignore), ) continue logger.debug("%s: %s", "dir" if is_tar else "cnt", origin) yield ArtifactType.ARTIFACT, Artifact( origin=origin, fallback_urls=fallback_urls, checksums=checksums, checksums_computation=MAPPING_CHECKSUMS_COMPUTATION[outputHashMode], visit_type="directory" if is_tar else "content", ) else: logger.warning( "Skipping artifact <%s>: unsupported type %s", artifact, artifact_type, ) def vcs_to_listed_origin(self, artifact: VCS) -> Iterator[ListedOrigin]: """Given a vcs repository, yield a ListedOrigin.""" assert self.lister_obj.id is not None # FIXME: What to do with the "ref" (e.g. git/hg/svn commit, ...) 
yield ListedOrigin( lister_id=self.lister_obj.id, url=artifact.origin, visit_type=artifact.type, ) def artifact_to_listed_origin(self, artifact: Artifact) -> Iterator[ListedOrigin]: """Given an artifact (tarball, file), yield one ListedOrigin.""" assert self.lister_obj.id is not None yield ListedOrigin( lister_id=self.lister_obj.id, url=artifact.origin, visit_type=artifact.visit_type, extra_loader_arguments={ "checksums": artifact.checksums, "checksums_computation": artifact.checksums_computation.value, "fallback_urls": artifact.fallback_urls, }, ) def get_origins_from_page( self, artifact_tuple: PageResult ) -> Iterator[ListedOrigin]: """Given an artifact tuple (type, artifact), yield a ListedOrigin.""" artifact_type, artifact = artifact_tuple mapping_type_fn = getattr(self, f"{artifact_type.value}_to_listed_origin") yield from mapping_type_fn(artifact) diff --git a/swh/lister/nixguix/tests/data/sources-failure.json b/swh/lister/nixguix/tests/data/sources-failure.json index 237a018..86b34a8 100644 --- a/swh/lister/nixguix/tests/data/sources-failure.json +++ b/swh/lister/nixguix/tests/data/sources-failure.json @@ -1,181 +1,191 @@ { "sources": [ {"type": "git", "git_url": "", "git_ref": ""}, {"type": false}, {"type": "no-origin"}, {"type": "url", "urls": []}, { "type": "url", "urls": ["https://crates.io/api/v1/0.1.5/no-extension-and-head-404-so-skipped"], "integrity": "sha256-HW6jxFlbljY8E5Q0l9s0r0Rg+0dKlcQ/REatNBuMl4U=" }, { "type": "url", "urls": [ "https://example.org/another-file-no-integrity-so-skipped.txt" ] }, { "type": "url", "urls": [ "ftp://ftp.ourproject.org/file-with-no-extension" ], "integrity": "sha256-bss09x9yOnuW+Q5BHHjf8nNcCNxCKMdl9/2/jKSFcrQ=" }, { "type": "url", "urls": [ "https://git-tails.immerda.ch/onioncircuits" ], "integrity": "sha256-lV3xiWUZmSnt4LW0ni/sUyC/bbtaxkTzvFLFtJKLuI4=" }, { "outputHash": "sha256-9uF0fYl4Zz/Ia2UKx7CBi8ZU8jfWoBfy2QSgTSwXo5A", "outputHashAlgo": null, "outputHashMode": "recursive", "type": "url", "urls": [ "https://github.com/figiel/hosts/archive/v1.0.0.tar.gz" ], "inferredFetcher": "fetchzip" }, { "outputHash": "0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522", "outputHashAlgo": "sha256", "outputHashMode": "recursive", "type": "url", "urls": [ "https://www.unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt" ], "integrity": "sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg=", "inferredFetcher": "unclassified" }, { "type": "url", "urls": [ "unknown://example.org/wrong-scheme-so-skipped.txt" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" }, + { + "type": "url", + "urls": [ "ssh://git@example.org:wrong-scheme-so-skipped.txt" ], + "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" + }, + { + "type": "url", + "urls": [ "git@example.org:git-pseudourl/so-skipped" ], + "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" + }, { "type": "url", "urls": [ "https://code.9front.org/hg/plan9front" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" }, { "outputHash": "sha256-IgPqUEDpaIuGoaGoH2GCEzh3KxF3pkJC3VjTYXwSiQE=", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://github.com/KSP-CKAN/CKAN/releases/download/v1.30.4/ckan.exe" ], "integrity": "sha256-IgPqUEDpaIuGoaGoH2GCEzh3KxF3pkJC3VjTYXwSiQE=", "inferredFetcher": "unclassified" }, { "outputHash": "sha256-ezJN/t0iNk0haMLPioEQSNXU4ugVeJe44GNVGd+cOF4=", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ 
"https://github.com/johannesjo/super-productivity/releases/download/v7.5.1/superProductivity-7.5.1.AppImage" ], "integrity": "sha256-ezJN/t0iNk0haMLPioEQSNXU4ugVeJe44GNVGd+cOF4=", "inferredFetcher": "unclassified" }, { "outputHash": "19ir6x4c01825hpx2wbbcxkk70ymwbw4j03v8b2xc13ayylwzx0r", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "http://gorilla.dp100.com/downloads/gorilla1537_64.bin" ], "integrity": "sha256-GfTPqfdqBNbFQnsASfji1YMzZ2drcdEvLAIFwEg3OaY=", "inferredFetcher": "unclassified" }, { "outputHash": "1zj53xybygps66m3v5kzi61vqy987zp6bfgk0qin9pja68qq75vx", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.196-1/virtio-win.iso" ], "integrity": "sha256-fZeDMTJK3mQjBvO5Ze4/KHm8g4l/lj2qMfo+v3wfRf4=", "inferredFetcher": "unclassified" }, { "outputHash": "02qgsj4h4zrjxkcclx7clsqbqd699kg0dq1xxa9hbj3vfnddjv1f", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://www.pjrc.com/teensy/td_153/TeensyduinoInstall.linux64" ], "integrity": "sha256-LmzZmnV7yAWT6j3gBt5MyTS8sKbsdMrY7DJ/AonUDws=", "inferredFetcher": "unclassified" }, { "outputHash": "sha256-24uF87kQWQ9hrb+gAFqZXWE+KZocxz0AVT1w3IEBDjY=", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://dl.winehq.org/wine/wine-mono/6.4.0/wine-mono-6.4.0-x86.msi" ], "integrity": "sha256-24uF87kQWQ9hrb+gAFqZXWE+KZocxz0AVT1w3IEBDjY=", "inferredFetcher": "unclassified" }, { "outputHash": "00y96w9shbbrdbf6xcjlahqd08154kkrxmqraik7qshiwcqpw7p4", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://raw.githubusercontent.com/webtorrent/webtorrent-desktop/v0.21.0/static/linux/share/icons/hicolor/48x48/apps/webtorrent-desktop.png" ], "integrity": "sha256-5B5+MeMRanxmVBnXnuckJSDQMFRUsm7canktqBM3yQM=", "inferredFetcher": "unclassified" }, { "outputHash": "0lw193jr7ldvln5x5z9p21rz1by46h0say9whfcw2kxs9vprd5b3", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "http://xuxen.eus/static/hunspell/eu_ES.dic" ], "integrity": "sha256-Y5WW7066T8GZgzx5pQE0xK/wcxA3/dKLpbvRk+VIgVM=", "inferredFetcher": "unclassified" }, { "outputHash": "0wbhvypdr96a5ddg6kj41dn9sbl49n7pfi2vs762ij82hm2gvwcm", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://www.openprinting.org/download/printdriver/components/lsb3.2/main/RPMS/noarch/openprinting-ppds-postscript-lexmark-20160218-1lsb3.2.noarch.rpm" ], "integrity": "sha256-lfH9RIUCySjM0VtEd49NhC6dbAtETvNaK8qk3K7fcHE=", "inferredFetcher": "unclassified" }, { "outputHash": "01gy84gr0gw5ap7hpy72azaf6hlzac7vxkn5cgad5sfbyzxgjgc9", "outputHashAlgo": "sha256", "outputHashMode": "flat", "type": "url", "urls": [ "https://wire-app.wire.com/linux/debian/pool/main/Wire-3.26.2941_amd64.deb" ], "integrity": "sha256-iT35+vfL6dLUY8XOvg9Tn0Lj1Ffi+AvPVYU/kB9B/gU=", "inferredFetcher": "unclassified" }, { "type": "url", "urls": [ "https://elpa.gnu.org/packages/zones.foobar" ], "integrity": "sha256-YRZc7dI3DjUzoSIp4fIshUyhMXIQ/fPKaKnjeYVa4WI=" } ], "version":"1", "revision":"ab59155c5a38dda7efaceb47c7528578fcf0def4" } diff --git a/swh/lister/nixguix/tests/test_lister.py b/swh/lister/nixguix/tests/test_lister.py index fdb7210..a00a5f6 100644 --- a/swh/lister/nixguix/tests/test_lister.py +++ b/swh/lister/nixguix/tests/test_lister.py @@ -1,381 +1,388 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the 
top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from collections import defaultdict import json import logging from pathlib import Path from typing import Dict from urllib.parse import urlparse import pytest import requests from requests.exceptions import ConnectionError, InvalidSchema, SSLError from swh.lister import TARBALL_EXTENSIONS from swh.lister.nixguix.lister import ( DEFAULT_EXTENSIONS_TO_IGNORE, POSSIBLE_TARBALL_MIMETYPES, ArtifactNatureMistyped, ArtifactNatureUndetected, ArtifactWithoutExtension, NixGuixLister, is_tarball, url_endswith, ) from swh.lister.pattern import ListerStats logger = logging.getLogger(__name__) SOURCES = { "guix": { "repo": "https://git.savannah.gnu.org/cgit/guix.git/", "manifest": "https://guix.gnu.org/sources.json", }, "nixpkgs": { "repo": "https://github.com/NixOS/nixpkgs", "manifest": "https://nix-community.github.io/nixpkgs-swh/sources-unstable.json", }, } def page_response(datadir, instance: str = "success") -> Dict: """Return the sources manifest (out of the test dataset) as a dict, empty if missing""" datapath = Path(datadir, f"sources-{instance}.json") return json.loads(datapath.read_text()) if datapath.exists() else {} @pytest.mark.parametrize( "name,expected_result", [(f"one.{ext}", True) for ext in TARBALL_EXTENSIONS] + [(f"one.{ext}?foo=bar", True) for ext in TARBALL_EXTENSIONS] + [(f"one?p0=1&foo=bar.{ext}", True) for ext in DEFAULT_EXTENSIONS_TO_IGNORE] + [ ("two?file=something.el", False), ("foo?two=two&three=three", False), ("v1.2.3", False), # with raise_when_no_extension is False ("2048-game-20151026.1233", False), ("v2048-game-20151026.1233", False), ], ) def test_url_endswith(name, expected_result): """It should detect whether the url or its query params end with one of the extensions""" urlparsed = urlparse(f"https://example.org/{name}") assert ( url_endswith( urlparsed, TARBALL_EXTENSIONS + DEFAULT_EXTENSIONS_TO_IGNORE, raise_when_no_extension=False, ) is expected_result ) @pytest.mark.parametrize( "name", ["foo?two=two&three=three", "tar.gz/0.1.5", "tar.gz/v10.3.1"] ) def test_url_endswith_raise(name): """It should raise when the tested url has no extension""" urlparsed = urlparse(f"https://example.org/{name}") with pytest.raises(ArtifactWithoutExtension): url_endswith(urlparsed, ["unimportant"]) @pytest.mark.parametrize( "tarballs", [[f"one.{ext}", f"two.{ext}"] for ext in TARBALL_EXTENSIONS] + [[f"one.{ext}?foo=bar"] for ext in TARBALL_EXTENSIONS], ) def test_is_tarball_simple(tarballs): """Simple check on tarball should discriminate between tarball and file""" urls = [f"https://example.org/{tarball}" for tarball in tarballs] is_tar, origin = is_tarball(urls) assert is_tar is True assert origin == urls[0] @pytest.mark.parametrize( "query_param", ["file", "f", "url", "name", "anykeyreally"], ) def test_is_tarball_not_so_simple(query_param): """More involved check on tarball should discriminate between tarball and file""" url = f"https://example.org/download.php?foo=bar&{query_param}=one.tar.gz" is_tar, origin = is_tarball([url]) assert is_tar is True assert origin == url @pytest.mark.parametrize( "files", [ ["abc.lisp"], ["one.abc", "two.bcd"], ["abc.c", "other.c"], ["one.scm?foo=bar", "two.scm?foo=bar"], ["config.nix", "flakes.nix"], ], ) def test_is_tarball_simple_not_tarball(files): """Simple check should detect that plain file artifacts are not tarballs""" urls = [f"http://example.org/{file}" for file in files] is_tar, origin = is_tarball(urls)
assert is_tar is False assert origin == urls[0] def test_is_tarball_complex_with_no_result(requests_mock): """Complex tarball detection without proper information should fail.""" # No extension, this won't immediately detect the nature of the url url = "https://example.org/crates/package/download" urls = [url] with pytest.raises(ArtifactNatureUndetected): is_tarball(urls) # no request parameter, this cannot fall back, so it raises with pytest.raises(ArtifactNatureUndetected): requests_mock.head( url, status_code=404, # not found so cannot detect anything ) is_tarball(urls, requests) with pytest.raises(ArtifactNatureUndetected): requests_mock.head( url, headers={} ) # response ok without headers, cannot detect anything is_tarball(urls, requests) with pytest.raises(ArtifactNatureUndetected): fallback_url = "https://example.org/mirror/crates/package/download" requests_mock.head( url, headers={"location": fallback_url} # still no extension, cannot detect ) is_tarball(urls, requests) with pytest.raises(ArtifactNatureMistyped): is_tarball(["foo://example.org/unsupported-scheme"]) with pytest.raises(ArtifactNatureMistyped): fallback_url = "foo://example.org/unsupported-scheme" requests_mock.head( url, headers={"location": fallback_url} # still no extension, cannot detect ) is_tarball(urls, requests) @pytest.mark.parametrize( "fallback_url, expected_result", [ ("https://example.org/mirror/crates/package/download.tar.gz", True), ("https://example.org/mirror/package/download.lisp", False), ], ) def test_is_tarball_complex_with_location_result( requests_mock, fallback_url, expected_result ): """Complex tarball detection with information should detect artifact nature""" # No extension, this won't immediately detect the nature of the url url = "https://example.org/crates/package/download" urls = [url] # One scenario where the url redirects to a location with a proper extension requests_mock.head(url, headers={"location": fallback_url}) is_tar, origin = is_tarball(urls, requests) assert is_tar == expected_result if is_tar: assert origin == fallback_url @pytest.mark.parametrize( "content_type, expected_result", [("application/json", False), ("application/something", False)] + [(ext, True) for ext in POSSIBLE_TARBALL_MIMETYPES], ) def test_is_tarball_complex_with_content_type_result( requests_mock, content_type, expected_result ): """Complex tarball detection with information should detect artifact nature""" # No extension, this won't immediately detect the nature of the url url = "https://example.org/crates/package/download" urls = [url] # One scenario where the response advertises a tarball-like content type requests_mock.head(url, headers={"Content-Type": content_type}) is_tar, origin = is_tarball(urls, requests) assert is_tar == expected_result if is_tar: assert origin == url def test_lister_nixguix_ok(datadir, swh_scheduler, requests_mock): """NixGuixLister should list all origins per visit type""" url = SOURCES["guix"]["manifest"] origin_upstream = SOURCES["guix"]["repo"] lister = NixGuixLister(swh_scheduler, url=url, origin_upstream=origin_upstream) response = page_response(datadir, "success") requests_mock.get( url, [{"json": response}], ) requests_mock.get( "https://api.github.com/repos/trie/trie", [{"json": {"html_url": "https://github.com/trie/trie.git"}}], ) requests_mock.head( "http://git.marmaro.de/?p=mmh;a=snapshot;h=431604647f89d5aac7b199a7883e98e56e4ccf9e;sf=tgz", headers={"Content-Type": "application/gzip; charset=ISO-8859-1"}, ) requests_mock.head(
"https://crates.io/api/v1/crates/syntect/4.6.0/download", headers={ "Location": "https://static.crates.io/crates/syntect/syntect-4.6.0.crate" }, ) requests_mock.head( "https://codeload.github.com/fifengine/fifechan/tar.gz/0.1.5", headers={ "Content-Type": "application/x-gzip", }, ) requests_mock.head( "https://codeload.github.com/unknown-horizons/unknown-horizons/tar.gz/2019.1", headers={ "Content-Disposition": "attachment; filename=unknown-horizons-2019.1.tar.gz", }, ) requests_mock.head( "https://codeload.github.com/fifengine/fifengine/tar.gz/0.4.2", headers={ "Content-Disposition": "attachment; name=fieldName; " "filename=fifengine-0.4.2.tar.gz; other=stuff", }, ) expected_visit_types = defaultdict(int) # origin upstream is added as origin expected_nb_origins = 1 expected_visit_types["git"] += 1 for artifact in response["sources"]: # Each artifact is considered an origin (even "url" artifacts with mirror urls) expected_nb_origins += 1 artifact_type = artifact["type"] if artifact_type in [ "git", "svn", "hg", ]: expected_visit_types[artifact_type] += 1 elif artifact_type == "url": url = artifact["urls"][0] if url.endswith(".git"): expected_visit_types["git"] += 1 elif url.endswith(".c") or url.endswith(".txt"): expected_visit_types["content"] += 1 elif url.startswith("svn"): # mistyped artifact rendered as vcs nonetheless expected_visit_types["svn"] += 1 elif "crates.io" in url or "codeload.github.com" in url: expected_visit_types["directory"] += 1 else: # tarball artifacts expected_visit_types["directory"] += 1 assert set(expected_visit_types.keys()) == { "content", "git", "svn", "hg", "directory", } listed_result = lister.run() # 1 page read is 1 origin nb_pages = expected_nb_origins assert listed_result == ListerStats(pages=nb_pages, origins=expected_nb_origins) scheduler_origins = lister.scheduler.get_listed_origins( lister.lister_obj.id ).results assert len(scheduler_origins) == expected_nb_origins mapping_visit_types = defaultdict(int) for listed_origin in scheduler_origins: assert listed_origin.visit_type in expected_visit_types # no last update is listed on those manifests assert listed_origin.last_update is None mapping_visit_types[listed_origin.visit_type] += 1 assert dict(mapping_visit_types) == expected_visit_types def test_lister_nixguix_mostly_noop(datadir, swh_scheduler, requests_mock): """NixGuixLister should ignore unsupported or incomplete or to ignore origins""" url = SOURCES["nixpkgs"]["manifest"] origin_upstream = SOURCES["nixpkgs"]["repo"] lister = NixGuixLister( swh_scheduler, url=url, origin_upstream=origin_upstream, extensions_to_ignore=["foobar"], ) response = page_response(datadir, "failure") requests_mock.get( url, [{"json": response}], ) # Amongst artifacts, this url does not allow to determine its nature (tarball, file) # It's ending up doing a http head query which ends up being 404, so it's skipped. 
requests_mock.head( "https://crates.io/api/v1/0.1.5/no-extension-and-head-404-so-skipped", status_code=404, ) # Invalid schema for that origin (and no extension), so skip origin # from its name requests_mock.head( "ftp://ftp.ourproject.org/file-with-no-extension", exc=InvalidSchema, ) # Cannot communicate with an expired cert, so skip origin requests_mock.head( "https://code.9front.org/hg/plan9front", exc=SSLError, ) # Cannot connect to the site, so skip origin requests_mock.head( "https://git-tails.immerda.ch/onioncircuits", exc=ConnectionError, ) listed_result = lister.run() - # only the origin upstream is listed, every other entries are unsupported or incomplete - assert listed_result == ListerStats(pages=1, origins=1) + expected_origins = ["https://github.com/NixOS/nixpkgs"] scheduler_origins = lister.scheduler.get_listed_origins( lister.lister_obj.id ).results - assert len(scheduler_origins) == 1 + scheduler_origin_urls = [orig.url for orig in scheduler_origins] + + assert scheduler_origin_urls == expected_origins + + # only the origin upstream is listed; all other entries are unsupported or incomplete + assert listed_result == ListerStats(pages=1, origins=1), ( + f"Expected origins: {' '.join(expected_origins)}, got: " + f"{' '.join(scheduler_origin_urls)}" + ) assert scheduler_origins[0].visit_type == "git" def test_lister_nixguix_fail(datadir, swh_scheduler, requests_mock): url = SOURCES["nixpkgs"]["manifest"] origin_upstream = SOURCES["nixpkgs"]["repo"] lister = NixGuixLister(swh_scheduler, url=url, origin_upstream=origin_upstream) requests_mock.get( url, status_code=404, ) with pytest.raises(requests.HTTPError): # listing cannot continue, so stop lister.run() scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 0 diff --git a/swh/lister/opam/tests/test_lister.py b/swh/lister/opam/tests/test_lister.py index 26dc753..26526ba 100644 --- a/swh/lister/opam/tests/test_lister.py +++ b/swh/lister/opam/tests/test_lister.py @@ -1,170 +1,188 @@ # Copyright (C) 2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import io import os from tempfile import mkdtemp from unittest.mock import MagicMock import pytest from swh.lister.opam.lister import OpamLister, opam_init module_name = "swh.lister.opam.lister" @pytest.fixture def mock_opam(mocker): """Fixture to bypass the actual opam calls within the test context.""" # inhibits the real `subprocess.call` which prepares the required internal opam # state mock_init = mocker.patch(f"{module_name}.call", return_value=None) # replaces the real Popen with a fake one (list origins command) mocked_popen = MagicMock() mocked_popen.stdout = io.BytesIO(b"bar\nbaz\nfoo\n") mock_open = mocker.patch(f"{module_name}.Popen", return_value=mocked_popen) return mock_init, mock_open def test_mock_init_repository_init(mock_opam, tmp_path, datadir): """Initializing opam root directory with an instance should be ok""" mock_init, mock_popen = mock_opam instance = "fake" instance_url = f"file://{datadir}/{instance}" opam_root = str(tmp_path / "test-opam") assert not os.path.exists(opam_root) # This will initialize an opam directory with the instance opam_init(opam_root, instance, instance_url, {}) assert mock_init.called def test_mock_init_repository_update(mock_opam, tmp_path, datadir): """Updating opam root directory with another
instance should be ok""" mock_init, mock_popen = mock_opam instance = "fake_opam_repo" - instance_url = f"file://{datadir}/{instance}" + instance_url = f"http://example.org/{instance}" opam_root = str(tmp_path / "test-opam") os.makedirs(opam_root, exist_ok=True) with open(os.path.join(opam_root, "opam"), "w") as f: f.write("one file to avoid empty folder") assert os.path.exists(opam_root) assert os.listdir(opam_root) == ["opam"] # not empty # This will update the repository opam with another instance opam_init(opam_root, instance, instance_url, {}) assert mock_init.called def test_lister_opam_optional_instance(swh_scheduler): """Instance name should be optional and default to be built out of the netloc.""" netloc = "opam.ocaml.org" instance_url = f"https://{netloc}" lister = OpamLister( swh_scheduler, url=instance_url, ) assert lister.instance == netloc assert lister.opam_root == "/tmp/opam/" def test_urls(swh_scheduler, mock_opam, tmp_path): mock_init, mock_popen = mock_opam instance_url = "https://opam.ocaml.org" tmp_folder = mkdtemp(dir=tmp_path, prefix="swh_opam_lister") lister = OpamLister( swh_scheduler, url=instance_url, instance="opam", opam_root=tmp_folder, ) assert lister.instance == "opam" assert lister.opam_root == tmp_folder # call the lister and get all listed origins urls stats = lister.run() assert mock_init.called assert mock_popen.called assert stats.pages == 3 assert stats.origins == 3 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results expected_urls = [ f"opam+{instance_url}/packages/bar/", f"opam+{instance_url}/packages/baz/", f"opam+{instance_url}/packages/foo/", ] result_urls = [origin.url for origin in scheduler_origins] assert expected_urls == result_urls -def test_opam_binary(datadir, swh_scheduler, tmp_path): - instance_url = f"file://{datadir}/fake_opam_repo" +def test_opam_binary(datadir, swh_scheduler, tmp_path, mocker): + from swh.lister.opam.lister import opam_init + + instance_url = "http://example.org/fake_opam_repo" + + def mock_opam_init(opam_root, instance, url, env): + assert url == instance_url + return opam_init(opam_root, instance, f"{datadir}/fake_opam_repo", env) + + # Patch opam_init to use the local directory + mocker.patch("swh.lister.opam.lister.opam_init", side_effect=mock_opam_init) lister = OpamLister( swh_scheduler, url=instance_url, instance="fake", opam_root=mkdtemp(dir=tmp_path, prefix="swh_opam_lister"), ) stats = lister.run() assert stats.pages == 4 assert stats.origins == 4 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results expected_urls = [ f"opam+{instance_url}/packages/agrid/", f"opam+{instance_url}/packages/calculon/", f"opam+{instance_url}/packages/directories/", f"opam+{instance_url}/packages/ocb/", ] result_urls = [origin.url for origin in scheduler_origins] assert expected_urls == result_urls -def test_opam_multi_instance(datadir, swh_scheduler, tmp_path): - instance_url = f"file://{datadir}/fake_opam_repo" +def test_opam_multi_instance(datadir, swh_scheduler, tmp_path, mocker): + from swh.lister.opam.lister import opam_init + + instance_url = "http://example.org/fake_opam_repo" + + def mock_opam_init(opam_root, instance, url, env): + assert url == instance_url + return opam_init(opam_root, instance, f"{datadir}/fake_opam_repo", env) + + # Patch opam_init to use the local directory + mocker.patch("swh.lister.opam.lister.opam_init", side_effect=mock_opam_init) lister = OpamLister( swh_scheduler, url=instance_url, instance="fake", opam_root=mkdtemp(dir=tmp_path, 
prefix="swh_opam_lister"), ) stats = lister.run() assert stats.pages == 4 assert stats.origins == 4 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results expected_urls = [ f"opam+{instance_url}/packages/agrid/", f"opam+{instance_url}/packages/calculon/", f"opam+{instance_url}/packages/directories/", f"opam+{instance_url}/packages/ocb/", ] result_urls = [origin.url for origin in scheduler_origins] assert expected_urls == result_urls diff --git a/swh/lister/packagist/tests/data/payrix_payrix-php.json b/swh/lister/packagist/tests/data/payrix_payrix-php.json new file mode 100644 index 0000000..43a6c77 --- /dev/null +++ b/swh/lister/packagist/tests/data/payrix_payrix-php.json @@ -0,0 +1,151 @@ +{ + "packages": { + "payrix/payrix-php": { + "dev-master": { + "name": "payrix/payrix-php", + "description": "PayrixPHP PHP SDK package", + "keywords": [], + "homepage": "https://portal.payrix.com", + "version": "dev-master", + "version_normalized": "9999999-dev", + "license": [ + "Apache-2.0" + ], + "authors": [], + "source": { + "url": "git@gitlab.com:payrix/public/payrix-php.git", + "type": "git", + "reference": "cf02195d3c32424396932e087824bf581966e703" + }, + "dist": { + "url": "https://gitlab.com/api/v4/projects/payrix%2Fpublic%2Fpayrix-php/repository/archive.zip?sha=cf02195d3c32424396932e087824bf581966e703", + "type": "zip", + "shasum": "", + "reference": "cf02195d3c32424396932e087824bf581966e703" + }, + "type": "library", + "time": "2021-05-25T14:12:28+00:00", + "autoload": { + "psr-4": { + "PayrixPHP\\": "lib/" + } + }, + "default-branch": true, + "require": { + "php": ">=5.4.0", + "ext-curl": "*", + "ext-openssl": "*" + }, + "uid": 4416889 + }, + "v2.0.0": { + "name": "payrix/payrix-php", + "description": "PayrixPHP PHP SDK package", + "keywords": [], + "homepage": "https://portal.payrix.com", + "version": "v2.0.0", + "version_normalized": "2.0.0.0", + "license": [ + "Apache-2.0" + ], + "authors": [], + "source": { + "url": "https://gitlab.com/payrix/public/payrix-php.git", + "type": "git", + "reference": "4b40ad457a5cdbddb384b4d8f2c62d8d8c04ce68" + }, + "dist": { + "url": "https://gitlab.com/api/v4/projects/payrix%2Fpublic%2Fpayrix-php/repository/archive.zip?sha=4b40ad457a5cdbddb384b4d8f2c62d8d8c04ce68", + "type": "zip", + "shasum": "", + "reference": "4b40ad457a5cdbddb384b4d8f2c62d8d8c04ce68" + }, + "type": "library", + "time": "2020-09-03T11:26:52+00:00", + "autoload": { + "psr-4": { + "PayrixPHP\\": "lib/" + } + }, + "require": { + "php": ">=5.4.0", + "ext-curl": "*", + "ext-openssl": "*" + }, + "uid": 4416947 + }, + "v2.0.1": { + "name": "payrix/payrix-php", + "description": "PayrixPHP PHP SDK package", + "keywords": [], + "homepage": "https://portal.payrix.com", + "version": "v2.0.1", + "version_normalized": "2.0.1.0", + "license": [ + "Apache-2.0" + ], + "authors": [], + "source": { + "url": "https://gitlab.com/payrix/public/payrix-php.git", + "type": "git", + "reference": "9693f2dff0a589e16c88a9bf838069ab89166103" + }, + "dist": { + "url": "https://gitlab.com/api/v4/projects/payrix%2Fpublic%2Fpayrix-php/repository/archive.zip?sha=9693f2dff0a589e16c88a9bf838069ab89166103", + "type": "zip", + "shasum": "", + "reference": "9693f2dff0a589e16c88a9bf838069ab89166103" + }, + "type": "library", + "time": "2021-05-10T02:32:57+00:00", + "autoload": { + "psr-4": { + "PayrixPHP\\": "lib/" + } + }, + "require": { + "php": ">=5.4.0", + "ext-curl": "*", + "ext-openssl": "*" + }, + "uid": 5183918 + }, + "v2.0.2": { + "name": "payrix/payrix-php", + "description": "PayrixPHP 
PHP SDK package", + "keywords": [], + "homepage": "https://portal.payrix.com", + "version": "v2.0.2", + "version_normalized": "2.0.2.0", + "license": [ + "Apache-2.0" + ], + "authors": [], + "source": { + "url": "https://gitlab.com/payrix/public/payrix-php.git", + "type": "git", + "reference": "cf02195d3c32424396932e087824bf581966e703" + }, + "dist": { + "url": "https://gitlab.com/api/v4/projects/payrix%2Fpublic%2Fpayrix-php/repository/archive.zip?sha=cf02195d3c32424396932e087824bf581966e703", + "type": "zip", + "shasum": "", + "reference": "cf02195d3c32424396932e087824bf581966e703" + }, + "type": "library", + "time": "2021-05-25T10:12:28+00:00", + "autoload": { + "psr-4": { + "PayrixPHP\\": "lib/" + } + }, + "require": { + "php": ">=5.4.0", + "ext-curl": "*", + "ext-openssl": "*" + }, + "uid": 5232658 + } + } + } +} diff --git a/swh/lister/packagist/tests/data/with_invalid_url.json b/swh/lister/packagist/tests/data/with_invalid_url.json new file mode 100644 index 0000000..4b281ea --- /dev/null +++ b/swh/lister/packagist/tests/data/with_invalid_url.json @@ -0,0 +1,24 @@ +{ + "packages": { + "ycms/module-main": { + "dev-master": { + "name": "with/invalid_url", + "description": "", + "keywords": [], + "homepage": "", + "version": "dev-master", + "version_normalized": "9999999-dev", + "license": [], + "authors": [], + "source": { + "type": "git", + "url": "git@example.org/invalid/url.git", + "reference": "0000000000000000000000000000000000000000" + }, + "time": "2015-08-23T04:42:33+00:00", + "default-branch": true, + "uid": 4064797 + } + } + } +} diff --git a/swh/lister/packagist/tests/test_lister.py b/swh/lister/packagist/tests/test_lister.py index e2782ee..4f512a2 100644 --- a/swh/lister/packagist/tests/test_lister.py +++ b/swh/lister/packagist/tests/test_lister.py @@ -1,199 +1,201 @@ # Copyright (C) 2019-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import datetime import json from pathlib import Path from swh.lister.packagist.lister import PackagistLister _packages_list = { "packageNames": [ "ljjackson/linnworks", "lky/wx_article", "spryker-eco/computop-api", - "idevlab/essential", + "idevlab/essential", # Git SSH URL + "payrix/payrix-php", + "with/invalid_url", # invalid URL ] } def _package_metadata(datadir, package_name): return json.loads( Path(datadir, f"{package_name.replace('/', '_')}.json").read_text() ) def _request_without_if_modified_since(request): return request.headers.get("If-Modified-Since") is None def _request_with_if_modified_since(request): return request.headers.get("If-Modified-Since") is not None def test_packagist_lister(swh_scheduler, requests_mock, datadir, requests_mock_datadir): # first listing, should return one origin per package lister = PackagistLister(scheduler=swh_scheduler) requests_mock.get(lister.PACKAGIST_PACKAGES_LIST_URL, json=_packages_list) packages_metadata = {} for package_name in _packages_list["packageNames"]: metadata = _package_metadata(datadir, package_name) packages_metadata[package_name] = metadata requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", json=metadata, additional_matcher=_request_without_if_modified_since, ) stats = lister.run() assert stats.pages == 1 - assert stats.origins == len(_packages_list["packageNames"]) + assert stats.origins == len(_packages_list["packageNames"]) - 2 assert lister.updated expected_origins = { ( 
"https://github.com/gitlky/wx_article", # standard case "git", datetime.datetime.fromisoformat("2018-08-30T07:37:09+00:00"), ), ( "https://github.com/ljjackson/linnworks.git", # API goes 404 "git", datetime.datetime.fromisoformat("2018-11-01T21:45:50+00:00"), ), ( "https://github.com/spryker-eco/computop-api", # SSH URL in manifest "git", datetime.datetime.fromisoformat("2020-06-22T15:50:29+00:00"), ), ( - "git@gitlab.com:idevlab/Essential.git", # not GitHub + "https://gitlab.com/payrix/public/payrix-php.git", # not GitHub "git", - datetime.datetime.fromisoformat("2022-10-12T10:34:29+00:00"), + datetime.datetime.fromisoformat("2021-05-25T14:12:28+00:00"), ), } assert expected_origins == { (o.url, o.visit_type, o.last_update) for o in swh_scheduler.get_listed_origins(lister.lister_obj.id).results } # second listing, should return 0 origins as no package metadata # has been updated since first listing lister = PackagistLister(scheduler=swh_scheduler) for package_name in _packages_list["packageNames"]: requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", additional_matcher=_request_with_if_modified_since, status_code=304, ) assert lister.get_state_from_scheduler().last_listing_date is not None stats = lister.run() assert stats.pages == 1 assert stats.origins == 0 assert lister.updated assert expected_origins == { (o.url, o.visit_type, o.last_update) for o in swh_scheduler.get_listed_origins(lister.lister_obj.id).results } def test_packagist_lister_missing_metadata(swh_scheduler, requests_mock, datadir): lister = PackagistLister(scheduler=swh_scheduler) requests_mock.get(lister.PACKAGIST_PACKAGES_LIST_URL, json=_packages_list) for package_name in _packages_list["packageNames"]: requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", additional_matcher=_request_without_if_modified_since, status_code=404, ) stats = lister.run() assert stats.pages == 1 assert stats.origins == 0 def test_packagist_lister_empty_metadata(swh_scheduler, requests_mock, datadir): lister = PackagistLister(scheduler=swh_scheduler) requests_mock.get(lister.PACKAGIST_PACKAGES_LIST_URL, json=_packages_list) for package_name in _packages_list["packageNames"]: requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", additional_matcher=_request_without_if_modified_since, json={"packages": {}}, ) stats = lister.run() assert stats.pages == 1 assert stats.origins == 0 def test_packagist_lister_package_with_bitbucket_hg_origin( swh_scheduler, requests_mock, datadir ): package_name = "den1n/contextmenu" lister = PackagistLister(scheduler=swh_scheduler) requests_mock.get( lister.PACKAGIST_PACKAGES_LIST_URL, json={"packageNames": [package_name]} ) requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", additional_matcher=_request_without_if_modified_since, json=_package_metadata(datadir, package_name), ) stats = lister.run() assert stats.pages == 1 assert stats.origins == 0 def test_packagist_lister_package_normalize_github_origin( swh_scheduler, requests_mock, datadir, requests_mock_datadir ): package_name = "ycms/module-main" lister = PackagistLister(scheduler=swh_scheduler) requests_mock.get( lister.PACKAGIST_PACKAGES_LIST_URL, json={"packageNames": [package_name]} ) requests_mock.get( f"{lister.PACKAGIST_REPO_BASE_URL}/{package_name}.json", additional_matcher=_request_without_if_modified_since, json=_package_metadata(datadir, package_name), ) stats = lister.run() assert stats.pages == 1 assert stats.origins == 1 expected_origins = { ( 
"https://github.com/GameCHN/module-main", "git", datetime.datetime.fromisoformat("2015-08-23T04:42:33+00:00"), ), } assert expected_origins == { (o.url, o.visit_type, o.last_update) for o in swh_scheduler.get_listed_origins(lister.lister_obj.id).results } def test_lister_from_configfile(swh_scheduler_config, mocker): load_from_envvar = mocker.patch("swh.lister.pattern.load_from_envvar") load_from_envvar.return_value = { "scheduler": {"cls": "local", **swh_scheduler_config}, "credentials": {}, } lister = PackagistLister.from_configfile() assert lister.scheduler is not None assert lister.credentials is not None diff --git a/swh/lister/pattern.py b/swh/lister/pattern.py index 5b3a33d..8a1b497 100644 --- a/swh/lister/pattern.py +++ b/swh/lister/pattern.py @@ -1,332 +1,339 @@ # Copyright (C) 2020-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from __future__ import annotations from dataclasses import dataclass import logging from typing import Any, Dict, Generic, Iterable, Iterator, List, Optional, Set, TypeVar from urllib.parse import urlparse import requests from tenacity.before_sleep import before_sleep_log from swh.core.config import load_from_envvar from swh.core.github.utils import GitHubSession from swh.core.utils import grouper from swh.scheduler import get_scheduler, model from swh.scheduler.interface import SchedulerInterface from . import USER_AGENT_TEMPLATE -from .utils import http_retry +from .utils import http_retry, is_valid_origin_url logger = logging.getLogger(__name__) @dataclass class ListerStats: pages: int = 0 origins: int = 0 def __add__(self, other: ListerStats) -> ListerStats: return self.__class__(self.pages + other.pages, self.origins + other.origins) def __iadd__(self, other: ListerStats): self.pages += other.pages self.origins += other.origins def dict(self) -> Dict[str, int]: return {"pages": self.pages, "origins": self.origins} StateType = TypeVar("StateType") PageType = TypeVar("PageType") BackendStateType = Dict[str, Any] CredentialsType = Optional[Dict[str, Dict[str, List[Dict[str, str]]]]] class Lister(Generic[StateType, PageType]): """The base class for a Software Heritage lister. A lister scrapes a page by page list of origins from an upstream (a forge, the API of a package manager, ...), and massages the results of that scrape into a list of origins that are recorded by the scheduler backend. The main loop of the lister, :meth:`run`, basically revolves around the :meth:`get_pages` iterator, which sets up the lister state, then yields the scrape results page by page. The :meth:`get_origins_from_page` method converts the pages into a list of :class:`model.ListedOrigin`, sent to the scheduler at every page. The :meth:`commit_page` method can be used to update the lister state after a page of origins has been recorded in the scheduler backend. The :func:`finalize` method is called at lister teardown (whether the run has been successful or not) to update the local :attr:`state` object before it's sent to the database. This method must set the :attr:`updated` attribute if an updated state needs to be sent to the scheduler backend. This method can call :func:`get_state_from_scheduler` to refresh and merge the lister state from the scheduler before it's finalized (and potentially minimize the risk of race conditions between concurrent runs of the lister). 
The state of the lister is serialized and deserialized from the dict stored in the scheduler backend, using the :meth:`state_from_dict` and :meth:`state_to_dict` methods. Args: scheduler: the instance of the Scheduler being used to register the origins listed by this lister url: a URL representing this lister, e.g. the API's base URL instance: the instance name, to uniquely identify this lister instance; if not provided, the URL network location will be used credentials: dictionary of credentials for all listers. The first level identifies the :attr:`LISTER_NAME`, the second level the lister :attr:`instance`. The final level is a list of dicts containing the expected credentials for the given instance of that lister. Generic types: - *StateType*: concrete lister state type; should usually be a :class:`dataclass` for stricter typing - *PageType*: type of scrape results; can usually be a :class:`requests.Response`, or a :class:`dict` """ LISTER_NAME: str = "" github_session: Optional[GitHubSession] = None def __init__( self, scheduler: SchedulerInterface, url: str, instance: Optional[str] = None, credentials: CredentialsType = None, with_github_session: bool = False, ): if not self.LISTER_NAME: raise ValueError("Must set the LISTER_NAME attribute on Lister classes") self.url = url if instance is not None: self.instance = instance else: self.instance = urlparse(url).netloc self.scheduler = scheduler if not credentials: credentials = {} self.credentials = list( credentials.get(self.LISTER_NAME, {}).get(self.instance, []) ) # store the initial state of the lister self.state = self.get_state_from_scheduler() self.updated = False self.session = requests.Session() # Declaring the USER_AGENT makes the lister more sysadmin-friendly for the forges we list self.session.headers.update( {"User-Agent": USER_AGENT_TEMPLATE % self.LISTER_NAME} ) self.github_session: Optional[GitHubSession] = ( GitHubSession( credentials=credentials.get("github", {}).get("github", []), user_agent=str(self.session.headers["User-Agent"]), ) if with_github_session else None ) self.recorded_origins: Set[str] = set() @http_retry(before_sleep=before_sleep_log(logger, logging.WARNING)) def http_request(self, url: str, method="GET", **kwargs) -> requests.Response: logger.debug("Fetching URL %s with params %s", url, kwargs.get("params")) response = self.session.request(method, url, **kwargs) if response.status_code not in (200, 304): logger.warning( "Unexpected HTTP status code %s on %s: %s", response.status_code, response.url, response.content, ) response.raise_for_status() return response def run(self) -> ListerStats: """Run the lister. Returns: A counter with the number of pages and origins seen for this run of the lister. """ full_stats = ListerStats() self.recorded_origins = set() try: for page in self.get_pages(): full_stats.pages += 1 origins = self.get_origins_from_page(page) sent_origins = self.send_origins(origins) self.recorded_origins.update(sent_origins) full_stats.origins = len(self.recorded_origins) self.commit_page(page) finally: self.finalize() if self.updated: self.set_state_in_scheduler() return full_stats def get_state_from_scheduler(self) -> StateType: """Update the state in the current instance from the state in the scheduler backend. This updates :attr:`lister_obj`, and returns its (deserialized) current state, to allow for comparison with the local state.
Returns: the state retrieved from the scheduler backend """ self.lister_obj = self.scheduler.get_or_create_lister( name=self.LISTER_NAME, instance_name=self.instance ) return self.state_from_dict(self.lister_obj.current_state) def set_state_in_scheduler(self) -> None: """Update the state in the scheduler backend from the state of the current instance. Raises: swh.scheduler.exc.StaleData: in case of a race condition between concurrent listers (from :meth:`swh.scheduler.Scheduler.update_lister`). """ self.lister_obj.current_state = self.state_to_dict(self.state) self.lister_obj = self.scheduler.update_lister(self.lister_obj) # State management to/from the scheduler def state_from_dict(self, d: BackendStateType) -> StateType: """Convert the state stored in the scheduler backend (as a dict) to the concrete StateType for this lister.""" raise NotImplementedError def state_to_dict(self, state: StateType) -> BackendStateType: """Convert the StateType for this lister to its serialization as a dict for storage in the scheduler. Values must be JSON-compatible as that's what the backend database expects. """ raise NotImplementedError def finalize(self) -> None: """Custom hook to finalize the lister state before returning from the main loop. This method must set :attr:`updated` if the lister has done some work. If relevant, this method can use :meth:`get_state_from_scheduler` to merge the current lister state with the one from the scheduler backend, reducing the risk of race conditions if we're running concurrent listings. This method is called in a `finally` block, which means it will also run when the lister fails. """ pass # Actual listing logic def get_pages(self) -> Iterator[PageType]: """Retrieve a list of pages of listed results. This is the main loop of the lister. Returns: an iterator of raw pages fetched from the platform currently being listed. """ raise NotImplementedError def get_origins_from_page(self, page: PageType) -> Iterator[model.ListedOrigin]: """Extract a list of :class:`model.ListedOrigin` from a raw page of results. Args: page: a single page of results Returns: an iterator for the origins present on the given page of results """ raise NotImplementedError def commit_page(self, page: PageType) -> None: """Custom hook called after the current page has been committed in the scheduler backend. This method can be used to update the state after a page of origins has been successfully recorded in the scheduler backend. If the new state should be recorded at the point the lister completes, the :attr:`updated` attribute must be set. """ pass def send_origins(self, origins: Iterable[model.ListedOrigin]) -> List[str]: """Record a list of :class:`model.ListedOrigin` in the scheduler. Returns: the list of origin URLs recorded in scheduler database """ + valid_origins = [] + for origin in origins: + if is_valid_origin_url(origin.url): + valid_origins.append(origin) + else: + logger.warning("Skipping invalid origin: %s", origin.url) + recorded_origins = [] - for batch_origins in grouper(origins, n=1000): + for batch_origins in grouper(valid_origins, n=1000): ret = self.scheduler.record_listed_origins(batch_origins) recorded_origins += [origin.url for origin in ret] return recorded_origins @classmethod def from_config(cls, scheduler: Dict[str, Any], **config: Any): """Instantiate a lister from a configuration dict. This is basically a backwards-compatibility shim for the CLI.
Args: scheduler: instantiation config for the scheduler config: the configuration dict for the lister, with the following keys: - credentials (optional): credentials list for the scheduler - any other kwargs passed to the lister. Returns: the instantiated lister """ # Drop the legacy config keys which aren't used for this generation of listers. for legacy_key in ("storage", "lister", "celery"): config.pop(legacy_key, None) # Instantiate the scheduler scheduler_instance = get_scheduler(**scheduler) return cls(scheduler=scheduler_instance, **config) @classmethod def from_configfile(cls, **kwargs: Any): """Instantiate a lister from the configuration loaded from the SWH_CONFIG_FILENAME envvar, with potential extra keyword arguments if their value is not None. Args: kwargs: kwargs passed to the lister instantiation """ config = dict(load_from_envvar()) config.update({k: v for k, v in kwargs.items() if v is not None}) return cls.from_config(**config) class StatelessLister(Lister[None, PageType], Generic[PageType]): def state_from_dict(self, d: BackendStateType) -> None: """Always return empty state""" return None def state_to_dict(self, state: None) -> BackendStateType: """Always set empty state""" return {} diff --git a/swh/lister/pubdev/__init__.py b/swh/lister/pubdev/__init__.py index 310595f..06ca022 100644 --- a/swh/lister/pubdev/__init__.py +++ b/swh/lister/pubdev/__init__.py @@ -1,71 +1,71 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """ Pub.dev lister ============== The Pubdev lister lists origins from `pub.dev`_, the `Dart`_ and `Flutter`_ packages registry. The registry provides an `http api`_ from which the lister retrieves package names. As of August 2022 `pub.dev`_ lists 33535 package names. Origins retrieving strategy --------------------------- -To get a list of all package names we call `https://pub.dev/api/packages` endpoint. +To get a list of all package names we call the `https://pub.dev/api/package-names` endpoint. There is no other way for discovery (no archive index, no database dump, no dvcs repository). -Page listing ------------- -There is only one page that list all origins url based -on `https://pub.dev/api/packages/{pkgname}`. -The origin url corresponds to the http api endpoint that returns complete information -about the package versions (name, version, author, description, release date). Origins from page ----------------- -The lister yields all origins url from one page. +The lister yields all origin urls from a single page. +Getting last update date for each package +----------------------------------------- +Before sending a listed pubdev origin to the scheduler, we query the +`https://pub.dev/api/packages/{pkgname}` endpoint to get the last update date +of a package (the date of its latest release). This enables Software Heritage to create +new loading tasks for a package only if it has new releases since the last visit.
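For illustration, this lookup boils down to one extra HTTP GET per package. A hedged sketch (the package name is hypothetical, and the ``latest``/``published`` field layout is an assumption about the pub.dev API response)::

    import requests

    # Hypothetical package name; any pub.dev package works the same way.
    info = requests.get("https://pub.dev/api/packages/some_package").json()
    # Assumed response layout: the latest release carries its publication date.
    last_update = info["latest"]["published"]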
Running tests ------------- Activate the virtualenv and run from within swh-lister directory:: pytest -s -vv --log-cli-level=DEBUG swh/lister/pubdev/tests Testing with Docker ------------------- Change directory to swh/docker then launch the docker environment:: docker-compose up -d Then schedule a pubdev listing task:: docker compose exec swh-scheduler swh scheduler task add -p oneshot list-pubdev You can follow lister execution by displaying logs of swh-lister service:: docker compose logs -f swh-lister .. _pub.dev: https://pub.dev .. _Dart: https://dart.dev .. _Flutter: https://flutter.dev .. _http api: https://pub.dev/help/api """ def register(): from .lister import PubDevLister return { "lister": PubDevLister, "task_modules": ["%s.tasks" % __name__], } diff --git a/swh/lister/utils.py b/swh/lister/utils.py index 125b31b..3220d4d 100644 --- a/swh/lister/utils.py +++ b/swh/lister/utils.py @@ -1,113 +1,161 @@ # Copyright (C) 2018-2022 the Software Heritage developers # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -from typing import Callable, Iterator, Tuple +from typing import Callable, Iterator, Optional, Tuple +import urllib.parse from requests.exceptions import ConnectionError, HTTPError from requests.status_codes import codes from tenacity import retry as tenacity_retry from tenacity.stop import stop_after_attempt from tenacity.wait import wait_exponential def split_range(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]: """Split `total_pages` into mostly `nb_pages` ranges. In some cases, the last range can have one more element. >>> list(split_range(19, 10)) [(0, 9), (10, 19)] >>> list(split_range(20, 3)) [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 20)] >>> list(split_range(21, 3)) [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 21)] """ prev_index = None for index in range(0, total_pages, nb_pages): if index is not None and prev_index is not None: yield prev_index, index - 1 prev_index = index if index != total_pages: yield index, total_pages def is_throttling_exception(e: Exception) -> bool: """ Checks if an exception is a requests.exception.HTTPError for a response with status code 429 (too many requests). """ return ( isinstance(e, HTTPError) and e.response.status_code == codes.too_many_requests ) def is_retryable_exception(e: Exception) -> bool: """ Checks if an exception is worth retrying (connection, throttling or a server error). """ is_connection_error = isinstance(e, ConnectionError) is_500_error = isinstance(e, HTTPError) and e.response.status_code >= 500 return is_connection_error or is_throttling_exception(e) or is_500_error def retry_if_exception(retry_state, predicate: Callable[[Exception], bool]) -> bool: """ Custom tenacity retry predicate for handling exceptions with the given predicate. """ attempt = retry_state.outcome if attempt.failed: exception = attempt.exception() return predicate(exception) return False def retry_policy_generic(retry_state) -> bool: """ Custom tenacity retry predicate for handling failed requests: - ConnectionError - Server errors (status >= 500) - Throttling errors (status == 429) This does not handle 404, 403 or other status codes. 
""" return retry_if_exception(retry_state, is_retryable_exception) WAIT_EXP_BASE = 10 MAX_NUMBER_ATTEMPTS = 5 def http_retry( retry=retry_policy_generic, wait=wait_exponential(exp_base=WAIT_EXP_BASE), stop=stop_after_attempt(max_attempt_number=MAX_NUMBER_ATTEMPTS), **retry_args, ): """ Decorator based on `tenacity` for retrying a function possibly raising requests.exception.HTTPError for status code 429 (too many requests). It provides a default configuration that should work properly in most cases but all `tenacity.retry` parameters can also be overridden in client code. When the mmaximum of attempts is reached, the HTTPError exception will then be reraised. Args: retry: function defining request retry condition (default to 429 status code) https://tenacity.readthedocs.io/en/latest/#whether-to-retry wait: function defining wait strategy before retrying (default to exponential backoff) https://tenacity.readthedocs.io/en/latest/#waiting-before-retrying stop: function defining when to stop retrying (default after 5 attempts) https://tenacity.readthedocs.io/en/latest/#stopping """ return tenacity_retry(retry=retry, wait=wait, stop=stop, reraise=True, **retry_args) + + +def is_valid_origin_url(url: Optional[str]) -> bool: + """Returns whether the given string is a valid origin URL. + This excludes Git SSH URLs and pseudo-URLs (eg. ``ssh://git@example.org:foo`` + and ``git@example.org:foo``), as they are not supported by the Git loader + and usually require authentication. + + All HTTP URLs are allowed: + + >>> is_valid_origin_url("http://example.org/repo.git") + True + >>> is_valid_origin_url("http://example.org/repo") + True + >>> is_valid_origin_url("https://example.org/repo") + True + >>> is_valid_origin_url("https://foo:bar@example.org/repo") + True + + Scheme-less URLs are rejected; + + >>> is_valid_origin_url("example.org/repo") + False + >>> is_valid_origin_url("example.org:repo") + False + + Git SSH URLs and pseudo-URLs are rejected: + + >>> is_valid_origin_url("git@example.org:repo") + False + >>> is_valid_origin_url("ssh://git@example.org:repo") + False + """ + if not url: + # Empty or None + return False + + parsed = urllib.parse.urlparse(url) + if not parsed.netloc: + # Is parsed as a relative URL + return False + + if parsed.scheme == "ssh": + # Git SSH URL + return False + + return True