diff --git a/PKG-INFO b/PKG-INFO index e4106ab..e6f27f8 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,127 +1,127 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 2.9.0 +Version: 2.9.1 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component from the Software Heritage stack aims to produce listings of software origins and their urls hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors. It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. 
Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister names introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. create configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimal configuration, shared by all listers, to add to the file `~/.config/swh/listers.yml`: ```lang=yml scheduler: cls: 'remote' args: url: 'http://localhost:5008/' credentials: {} ``` Note: this expects the scheduler service to run locally on port 5008. ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters] ``` Examples: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi ``` Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
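The two preparation steps above can be scripted. A minimal sketch, using a temporary directory in place of `~/.config/swh/` so it is side-effect free (the real path comes from the instructions above):

```python
import tempfile
from pathlib import Path

# Minimal listers.yml content shared by all listers, as shown above.
CONFIG = """\
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'
credentials: {}
"""

# Stand-in for ~/.config/swh/ so this sketch has no side effects.
base = Path(tempfile.mkdtemp())
config_dir = base / ".config" / "swh"
config_dir.mkdir(parents=True, exist_ok=True)  # step 1: mkdir ~/.config/swh/
config_path = config_dir / "listers.yml"
config_path.write_text(CONFIG)                 # step 2: create listers.yml
```

Point the CLI at the resulting file with `swh lister -C <path> run ...`.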
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO index e4106ab..e6f27f8 100644 --- a/swh.lister.egg-info/PKG-INFO +++ b/swh.lister.egg-info/PKG-INFO @@ -1,127 +1,127 @@ Metadata-Version: 2.1 Name: swh.lister -Version: 2.9.0 +Version: 2.9.1 Summary: Software Heritage lister Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 5 - Production/Stable Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE swh-lister ========== This component from the Software Heritage stack aims to produce listings of software origins and their urls hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors. 
It also provides several lister implementations, contained in the following Python modules: - `swh.lister.bitbucket` - `swh.lister.cgit` - `swh.lister.cran` - `swh.lister.debian` - `swh.lister.gitea` - `swh.lister.github` - `swh.lister.gitlab` - `swh.lister.gnu` - `swh.lister.launchpad` - `swh.lister.maven` - `swh.lister.npm` - `swh.lister.packagist` - `swh.lister.phabricator` - `swh.lister.pypi` - `swh.lister.tuleap` Dependencies ------------ All required dependencies can be found in the `requirements*.txt` files located at the root of the repository. Local deployment ---------------- ## lister configuration Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (please note that you have to replace `<lister_name>` by one of the lister names introduced above). ### Preparation steps 1. `mkdir ~/.config/swh/` 2. 
create configuration file `~/.config/swh/listers.yml` ### Configuration file sample Minimal configuration, shared by all listers, to add to the file `~/.config/swh/listers.yml`: ```lang=yml scheduler: cls: 'remote' args: url: 'http://localhost:5008/' credentials: {} ``` Note: this expects the scheduler service to run locally on port 5008. ## Executing a lister Once configured, a lister can be executed by using the `swh` CLI tool with the following options and commands: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister <lister_name> [lister_parameters] ``` Examples: ``` $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/ $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm $ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi ``` Licensing --------- This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. See top-level LICENSE file for the full text of the GNU General Public License along with this program. 
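For reference, each file in the crates.io-index repository that the crates lister (reworked later in this diff) clones contains one JSON document per published version, and `get_pages` keeps only a few of its fields. A minimal standard-library sketch of that per-line parsing, reusing the `rand 0.1.1` data from the test fixtures in this diff:

```python
import json

# Download URL template used by the crates lister (from the diff).
CRATE_FILE_URL_PATTERN = (
    "https://static.crates.io/crates/{crate}/{crate}-{version}.crate"
)

# One line of an index file: a JSON document describing a single version.
line = (
    '{"name": "rand", "vers": "0.1.1", '
    '"cksum": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d"}'
)

data = json.loads(line)
# Pick only the data the lister needs, as get_pages does.
entry = {
    "name": data["name"],
    "version": data["vers"],
    "checksum": data["cksum"],
    "crate_file": CRATE_FILE_URL_PATTERN.format(
        crate=data["name"], version=data["vers"]
    ),
}
```

One such entry is produced per version; all entries for a crate form one page.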
diff --git a/swh.lister.egg-info/SOURCES.txt b/swh.lister.egg-info/SOURCES.txt index 004158a..77b24f0 100644 --- a/swh.lister.egg-info/SOURCES.txt +++ b/swh.lister.egg-info/SOURCES.txt @@ -1,264 +1,265 @@ .git-blame-ignore-revs .gitignore .pre-commit-config.yaml ACKNOWLEDGEMENTS CODE_OF_CONDUCT.md CONTRIBUTORS LICENSE MANIFEST.in Makefile README.md conftest.py mypy.ini pyproject.toml pytest.ini requirements-swh.txt requirements-test.txt requirements.txt setup.cfg setup.py tox.ini docs/.gitignore docs/Makefile docs/cli.rst docs/conf.py docs/index.rst docs/new_lister_template.py docs/run_a_new_lister.rst docs/save_forge.rst docs/tutorial.rst docs/_static/.placeholder docs/_templates/.placeholder docs/images/new_base.png docs/images/new_bitbucket_lister.png docs/images/new_github_lister.png docs/images/old_github_lister.png sql/crawler.sql sql/pimp_db.sql swh/__init__.py swh.lister.egg-info/PKG-INFO swh.lister.egg-info/SOURCES.txt swh.lister.egg-info/dependency_links.txt swh.lister.egg-info/entry_points.txt swh.lister.egg-info/requires.txt swh.lister.egg-info/top_level.txt swh/lister/__init__.py swh/lister/cli.py swh/lister/pattern.py swh/lister/py.typed swh/lister/utils.py swh/lister/bitbucket/__init__.py swh/lister/bitbucket/lister.py swh/lister/bitbucket/tasks.py swh/lister/bitbucket/tests/__init__.py swh/lister/bitbucket/tests/test_lister.py swh/lister/bitbucket/tests/test_tasks.py swh/lister/bitbucket/tests/data/bb_api_repositories_page1.json swh/lister/bitbucket/tests/data/bb_api_repositories_page2.json swh/lister/cgit/__init__.py swh/lister/cgit/lister.py swh/lister/cgit/tasks.py swh/lister/cgit/tests/__init__.py swh/lister/cgit/tests/repo_list.txt swh/lister/cgit/tests/test_lister.py swh/lister/cgit/tests/test_tasks.py swh/lister/cgit/tests/data/https_git.baserock.org/cgit swh/lister/cgit/tests/data/https_git.eclipse.org/c swh/lister/cgit/tests/data/https_git.savannah.gnu.org/README swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit 
swh/lister/cgit/tests/data/https_git.savannah.gnu.org/cgit_elisp-es.git swh/lister/cgit/tests/data/https_git.tizen/README swh/lister/cgit/tests/data/https_git.tizen/cgit swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=100 swh/lister/cgit/tests/data/https_git.tizen/cgit,ofs=50 swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_All-Users swh/lister/cgit/tests/data/https_git.tizen/cgit_Lock-Projects swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-base swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_alsa-scenario-scn-data-0-mc1n2 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e3250 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_ap_samsung_audio-hal-e4x12 swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_devices_nfc-plugin-nxp swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_intel_mfld_bootstub-mfld-blackbay swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_mtdev swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_opengl-es-virtual-drv swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libdrm swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_panda_libnl swh/lister/cgit/tests/data/https_git.tizen/cgit_adaptation_xorg_driver_xserver-xorg-misc swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-gallery-efl swh/lister/cgit/tests/data/https_git.tizen/cgit_apps_core_preloaded_ug-setting-homescreen-efl swh/lister/cgit/tests/data/https_jff.email/cgit swh/lister/cran/__init__.py swh/lister/cran/list_all_packages.R swh/lister/cran/lister.py swh/lister/cran/tasks.py swh/lister/cran/tests/__init__.py swh/lister/cran/tests/test_lister.py swh/lister/cran/tests/test_tasks.py swh/lister/cran/tests/data/list-r-packages.json swh/lister/crates/__init__.py swh/lister/crates/lister.py swh/lister/crates/tasks.py swh/lister/crates/tests/__init__.py 
swh/lister/crates/tests/test_lister.py swh/lister/crates/tests/test_tasks.py swh/lister/crates/tests/data/fake-crates-repository.tar.gz swh/lister/crates/tests/data/fake_crates_repository_init.sh swh/lister/debian/__init__.py swh/lister/debian/lister.py swh/lister/debian/tasks.py swh/lister/debian/tests/__init__.py swh/lister/debian/tests/test_lister.py swh/lister/debian/tests/test_tasks.py swh/lister/debian/tests/data/Sources_bullseye swh/lister/debian/tests/data/Sources_buster swh/lister/debian/tests/data/Sources_stretch swh/lister/gitea/__init__.py swh/lister/gitea/lister.py swh/lister/gitea/tasks.py swh/lister/gitea/tests/__init__.py swh/lister/gitea/tests/test_lister.py swh/lister/gitea/tests/test_tasks.py swh/lister/gitea/tests/data/https_try.gitea.io/repos_page1 swh/lister/gitea/tests/data/https_try.gitea.io/repos_page2 swh/lister/github/__init__.py swh/lister/github/lister.py swh/lister/github/tasks.py swh/lister/github/utils.py swh/lister/github/tests/__init__.py swh/lister/github/tests/test_lister.py swh/lister/github/tests/test_tasks.py swh/lister/gitlab/__init__.py swh/lister/gitlab/lister.py swh/lister/gitlab/tasks.py swh/lister/gitlab/tests/__init__.py swh/lister/gitlab/tests/test_lister.py swh/lister/gitlab/tests/test_tasks.py swh/lister/gitlab/tests/data/https_foss.heptapod.net/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page1.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page2.json swh/lister/gitlab/tests/data/https_gite.lirmm.fr/api_response_page3.json swh/lister/gitlab/tests/data/https_gitlab.com/api_response_page1.json swh/lister/gnu/__init__.py swh/lister/gnu/lister.py swh/lister/gnu/tasks.py swh/lister/gnu/tree.py swh/lister/gnu/tests/__init__.py swh/lister/gnu/tests/test_lister.py swh/lister/gnu/tests/test_tasks.py swh/lister/gnu/tests/test_tree.py swh/lister/gnu/tests/data/tree.json swh/lister/gnu/tests/data/tree.min.json swh/lister/gnu/tests/data/https_ftp.gnu.org/tree.json.gz 
swh/lister/launchpad/__init__.py swh/lister/launchpad/lister.py swh/lister/launchpad/tasks.py swh/lister/launchpad/tests/__init__.py swh/lister/launchpad/tests/conftest.py swh/lister/launchpad/tests/test_lister.py swh/lister/launchpad/tests/test_tasks.py swh/lister/launchpad/tests/data/launchpad_bzr_response.json swh/lister/launchpad/tests/data/launchpad_response1.json swh/lister/launchpad/tests/data/launchpad_response2.json swh/lister/maven/README.md swh/lister/maven/__init__.py swh/lister/maven/lister.py swh/lister/maven/tasks.py swh/lister/maven/tests/__init__.py swh/lister/maven/tests/test_lister.py swh/lister/maven/tests/test_tasks.py -swh/lister/maven/tests/data/http_indexes/export.fld -swh/lister/maven/tests/data/http_indexes/export_incr.fld +swh/lister/maven/tests/data/http_indexes/export_full.fld +swh/lister/maven/tests/data/http_indexes/export_incr_first.fld +swh/lister/maven/tests/data/http_indexes/export_null_mtime.fld swh/lister/maven/tests/data/https_maven.org/arangodb-graphql-1.2.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.0.malformed.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.0.pom swh/lister/maven/tests/data/https_maven.org/sprova4j-0.1.1.pom swh/lister/npm/__init__.py swh/lister/npm/lister.py swh/lister/npm/tasks.py swh/lister/npm/tests/test_lister.py swh/lister/npm/tests/test_tasks.py swh/lister/npm/tests/data/npm_full_page1.json swh/lister/npm/tests/data/npm_full_page2.json swh/lister/npm/tests/data/npm_incremental_page1.json swh/lister/npm/tests/data/npm_incremental_page2.json swh/lister/opam/__init__.py swh/lister/opam/lister.py swh/lister/opam/tasks.py swh/lister/opam/tests/__init__.py swh/lister/opam/tests/test_lister.py swh/lister/opam/tests/test_tasks.py swh/lister/opam/tests/data/fake_opam_repo/repo swh/lister/opam/tests/data/fake_opam_repo/version swh/lister/opam/tests/data/fake_opam_repo/packages/agrid/agrid.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.1/opam 
swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.4/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.5/opam swh/lister/opam/tests/data/fake_opam_repo/packages/calculon/calculon.0.6/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.1/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.2/opam swh/lister/opam/tests/data/fake_opam_repo/packages/directories/directories.0.3/opam swh/lister/opam/tests/data/fake_opam_repo/packages/ocb/ocb.0.1/opam swh/lister/packagist/__init__.py swh/lister/packagist/lister.py swh/lister/packagist/tasks.py swh/lister/packagist/tests/__init__.py swh/lister/packagist/tests/test_lister.py swh/lister/packagist/tests/test_tasks.py swh/lister/packagist/tests/data/den1n_contextmenu.json swh/lister/packagist/tests/data/ljjackson_linnworks.json swh/lister/packagist/tests/data/lky_wx_article.json swh/lister/packagist/tests/data/spryker-eco_computop-api.json swh/lister/phabricator/__init__.py swh/lister/phabricator/lister.py swh/lister/phabricator/tasks.py swh/lister/phabricator/tests/__init__.py swh/lister/phabricator/tests/test_lister.py swh/lister/phabricator/tests/test_tasks.py swh/lister/phabricator/tests/data/__init__.py swh/lister/phabricator/tests/data/phabricator_api_repositories_page1.json swh/lister/phabricator/tests/data/phabricator_api_repositories_page2.json swh/lister/pypi/__init__.py swh/lister/pypi/lister.py swh/lister/pypi/tasks.py swh/lister/pypi/tests/__init__.py swh/lister/pypi/tests/test_lister.py swh/lister/pypi/tests/test_tasks.py swh/lister/sourceforge/__init__.py swh/lister/sourceforge/lister.py swh/lister/sourceforge/tasks.py swh/lister/sourceforge/tests/__init__.py swh/lister/sourceforge/tests/test_lister.py swh/lister/sourceforge/tests/test_tasks.py 
swh/lister/sourceforge/tests/data/aaron.html swh/lister/sourceforge/tests/data/aaron.json swh/lister/sourceforge/tests/data/adobexmp.json swh/lister/sourceforge/tests/data/backapps-website.json swh/lister/sourceforge/tests/data/backapps.json swh/lister/sourceforge/tests/data/main-sitemap.xml swh/lister/sourceforge/tests/data/mojunk.json swh/lister/sourceforge/tests/data/mramm.json swh/lister/sourceforge/tests/data/ocaml-lpd.html swh/lister/sourceforge/tests/data/ocaml-lpd.json swh/lister/sourceforge/tests/data/os3dmodels.json swh/lister/sourceforge/tests/data/random-mercurial.json swh/lister/sourceforge/tests/data/subsitemap-0.xml swh/lister/sourceforge/tests/data/subsitemap-1.xml swh/lister/sourceforge/tests/data/t12eksandbox.html swh/lister/sourceforge/tests/data/t12eksandbox.json swh/lister/tests/__init__.py swh/lister/tests/test_cli.py swh/lister/tests/test_pattern.py swh/lister/tests/test_utils.py swh/lister/tuleap/__init__.py swh/lister/tuleap/lister.py swh/lister/tuleap/tasks.py swh/lister/tuleap/tests/__init__.py swh/lister/tuleap/tests/test_lister.py swh/lister/tuleap/tests/test_tasks.py swh/lister/tuleap/tests/data/https_tuleap.net/projects swh/lister/tuleap/tests/data/https_tuleap.net/repo_1 swh/lister/tuleap/tests/data/https_tuleap.net/repo_2 swh/lister/tuleap/tests/data/https_tuleap.net/repo_3 \ No newline at end of file diff --git a/swh/lister/crates/lister.py b/swh/lister/crates/lister.py index d0c6984..63604a1 100644 --- a/swh/lister/crates/lister.py +++ b/swh/lister/crates/lister.py @@ -1,145 +1,162 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json import logging from pathlib import Path import subprocess from typing import Any, Dict, Iterator, List +from urllib.parse import urlparse import iso8601 from swh.scheduler.interface import 
SchedulerInterface from swh.scheduler.model import ListedOrigin from ..pattern import CredentialsType, StatelessLister logger = logging.getLogger(__name__) # Aliasing the page results returned by `get_pages` method from the lister. CratesListerPage = List[Dict[str, Any]] class CratesLister(StatelessLister[CratesListerPage]): """List origins from the "crates.io" forge. It basically fetches https://github.com/rust-lang/crates.io-index.git to a temp directory and then walks through each file to get the crate's info. """ # Part of the lister API, that identifies this lister LISTER_NAME = "crates" # (Optional) VCS type of the origins listed by this lister, if constant - VISIT_TYPE = "rust-crate" + VISIT_TYPE = "crates" INSTANCE = "crates" INDEX_REPOSITORY_URL = "https://github.com/rust-lang/crates.io-index.git" DESTINATION_PATH = Path("/tmp/crates.io-index") CRATE_FILE_URL_PATTERN = ( "https://static.crates.io/crates/{crate}/{crate}-{version}.crate" ) + CRATE_API_URL_PATTERN = "https://crates.io/api/v1/crates/{crate}" def __init__( self, scheduler: SchedulerInterface, credentials: CredentialsType = None, ): super().__init__( scheduler=scheduler, credentials=credentials, url=self.INDEX_REPOSITORY_URL, instance=self.INSTANCE, ) def get_index_repository(self) -> None: """Get crates.io-index repository up to date running git command.""" subprocess.check_call( [ "git", "clone", self.INDEX_REPOSITORY_URL, self.DESTINATION_PATH, ] ) def get_crates_index(self) -> List[Path]: """Build a sorted list of file paths excluding dotted directories and dotted files. Each file path corresponds to a crate that lists all available versions. """ crates_index = sorted( path for path in self.DESTINATION_PATH.rglob("*") if not any(part.startswith(".") for part in path.parts) and path.is_file() and path != self.DESTINATION_PATH / "config.json" ) return crates_index def get_pages(self) -> Iterator[CratesListerPage]: """Yield an iterator sorted by name in ascending order of pages. 
Each page is a list of crate versions with: - name: Name of the crate - version: Version - checksum: Checksum - crate_file: URL of the crate file - last_update: Date of the last commit of the corresponding index file """ # Fetch crates.io index repository self.get_index_repository() # Get a list of all crates files from the index repository crates_index = self.get_crates_index() logger.debug("found %s crates in crates_index", len(crates_index)) for crate in crates_index: page = [] # %cI is for strict iso8601 date formatting last_update_str = subprocess.check_output( ["git", "log", "-1", "--pretty=format:%cI", str(crate)], cwd=self.DESTINATION_PATH, ) last_update = iso8601.parse_date(last_update_str.decode().strip()) with crate.open("rb") as current_file: for line in current_file: data = json.loads(line) # pick only the data we need page.append( dict( name=data["name"], version=data["vers"], checksum=data["cksum"], crate_file=self.CRATE_FILE_URL_PATTERN.format( crate=data["name"], version=data["vers"] ), last_update=last_update, ) ) yield page def get_origins_from_page(self, page: CratesListerPage) -> Iterator[ListedOrigin]: """Iterate on all crate pages and yield ListedOrigin instances.""" assert self.lister_obj.id is not None + url = self.CRATE_API_URL_PATTERN.format(crate=page[0]["name"]) + last_update = page[0]["last_update"] + artifacts = [] + for version in page: - yield ListedOrigin( - lister_id=self.lister_obj.id, - visit_type=self.VISIT_TYPE, - url=version["crate_file"], - last_update=version["last_update"], - extra_loader_arguments={ - "name": version["name"], - "version": version["version"], - "checksum": version["checksum"], + filename = urlparse(version["crate_file"]).path.split("/")[-1] + # Build an artifact entry following original-artifacts-json specification + # https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json # noqa: B950 + artifact = { + "filename": filename, + "checksums": { + 
"sha256": f"{version['checksum']}", }, - ) + "url": version["crate_file"], + "version": version["version"], + } + artifacts.append(artifact) + + yield ListedOrigin( + lister_id=self.lister_obj.id, + visit_type=self.VISIT_TYPE, + url=url, + last_update=last_update, + extra_loader_arguments={ + "artifacts": artifacts, + }, + ) diff --git a/swh/lister/crates/tests/test_lister.py b/swh/lister/crates/tests/test_lister.py index b92ce56..bbb1c7d 100644 --- a/swh/lister/crates/tests/test_lister.py +++ b/swh/lister/crates/tests/test_lister.py @@ -1,89 +1,114 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from pathlib import Path from swh.lister.crates.lister import CratesLister from swh.lister.crates.tests import prepare_repository_from_archive expected_origins = [ { - "name": "rand", - "version": "0.1.1", - "checksum": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d", - "url": "https://static.crates.io/crates/rand/rand-0.1.1.crate", + "url": "https://crates.io/api/v1/crates/rand", + "artifacts": [ + { + "checksums": { + "sha256": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d", # noqa: B950 + }, + "filename": "rand-0.1.1.crate", + "url": "https://static.crates.io/crates/rand/rand-0.1.1.crate", + "version": "0.1.1", + }, + { + "checksums": { + "sha256": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7", # noqa: B950 + }, + "filename": "rand-0.1.2.crate", + "url": "https://static.crates.io/crates/rand/rand-0.1.2.crate", + "version": "0.1.2", + }, + ], }, { - "name": "rand", - "version": "0.1.2", - "checksum": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7", - "url": "https://static.crates.io/crates/rand/rand-0.1.2.crate", + "url": "https://crates.io/api/v1/crates/regex", + "artifacts": [ + { + "checksums": 
{ + "sha256": "f0ff1ca641d3c9a2c30464dac30183a8b91cdcc959d616961be020cdea6255c5", # noqa: B950 + }, + "filename": "regex-0.1.0.crate", + "url": "https://static.crates.io/crates/regex/regex-0.1.0.crate", + "version": "0.1.0", + }, + { + "checksums": { + "sha256": "a07bef996bd38a73c21a8e345d2c16848b41aa7ec949e2fedffe9edf74cdfb36", # noqa: B950 + }, + "filename": "regex-0.1.1.crate", + "url": "https://static.crates.io/crates/regex/regex-0.1.1.crate", + "version": "0.1.1", + }, + { + "checksums": { + "sha256": "343bd0171ee23346506db6f4c64525de6d72f0e8cc533f83aea97f3e7488cbf9", # noqa: B950 + }, + "filename": "regex-0.1.2.crate", + "url": "https://static.crates.io/crates/regex/regex-0.1.2.crate", + "version": "0.1.2", + }, + { + "checksums": { + "sha256": "defb220c4054ca1b95fe8b0c9a6e782dda684c1bdf8694df291733ae8a3748e3", # noqa: B950 + }, + "filename": "regex-0.1.3.crate", + "url": "https://static.crates.io/crates/regex/regex-0.1.3.crate", + "version": "0.1.3", + }, + ], }, { - "name": "regex", - "version": "0.1.0", - "checksum": "f0ff1ca641d3c9a2c30464dac30183a8b91cdcc959d616961be020cdea6255c5", - "url": "https://static.crates.io/crates/regex/regex-0.1.0.crate", - }, - { - "name": "regex", - "version": "0.1.1", - "checksum": "a07bef996bd38a73c21a8e345d2c16848b41aa7ec949e2fedffe9edf74cdfb36", - "url": "https://static.crates.io/crates/regex/regex-0.1.1.crate", - }, - { - "name": "regex", - "version": "0.1.2", - "checksum": "343bd0171ee23346506db6f4c64525de6d72f0e8cc533f83aea97f3e7488cbf9", - "url": "https://static.crates.io/crates/regex/regex-0.1.2.crate", - }, - { - "name": "regex", - "version": "0.1.3", - "checksum": "defb220c4054ca1b95fe8b0c9a6e782dda684c1bdf8694df291733ae8a3748e3", - "url": "https://static.crates.io/crates/regex/regex-0.1.3.crate", - }, - { - "name": "regex-syntax", - "version": "0.1.0", - "checksum": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944", - "url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate", 
+ "url": "https://crates.io/api/v1/crates/regex-syntax", + "artifacts": [ + { + "checksums": { + "sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944", # noqa: B950 + }, + "filename": "regex-syntax-0.1.0.crate", + "url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate", + "version": "0.1.0", + }, + ], }, ] def test_crates_lister(datadir, tmp_path, swh_scheduler): archive_path = Path(datadir, "fake-crates-repository.tar.gz") repo_url = prepare_repository_from_archive( archive_path, "crates.io-index", tmp_path ) lister = CratesLister(scheduler=swh_scheduler) lister.INDEX_REPOSITORY_URL = repo_url lister.DESTINATION_PATH = tmp_path.parent / "crates.io-index-tests" res = lister.run() assert res.pages == 3 - assert res.origins == 7 + assert res.origins == 3 expected_origins_sorted = sorted(expected_origins, key=lambda x: x.get("url")) scheduler_origins_sorted = sorted( swh_scheduler.get_listed_origins(lister.lister_obj.id).results, key=lambda x: x.url, ) for scheduled, expected in zip(scheduler_origins_sorted, expected_origins_sorted): - assert scheduled.visit_type == "rust-crate" + assert scheduled.visit_type == "crates" assert scheduled.url == expected.get("url") - assert scheduled.extra_loader_arguments.get("name") == expected.get("name") - assert scheduled.extra_loader_arguments.get("version") == expected.get( - "version" - ) - assert scheduled.extra_loader_arguments.get("checksum") == expected.get( - "checksum" + assert scheduled.extra_loader_arguments.get("artifacts") == expected.get( + "artifacts" ) assert len(scheduler_origins_sorted) == len(expected_origins_sorted) diff --git a/swh/lister/maven/lister.py b/swh/lister/maven/lister.py index dce3fd2..bc1c2b6 100644 --- a/swh/lister/maven/lister.py +++ b/swh/lister/maven/lister.py @@ -1,378 +1,390 @@ # Copyright (C) 2021-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License 
version 3, or any later version # See top-level LICENSE file for more information from dataclasses import asdict, dataclass from datetime import datetime, timezone import logging import re from typing import Any, Dict, Iterator, Optional from urllib.parse import urljoin import requests from tenacity.before_sleep import before_sleep_log import xmltodict from swh.lister.utils import throttling_retry from swh.scheduler.interface import SchedulerInterface from swh.scheduler.model import ListedOrigin from .. import USER_AGENT from ..pattern import CredentialsType, Lister logger = logging.getLogger(__name__) RepoPage = Dict[str, Any] @dataclass class MavenListerState: """State of the MavenLister""" last_seen_doc: int = -1 """Last doc ID ingested during an incremental pass """ last_seen_pom: int = -1 """Last doc ID related to a pom and ingested during an incremental pass """ class MavenLister(Lister[MavenListerState, RepoPage]): """List origins from a Maven repository. Maven Central provides artifacts for Java builds. It includes POM files and source archives, which we download to get the source code of artifacts and links to their scm repository. This lister yields origins of types: git/svn/hg or whatever the Artifacts use as repository type, plus maven types for the maven loader (tgz, jar).""" LISTER_NAME = "maven" def __init__( self, scheduler: SchedulerInterface, url: str, index_url: str = None, instance: Optional[str] = None, credentials: CredentialsType = None, incremental: bool = True, ): """Lister class for Maven repositories. Args: url: main URL of the Maven repository, i.e. url of the base index used to fetch maven artifacts. For Maven central use https://repo1.maven.org/maven2/ index_url: the URL to download the exported text indexes from. Would typically be a local host running the export docker image. See README.md in this directory for more information. instance: Name of maven instance. Defaults to url's network location if unset. 
             incremental: bool, defaults to True. Defines if incremental listing
                 is activated or not.

         """
         self.BASE_URL = url
         self.INDEX_URL = index_url
         self.incremental = incremental

         super().__init__(
             scheduler=scheduler,
             credentials=credentials,
             url=url,
             instance=instance,
         )

         self.session = requests.Session()
         self.session.headers.update(
             {
                 "Accept": "application/json",
                 "User-Agent": USER_AGENT,
             }
         )

+        self.jar_origins: Dict[str, ListedOrigin] = {}
+
     def state_from_dict(self, d: Dict[str, Any]) -> MavenListerState:
         return MavenListerState(**d)

     def state_to_dict(self, state: MavenListerState) -> Dict[str, Any]:
         return asdict(state)

     @throttling_retry(before_sleep=before_sleep_log(logger, logging.WARNING))
     def page_request(self, url: str, params: Dict[str, Any]) -> requests.Response:

         logger.info("Fetching URL %s with params %s", url, params)

         response = self.session.get(url, params=params)
         if response.status_code != 200:
             logger.warning(
                 "Unexpected HTTP status code %s on %s: %s",
                 response.status_code,
                 response.url,
                 response.content,
             )
         response.raise_for_status()

         return response

     def get_pages(self) -> Iterator[RepoPage]:
         """Retrieve and parse exported maven indexes to
         identify all pom files and src archives.
         """

         # Example of returned RepoPage's:
         # [
         #   {
         #     "type": "maven",
         #     "url": "https://maven.xwiki.org/..-5.4.2-sources.jar",
         #     "time": 1626109619335,
         #     "gid": "org.xwiki.platform",
         #     "aid": "xwiki-platform-wikistream-events-xwiki",
         #     "version": "5.4.2"
         #   },
         #   {
         #     "type": "scm",
         #     "url": "scm:git:git://github.com/openengsb/openengsb-framework.git",
         #     "project": "openengsb-framework",
         #   },
         #   ...
         # ]

         # Download the main text index file.
         logger.info("Downloading computed index from %s.", self.INDEX_URL)
         assert self.INDEX_URL is not None
         response = requests.get(self.INDEX_URL, stream=True)
         if response.status_code != 200:
             logger.error("Index %s not found, stopping", self.INDEX_URL)
             response.raise_for_status()

         # Prepare regexes to parse index exports.

         # Parse doc id.
# Example line: "doc 13" re_doc = re.compile(r"^doc (?P\d+)$") # Parse gid, aid, version, classifier, extension. # Example line: " value al.aldi|sprova4j|0.1.0|sources|jar" re_val = re.compile( r"^\s{4}value (?P[^|]+)\|(?P[^|]+)\|(?P[^|]+)\|" + r"(?P[^|]+)\|(?P[^|]+)$" ) # Parse last modification time. # Example line: " value jar|1626109619335|14316|2|2|0|jar" re_time = re.compile( r"^\s{4}value ([^|]+)\|(?P[^|]+)\|([^|]+)\|([^|]+)\|([^|]+)" + r"\|([^|]+)\|([^|]+)$" ) # Read file line by line and process it out_pom: Dict = {} jar_src: Dict = {} doc_id: int = 0 jar_src["doc"] = None url_src = None iterator = response.iter_lines(chunk_size=1024) for line_bytes in iterator: # Read the index text export and get URLs and SCMs. line = line_bytes.decode(errors="ignore") m_doc = re_doc.match(line) if m_doc is not None: doc_id = int(m_doc.group("doc")) - if ( - self.incremental - and self.state - and self.state.last_seen_doc - and self.state.last_seen_doc >= doc_id - ): - # jar_src["doc"] contains the id of the current document, whatever - # its type (scm or jar). - jar_src["doc"] = None - else: - jar_src["doc"] = doc_id + # jar_src["doc"] contains the id of the current document, whatever + # its type (scm or jar). + jar_src["doc"] = doc_id else: - # If incremental mode, we don't record any line that is - # before our last recorded doc id. - if self.incremental and jar_src["doc"] is None: - continue m_val = re_val.match(line) if m_val is not None: (gid, aid, version, classifier, ext) = m_val.groups() ext = ext.strip() path = "/".join(gid.split(".")) if classifier == "NA" and ext.lower() == "pom": # If incremental mode, we don't record any line that is # before our last recorded doc id. 
                         if (
                             self.incremental
                             and self.state
                             and self.state.last_seen_pom
                             and self.state.last_seen_pom >= doc_id
                         ):
                             continue
                         url_path = f"{path}/{aid}/{version}/{aid}-{version}.{ext}"
                         url_pom = urljoin(
                             self.BASE_URL,
                             url_path,
                         )
                         out_pom[url_pom] = doc_id
                     elif (
                         classifier.lower() == "sources" or ("src" in classifier)
                     ) and ext.lower() in ("zip", "jar"):
                         url_path = (
                             f"{path}/{aid}/{version}/{aid}-{version}-{classifier}.{ext}"
                         )
                         url_src = urljoin(self.BASE_URL, url_path)
                         jar_src["gid"] = gid
                         jar_src["aid"] = aid
                         jar_src["version"] = version
                 else:
                     m_time = re_time.match(line)
                     if m_time is not None and url_src is not None:
                         time = m_time.group("mtime")
                         jar_src["time"] = int(time)
                         artifact_metadata_d = {
                             "type": "maven",
                             "url": url_src,
                             **jar_src,
                         }
                         logger.debug(
                             "* Yielding jar %s: %s", url_src, artifact_metadata_d
                         )
                         yield artifact_metadata_d
                         url_src = None

         logger.info("Found %s poms.", len(out_pom))

         # Now fetch pom files and scan them for scm info.

         logger.info("Fetching poms..")
         for pom in out_pom:
             try:
                 response = self.page_request(pom, {})
                 project = xmltodict.parse(response.content.decode())
                 project_d = project.get("project", {})
                 scm_d = project_d.get("scm")
                 if scm_d is not None:
                     connection = scm_d.get("connection")
                     if connection is not None:
-                        scm = connection
-                        gid = project_d["groupId"]
-                        aid = project_d["artifactId"]
                         artifact_metadata_d = {
                             "type": "scm",
                             "doc": out_pom[pom],
-                            "url": scm,
-                            "project": f"{gid}.{aid}",
+                            "url": connection,
                         }
                         logger.debug("* Yielding pom %s: %s", pom, artifact_metadata_d)
                         yield artifact_metadata_d
                     else:
                         logger.debug("No scm.connection in pom %s", pom)
                 else:
                     logger.debug("No scm in pom %s", pom)
             except requests.HTTPError:
                 logger.warning(
                     "POM info page could not be fetched, skipping project '%s'",
                     pom,
                 )
             except xmltodict.expat.ExpatError as error:
                 logger.info("Could not parse POM %s XML: %s. Next.", pom, error)

     def get_origins_from_page(self, page: RepoPage) -> Iterator[ListedOrigin]:
         """Convert a page of Maven repositories into a list of ListedOrigins."""
         assert self.lister_obj.id is not None
         scm_types_ok = ("git", "svn", "hg", "cvs", "bzr")
         if page["type"] == "scm":
             # If origin is a scm url: detect scm type and yield.
             # Note that the official format is:
             # scm:git:git://github.com/openengsb/openengsb-framework.git
             # but many, many projects directly put the repo url, so we have to
             # detect the content to match it properly.
             m_scm = re.match(r"^scm:(?P<type>[^:]+):(?P<url>.*)$", page["url"])
             if m_scm is not None:
                 scm_type = m_scm.group("type")
                 if scm_type in scm_types_ok:
                     scm_url = m_scm.group("url")
                     origin = ListedOrigin(
                         lister_id=self.lister_obj.id,
                         url=scm_url,
                         visit_type=scm_type,
                     )
                     yield origin
             else:
                 if page["url"].endswith(".git"):
                     origin = ListedOrigin(
                         lister_id=self.lister_obj.id,
                         url=page["url"],
                         visit_type="git",
                     )
                     yield origin
         else:
-            # Origin is a source archive:
+            # Origin is gathering source archives:
             last_update_dt = None
             last_update_iso = ""
-            last_update_seconds = str(page["time"])[:-3]
             try:
+                last_update_seconds = str(page["time"])[:-3]
                 last_update_dt = datetime.fromtimestamp(int(last_update_seconds))
-                last_update_dt_tz = last_update_dt.astimezone(timezone.utc)
-            except OverflowError:
+                last_update_dt = last_update_dt.astimezone(timezone.utc)
+            except (OverflowError, ValueError):
                 logger.warning("- Failed to convert datetime %s.", last_update_seconds)
             if last_update_dt:
-                last_update_iso = last_update_dt_tz.isoformat()
-            origin = ListedOrigin(
-                lister_id=self.lister_obj.id,
-                url=page["url"],
-                visit_type=page["type"],
-                last_update=last_update_dt_tz,
-                extra_loader_arguments={
-                    "artifacts": [
-                        {
-                            "time": last_update_iso,
-                            "gid": page["gid"],
-                            "aid": page["aid"],
-                            "version": page["version"],
-                            "base_url": self.BASE_URL,
-                        }
-                    ]
-                },
-            )
-            yield origin
+                last_update_iso = last_update_dt.isoformat()
+
+            # Origin URL will target page holding sources for all versions of
+            # an artifactId (package name) inside a groupId (namespace)
+            path = "/".join(page["gid"].split("."))
+            origin_url = urljoin(self.BASE_URL, f"{path}/{page['aid']}")
+
+            artifact = {
+                **{k: v for k, v in page.items() if k != "doc"},
+                "time": last_update_iso,
+                "base_url": self.BASE_URL,
+            }
+
+            if origin_url not in self.jar_origins:
+                # Create ListedOrigin instance if we did not see that origin yet
+                jar_origin = ListedOrigin(
+                    lister_id=self.lister_obj.id,
+                    url=origin_url,
+                    visit_type=page["type"],
+                    last_update=last_update_dt,
+                    extra_loader_arguments={"artifacts": [artifact]},
+                )
+                self.jar_origins[origin_url] = jar_origin
+            else:
+                # Update list of source artifacts for that origin otherwise
+                jar_origin = self.jar_origins[origin_url]
+                artifacts = jar_origin.extra_loader_arguments["artifacts"]
+                if artifact not in artifacts:
+                    artifacts.append(artifact)
+
+                if (
+                    jar_origin.last_update
+                    and last_update_dt
+                    and last_update_dt > jar_origin.last_update
+                ):
+                    jar_origin.last_update = last_update_dt
+
+            if not self.incremental or (
+                self.state and page["doc"] > self.state.last_seen_doc
+            ):
+                # Yield origin with updated source artifacts, multiple instances of
+                # ListedOrigin for the same origin URL but with different artifacts
+                # list will be sent to the scheduler but it will deduplicate them and
+                # take the latest one to upsert in database
+                yield jar_origin

     def commit_page(self, page: RepoPage) -> None:
         """Update currently stored state using the latest listed doc.

         Note: this is a noop for full listing mode
         """
         if self.incremental and self.state:
             # We need to differentiate the two state counters according
             # to the type of origin.
if page["type"] == "maven" and page["doc"] > self.state.last_seen_doc: self.state.last_seen_doc = page["doc"] elif page["type"] == "scm" and page["doc"] > self.state.last_seen_pom: self.state.last_seen_doc = page["doc"] self.state.last_seen_pom = page["doc"] def finalize(self) -> None: """Finalize the lister state, set update if any progress has been made. Note: this is a noop for full listing mode """ if self.incremental and self.state: last_seen_doc = self.state.last_seen_doc last_seen_pom = self.state.last_seen_pom scheduler_state = self.get_state_from_scheduler() if last_seen_doc and last_seen_pom: if (scheduler_state.last_seen_doc < last_seen_doc) or ( scheduler_state.last_seen_pom < last_seen_pom ): self.updated = True diff --git a/swh/lister/maven/tests/data/http_indexes/export.fld b/swh/lister/maven/tests/data/http_indexes/export.fld deleted file mode 100755 index c8e64b0..0000000 --- a/swh/lister/maven/tests/data/http_indexes/export.fld +++ /dev/null @@ -1,113 +0,0 @@ -doc 0 - field 0 - name u - type string - value al.aldi|sprova4j|0.1.0|sources|jar - field 1 - name m - type string - value 1626111735737 - field 2 - name i - type string - value jar|1626109619335|14316|2|2|0|jar - field 10 - name n - type string - value sprova4j - field 11 - name d - type string - value Java client for Sprova Test Management -doc 1 - field 0 - name u - type string - value al.aldi|sprova4j|0.1.0|NA|pom - field 1 - name m - type string - value 1626111735764 - field 2 - name i - type string - value jar|1626109636636|-1|1|0|0|pom - field 10 - name n - type string - value sprova4j - field 11 - name d - type string - value Java client for Sprova Test Management -doc 2 - field 0 - name u - type string - value al.aldi|sprova4j|0.1.1|sources|jar - field 1 - name m - type string - value 1626111784883 - field 2 - name i - type string - value jar|1626111425534|14510|2|2|0|jar - field 10 - name n - type string - value sprova4j - field 11 - name d - type string - value Java client for 
Sprova Test Management -doc 3 - field 0 - name u - type string - value al.aldi|sprova4j|0.1.1|NA|pom - field 1 - name m - type string - value 1626111784915 - field 2 - name i - type string - value jar|1626111437014|-1|1|0|0|pom - field 10 - name n - type string - value sprova4j - field 11 - name d - type string - value Java client for Sprova Test Management -doc 4 - field 14 - name DESCRIPTOR - type string - value NexusIndex - field 15 - name IDXINFO - type string - value 1.0|index -doc 5 - field 16 - name allGroups - type string - value allGroups - field 17 - name allGroupsList - type string - value al.aldi -doc 6 - field 18 - name rootGroups - type string - value rootGroups - field 19 - name rootGroupsList - type string - value al -END -checksum 00000000003321211082 diff --git a/swh/lister/maven/tests/data/http_indexes/export_incr.fld b/swh/lister/maven/tests/data/http_indexes/export_full.fld old mode 100755 new mode 100644 similarity index 100% rename from swh/lister/maven/tests/data/http_indexes/export_incr.fld rename to swh/lister/maven/tests/data/http_indexes/export_full.fld diff --git a/swh/lister/maven/tests/data/http_indexes/export_incr_first.fld b/swh/lister/maven/tests/data/http_indexes/export_incr_first.fld new file mode 100644 index 0000000..c943c2f --- /dev/null +++ b/swh/lister/maven/tests/data/http_indexes/export_incr_first.fld @@ -0,0 +1,42 @@ +doc 0 + field 0 + name u + type string + value al.aldi|sprova4j|0.1.0|sources|jar + field 1 + name m + type string + value 1633786348254 + field 2 + name i + type string + value jar|1626109619335|14316|2|2|0|jar + field 10 + name n + type string + value sprova4j + field 11 + name d + type string + value Java client for Sprova Test Management +doc 1 + field 0 + name u + type string + value al.aldi|sprova4j|0.1.0|NA|pom + field 1 + name m + type string + value 1633786348271 + field 2 + name i + type string + value jar|1626109636636|-1|1|0|0|pom + field 10 + name n + type string + value sprova4j + field 11 + 
name d + type string + value Java client for Sprova Test Management diff --git a/swh/lister/maven/tests/data/http_indexes/export_null_mtime.fld b/swh/lister/maven/tests/data/http_indexes/export_null_mtime.fld new file mode 100644 index 0000000..7798a5b --- /dev/null +++ b/swh/lister/maven/tests/data/http_indexes/export_null_mtime.fld @@ -0,0 +1,21 @@ +doc 0 + field 0 + name u + type string + value al.aldi|sprova4j|0.1.0|sources|jar + field 1 + name m + type string + value 1633786348254 + field 2 + name i + type string + value jar|0|14316|2|2|0|jar + field 10 + name n + type string + value sprova4j + field 11 + name d + type string + value Java client for Sprova Test Management diff --git a/swh/lister/maven/tests/test_lister.py b/swh/lister/maven/tests/test_lister.py index 267da95..d8e30ab 100644 --- a/swh/lister/maven/tests/test_lister.py +++ b/swh/lister/maven/tests/test_lister.py @@ -1,327 +1,319 @@ -# Copyright (C) 2021 The Software Heritage developers +# Copyright (C) 2021-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -from datetime import timezone from pathlib import Path import iso8601 import pytest import requests from swh.lister.maven.lister import MavenLister MVN_URL = "https://repo1.maven.org/maven2/" # main maven repo url INDEX_URL = "http://indexes/export.fld" # index directory url URL_POM_1 = MVN_URL + "al/aldi/sprova4j/0.1.0/sprova4j-0.1.0.pom" URL_POM_2 = MVN_URL + "al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom" URL_POM_3 = MVN_URL + "com/arangodb/arangodb-graphql/1.2/arangodb-graphql-1.2.pom" LIST_GIT = ( "git://github.com/aldialimucaj/sprova4j.git", "https://github.com/aldialimucaj/sprova4j.git", ) LIST_GIT_INCR = ("git://github.com/ArangoDB-Community/arangodb-graphql-java.git",) -LIST_SRC = ( - MVN_URL + "al/aldi/sprova4j/0.1.0/sprova4j-0.1.0-sources.jar", - MVN_URL + 
"al/aldi/sprova4j/0.1.1/sprova4j-0.1.1-sources.jar", -) +LIST_SRC = (MVN_URL + "al/aldi/sprova4j",) LIST_SRC_DATA = ( { "type": "maven", "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j" + "/0.1.0/sprova4j-0.1.0-sources.jar", "time": "2021-07-12T17:06:59+00:00", "gid": "al.aldi", "aid": "sprova4j", "version": "0.1.0", + "base_url": MVN_URL, }, { "type": "maven", "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j" + "/0.1.1/sprova4j-0.1.1-sources.jar", "time": "2021-07-12T17:37:05+00:00", "gid": "al.aldi", "aid": "sprova4j", "version": "0.1.1", + "base_url": MVN_URL, }, ) @pytest.fixture -def maven_index(datadir) -> str: - return Path(datadir, "http_indexes", "export.fld").read_text() +def maven_index_full(datadir) -> str: + return Path(datadir, "http_indexes", "export_full.fld").read_text() @pytest.fixture -def maven_index_incr(datadir) -> str: - return Path(datadir, "http_indexes", "export_incr.fld").read_text() +def maven_index_incr_first(datadir) -> str: + return Path(datadir, "http_indexes", "export_incr_first.fld").read_text() @pytest.fixture def maven_pom_1(datadir) -> str: return Path(datadir, "https_maven.org", "sprova4j-0.1.0.pom").read_text() +@pytest.fixture +def maven_index_null_mtime(datadir) -> str: + return Path(datadir, "http_indexes", "export_null_mtime.fld").read_text() + + @pytest.fixture def maven_pom_1_malformed(datadir) -> str: return Path(datadir, "https_maven.org", "sprova4j-0.1.0.malformed.pom").read_text() @pytest.fixture def maven_pom_2(datadir) -> str: return Path(datadir, "https_maven.org", "sprova4j-0.1.1.pom").read_text() @pytest.fixture def maven_pom_3(datadir) -> str: return Path(datadir, "https_maven.org", "arangodb-graphql-1.2.pom").read_text() -def test_maven_full_listing( - swh_scheduler, - requests_mock, - mocker, - maven_index, - maven_pom_1, - maven_pom_2, +@pytest.fixture(autouse=True) +def network_requests_mock( + requests_mock, maven_index_full, maven_pom_1, maven_pom_2, maven_pom_3 ): + 
requests_mock.get(INDEX_URL, text=maven_index_full) + requests_mock.get(URL_POM_1, text=maven_pom_1) + requests_mock.get(URL_POM_2, text=maven_pom_2) + requests_mock.get(URL_POM_3, text=maven_pom_3) + + +def test_maven_full_listing(swh_scheduler): """Covers full listing of multiple pages, checking page results and listed origins, statelessness.""" + # Run the lister. lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=False, ) - # Set up test. - index_text = maven_index - requests_mock.get(INDEX_URL, text=index_text) - requests_mock.get(URL_POM_1, text=maven_pom_1) - requests_mock.get(URL_POM_2, text=maven_pom_2) - - # Then run the lister. stats = lister.run() # Start test checks. - assert stats.pages == 4 - assert stats.origins == 4 + assert stats.pages == 5 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results - origin_urls = [origin.url for origin in scheduler_origins] - assert sorted(origin_urls) == sorted(LIST_GIT + LIST_SRC) + + # 3 git origins + 1 maven origin with 2 releases (one per jar) + assert len(origin_urls) == 4 + assert sorted(origin_urls) == sorted(LIST_GIT + LIST_GIT_INCR + LIST_SRC) for origin in scheduler_origins: if origin.visit_type == "maven": for src in LIST_SRC_DATA: - if src.get("url") == origin.url: - last_update_src = iso8601.parse_date(src.get("time")).astimezone( - tz=timezone.utc - ) - assert last_update_src == origin.last_update - artifact = origin.extra_loader_arguments["artifacts"][0] - assert src.get("time") == artifact["time"] - assert src.get("gid") == artifact["gid"] - assert src.get("aid") == artifact["aid"] - assert src.get("version") == artifact["version"] - assert MVN_URL == artifact["base_url"] - break - else: - raise AssertionError( - "Could not find scheduler origin in referenced origins." 
- ) + last_update_src = iso8601.parse_date(src["time"]) + assert last_update_src <= origin.last_update + assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) + scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == -1 assert scheduler_state.last_seen_pom == -1 def test_maven_full_listing_malformed( swh_scheduler, requests_mock, - mocker, - maven_index, maven_pom_1_malformed, - maven_pom_2, ): """Covers full listing of multiple pages, checking page results with a malformed scm entry in pom.""" lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=False, ) # Set up test. - index_text = maven_index - requests_mock.get(INDEX_URL, text=index_text) requests_mock.get(URL_POM_1, text=maven_pom_1_malformed) - requests_mock.get(URL_POM_2, text=maven_pom_2) # Then run the lister. stats = lister.run() # Start test checks. - assert stats.pages == 4 - assert stats.origins == 3 + assert stats.pages == 5 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results - origin_urls = [origin.url for origin in scheduler_origins] - LIST_SRC_1 = ("https://github.com/aldialimucaj/sprova4j.git",) - assert sorted(origin_urls) == sorted(LIST_SRC_1 + LIST_SRC) + + # 2 git origins + 1 maven origin with 2 releases (one per jar) + assert len(origin_urls) == 3 + assert sorted(origin_urls) == sorted((LIST_GIT[1],) + LIST_GIT_INCR + LIST_SRC) for origin in scheduler_origins: if origin.visit_type == "maven": for src in LIST_SRC_DATA: - if src.get("url") == origin.url: - artifact = origin.extra_loader_arguments["artifacts"][0] - assert src.get("time") == artifact["time"] - assert src.get("gid") == artifact["gid"] - assert src.get("aid") == artifact["aid"] - assert src.get("version") == artifact["version"] - assert MVN_URL == artifact["base_url"] - break - else: - raise AssertionError( - "Could not find scheduler origin in 
referenced origins." - ) + last_update_src = iso8601.parse_date(src["time"]) + assert last_update_src <= origin.last_update + assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA) + scheduler_state = lister.get_state_from_scheduler() assert scheduler_state is not None assert scheduler_state.last_seen_doc == -1 assert scheduler_state.last_seen_pom == -1 def test_maven_incremental_listing( swh_scheduler, requests_mock, - mocker, - maven_index, - maven_index_incr, - maven_pom_1, - maven_pom_2, - maven_pom_3, + maven_index_full, + maven_index_incr_first, ): """Covers full listing of multiple pages, checking page results and listed origins, with a second updated run for statefulness.""" lister = MavenLister( scheduler=swh_scheduler, url=MVN_URL, instance="maven.org", index_url=INDEX_URL, incremental=True, ) # Set up test. - requests_mock.get(INDEX_URL, text=maven_index) - requests_mock.get(URL_POM_1, text=maven_pom_1) - requests_mock.get(URL_POM_2, text=maven_pom_2) + requests_mock.get(INDEX_URL, text=maven_index_incr_first) # Then run the lister. stats = lister.run() # Start test checks. 
     assert lister.incremental
     assert lister.updated
-    assert stats.pages == 4
-    assert stats.origins == 4
+    assert stats.pages == 2
+
+    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
+    origin_urls = [origin.url for origin in scheduler_origins]
+
+    # 1 git origins + 1 maven origin with 1 release (one per jar)
+    assert len(origin_urls) == 2
+    assert sorted(origin_urls) == sorted((LIST_GIT[0],) + LIST_SRC)
+
+    for origin in scheduler_origins:
+        if origin.visit_type == "maven":
+            last_update_src = iso8601.parse_date(LIST_SRC_DATA[0]["time"])
+            assert last_update_src == origin.last_update
+            assert origin.extra_loader_arguments["artifacts"] == [LIST_SRC_DATA[0]]

     # Second execution of the lister, incremental mode
     lister = MavenLister(
         scheduler=swh_scheduler,
         url=MVN_URL,
         instance="maven.org",
         index_url=INDEX_URL,
         incremental=True,
     )

     scheduler_state = lister.get_state_from_scheduler()
     assert scheduler_state is not None
-    assert scheduler_state.last_seen_doc == 3
-    assert scheduler_state.last_seen_pom == 3
+    assert scheduler_state.last_seen_doc == 1
+    assert scheduler_state.last_seen_pom == 1

     # Set up test.
-    requests_mock.get(INDEX_URL, text=maven_index_incr)
-    requests_mock.get(URL_POM_3, text=maven_pom_3)
+    requests_mock.get(INDEX_URL, text=maven_index_full)

     # Then run the lister.
     stats = lister.run()

     # Start test checks.
     assert lister.incremental
     assert lister.updated
-    assert stats.pages == 1
-    assert stats.origins == 1
+    assert stats.pages == 4

     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
-
     origin_urls = [origin.url for origin in scheduler_origins]
+
     assert sorted(origin_urls) == sorted(LIST_SRC + LIST_GIT + LIST_GIT_INCR)

     for origin in scheduler_origins:
         if origin.visit_type == "maven":
             for src in LIST_SRC_DATA:
-                if src.get("url") == origin.url:
-                    artifact = origin.extra_loader_arguments["artifacts"][0]
-                    assert src.get("time") == artifact["time"]
-                    assert src.get("gid") == artifact["gid"]
-                    assert src.get("aid") == artifact["aid"]
-                    assert src.get("version") == artifact["version"]
-                    break
-            else:
-                raise AssertionError
+                last_update_src = iso8601.parse_date(src["time"])
+                assert last_update_src <= origin.last_update
+            assert origin.extra_loader_arguments["artifacts"] == list(LIST_SRC_DATA)

     scheduler_state = lister.get_state_from_scheduler()
     assert scheduler_state is not None
     assert scheduler_state.last_seen_doc == 4
     assert scheduler_state.last_seen_pom == 4


 @pytest.mark.parametrize("http_code", [400, 404, 500, 502])
-def test_maven_list_http_error_on_index_read(
-    swh_scheduler, requests_mock, mocker, maven_index, http_code
-):
+def test_maven_list_http_error_on_index_read(swh_scheduler, requests_mock, http_code):
     """should stop listing if the lister fails to retrieve the main index url."""

     lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL)
     requests_mock.get(INDEX_URL, status_code=http_code)
     with pytest.raises(requests.HTTPError):  # listing cannot continues so stop
         lister.run()

     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
     assert len(scheduler_origins) == 0


 @pytest.mark.parametrize("http_code", [400, 404, 500, 502])
 def test_maven_list_http_error_artifacts(
-    swh_scheduler, requests_mock, mocker, maven_index, http_code, maven_pom_2
+    swh_scheduler,
+    requests_mock,
+    http_code,
 ):
     """should continue listing when failing to retrieve artifacts."""

     # Test failure of artefacts retrieval.
-    requests_mock.get(INDEX_URL, text=maven_index)
     requests_mock.get(URL_POM_1, status_code=http_code)
-    requests_mock.get(URL_POM_2, text=maven_pom_2)

     lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL)

     # on artifacts though, that raises but continue listing
     lister.run()

-    # If the maven_index step succeeded but not the get_pom step,
-    # then we get only the 2 maven-jar origins (and not the 2 additional
-    # src origins).
+    # If the maven_index_full step succeeded but not the get_pom step,
+    # then we get only one maven-jar origin and one git origin.
     scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
     assert len(scheduler_origins) == 3
+
+
+def test_maven_lister_null_mtime(swh_scheduler, requests_mock, maven_index_null_mtime):
+
+    requests_mock.get(INDEX_URL, text=maven_index_null_mtime)
+
+    # Run the lister.
+    lister = MavenLister(
+        scheduler=swh_scheduler,
+        url=MVN_URL,
+        instance="maven.org",
+        index_url=INDEX_URL,
+        incremental=False,
+    )
+
+    stats = lister.run()
+
+    # Start test checks.
+    assert stats.pages == 1
+    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
+    assert len(scheduler_origins) == 1
+    assert scheduler_origins[0].last_update is None
diff --git a/tox.ini b/tox.ini
index 137d6bb..4c64def 100644
--- a/tox.ini
+++ b/tox.ini
@@ -1,78 +1,78 @@
 [tox]
 envlist=black,flake8,mypy,py3

 [testenv]
 extras =
   testing
 deps =
   swh.core[http] >= 0.0.61
   swh.scheduler[testing] >= 0.5.0
   amqp != 5.0.4
   pytest-cov
   dev: ipdb
 commands =
   pytest \
     !dev: --cov={envsitepackagesdir}/swh/lister/ --cov-branch \
     --doctest-modules \
     {envsitepackagesdir}/swh/lister/ {posargs}

 [testenv:black]
 skip_install = true
 deps =
   black==22.3.0
 commands =
   {envpython} -m black --check swh

 [testenv:flake8]
 skip_install = true
 deps =
   flake8==4.0.1
   flake8-bugbear==22.3.23
 commands =
   {envpython} -m flake8

 [testenv:mypy]
 extras =
   testing
 deps =
-  mypy==0.920
+  mypy==0.942
 commands =
   mypy swh

 # build documentation outside swh-environment using the current
 # git HEAD of swh-docs, is executed on CI for each diff to prevent
 # breaking doc build
 [testenv:sphinx]
 whitelist_externals = make
 usedevelop = true
 extras =
   testing
 deps =
   # fetch and install swh-docs in develop mode
   -e git+https://forge.softwareheritage.org/source/swh-docs#egg=swh.docs
 setenv =
   SWH_PACKAGE_DOC_TOX_BUILD = 1
   # turn warnings into errors
   SPHINXOPTS = -W
 commands =
   make -I ../.tox/sphinx/src/swh-docs/swh/ -C docs

 # build documentation only inside swh-environment using local state
 # of swh-docs package
 [testenv:sphinx-dev]
 whitelist_externals = make
 usedevelop = true
 extras =
   testing
 deps =
   # install swh-docs in develop mode
   -e ../swh-docs
 setenv =
   SWH_PACKAGE_DOC_TOX_BUILD = 1
   # turn warnings into errors
   SPHINXOPTS = -W
 commands =
   make -I ../.tox/sphinx-dev/src/swh-docs/swh/ -C docs
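Editor's note, not part of the patch: the `get_origins_from_page` hunk in `swh/lister/maven/lister.py` keeps the lister's handling of SCM connection strings such as `scm:git:git://github.com/openengsb/openengsb-framework.git`, including the fallback for POMs that put a bare repository URL in `<scm><connection>`. The following standalone sketch reproduces that parsing logic for illustration; the helper name `parse_scm_connection` is invented here and does not exist in the codebase.

```python
import re
from typing import Optional, Tuple

# Visit types the lister accepts, as listed in get_origins_from_page.
SCM_TYPES_OK = ("git", "svn", "hg", "cvs", "bzr")


def parse_scm_connection(connection: str) -> Optional[Tuple[str, str]]:
    """Return (visit_type, repository_url) for an SCM connection string,
    or None when neither the scm:<type>:<url> form nor a bare .git URL
    can be recognized."""
    m = re.match(r"^scm:(?P<type>[^:]+):(?P<url>.*)$", connection)
    if m is not None:
        if m.group("type") in SCM_TYPES_OK:
            return m.group("type"), m.group("url")
        # Well-formed scm: prefix but an unsupported type: skip it.
        return None
    # Many projects put the bare repo url in <scm><connection>,
    # so fall back to a .git suffix check.
    if connection.endswith(".git"):
        return "git", connection
    return None
```

The sketch mirrors the regex used in the lister (`^scm:(?P<type>[^:]+):(?P<url>.*)$`) and the `.git`-suffix fallback branch.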
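Editor's note, not part of the patch: the widened `except (OverflowError, ValueError)` in the lister, together with the new `export_null_mtime.fld` fixture and `test_maven_lister_null_mtime`, covers index entries whose mtime is `0` — stripping the last three digits of `"0"` yields an empty string, and `int("")` raises `ValueError`. A minimal sketch of that conversion, with the helper name `mtime_to_utc` invented for illustration:

```python
from datetime import datetime, timezone
from typing import Optional


def mtime_to_utc(time_ms: int) -> Optional[datetime]:
    """Convert a millisecond timestamp from the index export to a
    timezone-aware UTC datetime, or None when conversion fails."""
    # The index stores milliseconds; drop the last three digits to get seconds.
    last_update_seconds = str(time_ms)[:-3]
    try:
        dt = datetime.fromtimestamp(int(last_update_seconds))
        return dt.astimezone(timezone.utc)
    except (OverflowError, ValueError):
        # A null mtime ("0" -> "") triggers ValueError; the lister logs
        # a warning and leaves last_update unset in this case.
        return None
```

With `time_ms=0` the helper returns `None`, matching the new test's expectation that the resulting origin has `last_update is None`.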