diff --git a/PKG-INFO b/PKG-INFO
index 09fe8d7..a1e696e 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,125 +1,125 @@
Metadata-Version: 2.1
Name: swh.lister
-Version: 4.0.1
+Version: 4.1.0
Summary: Software Heritage lister
Home-page: https://forge.softwareheritage.org/diffusion/DLSGH/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-lister
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-lister/
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: testing
License-File: LICENSE

swh-lister
==========

This component of the Software Heritage stack aims to produce listings of software origins and their URLs hosted on various public developer platforms or package managers. As these operations are quite similar, it provides a set of Python modules abstracting common software origins listing behaviors.

It also provides several lister implementations, contained in the following Python modules:

- `swh.lister.bitbucket`
- `swh.lister.cgit`
- `swh.lister.cran`
- `swh.lister.debian`
- `swh.lister.gitea`
- `swh.lister.github`
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.golang`
- `swh.lister.launchpad`
- `swh.lister.maven`
- `swh.lister.npm`
- `swh.lister.packagist`
- `swh.lister.phabricator`
- `swh.lister.pypi`
- `swh.lister.tuleap`
- `swh.lister.gogs`

Dependencies
------------

All required dependencies can be found in the `requirements*.txt` files located at the root of the repository.
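The preparation steps in the Local deployment section below amount to creating a single YAML configuration file. As a minimal sketch of those steps (writing into a throwaway directory here instead of `~/.config/swh`, so the example has no side effects on a real home directory):

```shell
# Sketch of the lister preparation steps: create the config directory
# and write the minimal scheduler configuration shared by all listers.
# CFG_DIR stands in for ~/.config/swh; a temporary directory is used here.
CFG_DIR="$(mktemp -d)/swh"
mkdir -p "$CFG_DIR"
cat > "$CFG_DIR/listers.yml" <<'EOF'
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
EOF
```

For an actual deployment, point `CFG_DIR` at `~/.config/swh` and pass the resulting file to the CLI with `-C`, as shown in the execution examples below.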
Local deployment
----------------

## lister configuration

Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`, `gitea`, `github`, `gitlab`, `gnu`, `golang`, `gogs`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`) must be configured by following the instructions below (note that you have to replace `` with one of the lister names introduced above).

### Preparation steps

1. `mkdir ~/.config/swh/`
2. create configuration file `~/.config/swh/listers.yml`

### Configuration file sample

Minimal configuration shared by all listers, to add to the file `~/.config/swh/listers.yml`:

```lang=yml
scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'

credentials: {}
```

Note: this expects the scheduler service to be running locally on port 5008.

## Executing a lister

Once configured, a lister can be executed using the `swh` CLI tool with the following options and commands:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister [lister_parameters]
```

Examples:

```
$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister bitbucket

$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister cran

$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitea url=https://codeberg.org/api/v1/

$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister gitlab url=https://salsa.debian.org/api/v4/

$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister npm

$ swh --log-level DEBUG lister -C ~/.config/swh/listers.yml run --lister pypi
```

Licensing
---------

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See the top-level LICENSE file for the full text of the GNU General Public License along with this program.

diff --git a/debian/changelog b/debian/changelog
index 1ea9eed..c920602 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,1257 +1,1267 @@
-swh-lister (4.0.1-1~swh1~bpo10+1) buster-swh; urgency=medium
+swh-lister (4.1.0-1~swh1) unstable-swh; urgency=medium

-  * Rebuild for buster-swh
-
- -- Software Heritage autobuilder (on jenkins-debian1)  Mon, 24 Oct 2022 09:53:15 +0000
+  * New upstream release 4.1.0 - (tagged by Antoine R. Dumont (@ardumont) on 2022-10-26 14:35:02 +0200)
+  * Upstream changes: - v4.1.0 - nixguix: Use content-disposition from http head request if provided - nixguix: Deal with edge case url with version instead of extension - Puppet: Switch artifacts from dict to list - nixguix: Allow lister to ignore specific extensions - nixguix/test: Add all supported tarball extensions to test manifest - conda: Yield listed origins after all artifacts in a page are processed - Add support for more tarball recognition based on extensions
+
+ -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 26 Oct 2022 12:41:59 +0000

swh-lister (4.0.1-1~swh1) unstable-swh; urgency=medium

  * New upstream release 4.0.1 - (tagged by Antoine R.
Dumont (@ardumont) on 2022-10-24 11:41:49 +0200) * Upstream changes: - v4.0.1 - gogs/lister: Allow public gogs instance listing -- Software Heritage autobuilder (on jenkins-debian1) Mon, 24 Oct 2022 09:50:06 +0000 swh-lister (4.0.0-1~swh1) unstable-swh; urgency=medium * New upstream release 4.0.0 - (tagged by Antoine Lambert on 2022-10-18 12:18:48 +0200) * Upstream changes: - version 4.0.0 -- Software Heritage autobuilder (on jenkins-debian1) Tue, 18 Oct 2022 10:25:55 +0000 swh-lister (3.0.2-1~swh1) unstable-swh; urgency=medium * New upstream release 3.0.2 - (tagged by Vincent SELLIER on 2022-09-20 17:10:16 +0200) * Upstream changes: - v3.0.2 - Changelog: - * 2022-09-20 cgit: Ensure the clone url is searched on the right tab - * 2022-09-20 gogs: Skip pages with error 500 -- Software Heritage autobuilder (on jenkins-debian1) Tue, 20 Sep 2022 15:17:33 +0000 swh-lister (3.0.1-1~swh1) unstable-swh; urgency=medium * New upstream release 3.0.1 - (tagged by Antoine R. Dumont (@ardumont) on 2022-09-20 11:32:36 +0200) * Upstream changes: - v3.0.1 - golang: Update lister name - arch: Set log level to debug for URL requests - arch: Use tempfile module to create temporary directory - pubdev.lister: Decrease verbosity -- Software Heritage autobuilder (on jenkins-debian1) Tue, 20 Sep 2022 09:39:07 +0000 swh-lister (3.0.0-1~swh2) unstable-swh; urgency=medium * Fix build dependencies and bump new release -- Antoine R. Dumont (@ardumont) Thu, 08 Sep 2022 11:57:38 +0200 swh-lister (3.0.0-1~swh1) unstable-swh; urgency=medium * New upstream release 3.0.0 - (tagged by Antoine R.
Dumont (@ardumont) on 2022-09-08 11:19:49 +0200) * Upstream changes: - v3.0.0 - Add new lister pubdev (Dart, Flutter) - Add new lister Arch User Repository (AUR) - Add new lister Golang - Add new lister Bower - Add new lister Gogs - maven: Use BeautifulSoup instead of xmltodict for parsing pom files - crates.lister: Implement incremental mode -- Software Heritage autobuilder (on jenkins-debian1) Thu, 08 Sep 2022 09:27:56 +0000 swh-lister (2.9.3-1~swh1) unstable-swh; urgency=medium * New upstream release 2.9.3 - (tagged by Antoine R. Dumont (@ardumont) on 2022-05-23 15:39:15 +0200) * Upstream changes: - v2.9.3 - Adapt maven lister to list canonical gh urls if any - Use swh.core.github.pytest_plugin in github tests -- Software Heritage autobuilder (on jenkins-debian1) Mon, 23 May 2022 13:47:34 +0000 swh-lister (2.9.2-1~swh1) unstable-swh; urgency=medium * New upstream release 2.9.2 - (tagged by Antoine R. Dumont (@ardumont) on 2022-05-10 10:22:12 +0200) * Upstream changes: - v2.9.2 - maven: Prevent UnicodeDecodeError when processing pom file -- Software Heritage autobuilder (on jenkins-debian1) Tue, 10 May 2022 08:27:22 +0000 swh-lister (2.9.1-1~swh1) unstable-swh; urgency=medium * New upstream release 2.9.1 - (tagged by Antoine R. 
Dumont (@ardumont) on 2022-04-29 14:45:18 +0200) * Upstream changes: - v2.9.1 - crates: Create one origin per package instead of per version - maven: Handle null mtime value in index for jar archive - maven: Remove extraction of groupId and artifactId from pom files - maven: Create one origin per package instead of one per package version - Bump mypy to v0.942 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 29 Apr 2022 12:50:29 +0000 swh-lister (2.9.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.9.0 - (tagged by Valentin Lorentz on 2022-04-26 11:28:55 +0200) * Upstream changes: - v2.9.0 - * github: Remove dead code - * github: Refactor rate-limiting out of the GitHubLister class - * maven: Remove duplicated code related to setting instance from netloc -- Software Heritage autobuilder (on jenkins-debian1) Tue, 26 Apr 2022 09:34:54 +0000 swh-lister (2.8.2-1~swh1) unstable-swh; urgency=medium * New upstream release 2.8.2 - (tagged by Antoine R. Dumont (@ardumont) on 2022-04-25 12:34:14 +0200) * Upstream changes: - v2.8.2 - sourceforge: Fix listing of bzr projects - sourceforge: Do not consider Attic as a valid CVS module -- Software Heritage autobuilder (on jenkins-debian1) Mon, 25 Apr 2022 10:39:18 +0000 swh-lister (2.8.1-1~swh1) unstable-swh; urgency=medium * New upstream release 2.8.1 - (tagged by Antoine R. Dumont (@ardumont) on 2022-04-14 15:56:17 +0200) * Upstream changes: - v2.8.1 - maven: Fix argument of type 'NoneType' is not iterable -- Software Heritage autobuilder (on jenkins-debian1) Thu, 14 Apr 2022 14:01:42 +0000 swh-lister (2.8.0-1~swh2) unstable-swh; urgency=medium * Bump new release (fix build dep) -- Antoine R. Dumont (@ardumont) Thu, 14 Apr 2022 14:51:05 +0200 swh-lister (2.8.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.8.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2022-04-14 11:42:16 +0200) * Upstream changes: - v2.8.0 - lister: Add new rust crates lister - maven: Continue listing if unable to retrieve pom information - maven: log error message when not able to retrieve the index to read -- Software Heritage autobuilder (on jenkins-debian1) Thu, 14 Apr 2022 09:50:25 +0000 swh-lister (2.7.2-1~swh1) unstable-swh; urgency=medium * New upstream release 2.7.2 - (tagged by Antoine Lambert on 2022-03-11 13:34:15 +0100) * Upstream changes: - version 2.7.2 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 11 Mar 2022 12:38:38 +0000 swh-lister (2.7.1-1~swh1) unstable-swh; urgency=medium * New upstream release 2.7.1 - (tagged by Antoine R. Dumont (@ardumont) on 2022-02-18 10:42:52 +0100) * Upstream changes: - v2.7.1 - launchpad: Ignore erratic page and continue listing next page -- Software Heritage autobuilder (on jenkins-debian1) Fri, 18 Feb 2022 09:46:37 +0000 swh-lister (2.7.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.7.0 - (tagged by Antoine R. Dumont (@ardumont) on 2022-02-17 13:56:23 +0100) * Upstream changes: - v2.7.0 - launchpad: Allow bzr origins listing - launchpad: Manage unhandled exceptions when listing - sourceforge: Fix origin URLs for CVS projects -- Software Heritage autobuilder (on jenkins-debian1) Thu, 17 Feb 2022 13:02:22 +0000 swh-lister (2.6.4-1~swh1) unstable-swh; urgency=medium * New upstream release 2.6.4 - (tagged by Antoine R. Dumont (@ardumont) on 2022-02-14 16:57:38 +0100) * Upstream changes: - v2.6.4 - sourceforge: fix support for listing bzr origins -- Software Heritage autobuilder (on jenkins-debian1) Mon, 14 Feb 2022 16:01:23 +0000 swh-lister (2.6.3-1~swh1) unstable-swh; urgency=medium * New upstream release 2.6.3 - (tagged by Antoine R. 
Dumont (@ardumont) on 2022-02-09 17:20:28 +0100) * Upstream changes: - v2.6.3 - maven: Fix last update datetime -- Software Heritage autobuilder (on jenkins-debian1) Wed, 09 Feb 2022 16:24:11 +0000 swh-lister (2.6.2-1~swh1) unstable-swh; urgency=medium * New upstream release 2.6.2 - (tagged by Antoine R. Dumont (@ardumont) on 2022-02-08 10:39:05 +0100) * Upstream changes: - v2.6.2 - Remove no longer needed tenacity workarounds - maven: Fix undef last_update in ListedOrigins. - maven: dismiss origins if they are malformed - e.g. wrong pom scm format, add test. - maven: Let logging instruction do the formatting - maven: Add more debug logging instruction - maven: Pass the base URL of the Maven instance to the loader - docs: Fix ReST syntax and sphinx warnings - Pin mypy and drop type annotations which makes mypy unhappy - requirements-test: Pin pytest to < 7.0.0 -- Software Heritage autobuilder (on jenkins-debian1) Tue, 08 Feb 2022 09:43:37 +0000 swh-lister (2.6.1-1~swh1) unstable-swh; urgency=medium * New upstream release 2.6.1 - (tagged by Antoine Lambert on 2021-12-06 10:47:19 +0100) * Upstream changes: - version 2.6.1 -- Software Heritage autobuilder (on jenkins-debian1) Mon, 06 Dec 2021 09:51:07 +0000 swh-lister (2.6.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.6.0 - (tagged by Antoine Lambert on 2021-12-03 16:17:52 +0100) * Upstream changes: - version 2.6.0 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 03 Dec 2021 15:22:00 +0000 swh-lister (2.5.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.5.0 - (tagged by Antoine Lambert on 2021-12-03 14:44:36 +0100) * Upstream changes: - version 2.5.0 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 03 Dec 2021 13:48:49 +0000 swh-lister (2.4.0-1~swh3) unstable-swh; urgency=medium * Fix changelog error and actual correct release -- Antoine R. 
Dumont (@ardumont) Fri, 03 Dec 2021 12:45:00 +0100 swh.lister (2.4.0-1~swh2) unstable-swh; urgency=medium * Update missing deps and release -- Antoine R. Dumont (@ardumont) Fri, 03 Dec 2021 12:37:13 +0100 swh-lister (2.4.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.4.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-12-03 12:17:36 +0100) * Upstream changes: - v2.4.0 - debian: Update extra_loader_arguments dict produced ListedOrigin models - debian: Add missing file URIs in lister output - Deduplicate origins in the GitHub lister - lister: Add new maven lister -- Software Heritage autobuilder (on jenkins-debian1) Fri, 03 Dec 2021 11:21:58 +0000 swh-lister (2.3.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.3.0 - (tagged by Valentin Lorentz on 2021-11-10 13:44:49 +0100) * Upstream changes: - v2.3.0 - * cran: Pass the package name to the loader -- Software Heritage autobuilder (on jenkins-debian1) Wed, 10 Nov 2021 13:03:02 +0000 swh-lister (2.2.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.2.0 - (tagged by Antoine Lambert on 2021-10-22 15:16:48 +0200) * Upstream changes: - version 2.2.0 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 22 Oct 2021 13:23:02 +0000 swh-lister (2.1.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.1.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-10-13 10:16:37 +0200) * Upstream changes: - v2.1.0 - Let sourceforge origins be listed "enabled" by default - docs: Add a save forge documentation - docs: Explain task type registering to complete the save forge doc -- Software Heritage autobuilder (on jenkins-debian1) Wed, 13 Oct 2021 08:21:42 +0000 swh-lister (2.0.0-1~swh1) unstable-swh; urgency=medium * New upstream release 2.0.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-09-29 09:21:37 +0200) * Upstream changes: - v2.0.0 - opam: Share opam root directory even on multiple instances -- Software Heritage autobuilder (on jenkins-debian1) Wed, 29 Sep 2021 07:31:03 +0000 swh-lister (1.9.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.9.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-09-21 11:23:23 +0200) * Upstream changes: - v1.9.0 - gnu: Respect the pattern docstring about state initialization - opam: Allow defining where to actually install the opam_root folder - opam: Make the instance optional and derived from the url - opam: Move the state initialization into the get_pages method -- Software Heritage autobuilder (on jenkins-debian1) Tue, 21 Sep 2021 09:29:04 +0000 swh-lister (1.8.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.8.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-09-17 15:44:00 +0200) * Upstream changes: - v1.8.0 - Allow gitlab lister's name to be overridden by task arguments -- Software Heritage autobuilder (on jenkins-debian1) Fri, 17 Sep 2021 13:47:58 +0000 swh-lister (1.7.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.7.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-09-17 13:37:22 +0200) * Upstream changes: - v1.7.0 - gitlab: Allow ingestion of hg_git origins as hg ones (some instances can list those e.g - foss.heptapod.net) -- Software Heritage autobuilder (on jenkins-debian1) Fri, 17 Sep 2021 11:41:52 +0000 swh-lister (1.6.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.6.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-09-17 10:50:28 +0200) * Upstream changes: - v1.6.0 - gitlab: Allow listing of instances providing multiple vcs_type -- Software Heritage autobuilder (on jenkins-debian1) Fri, 17 Sep 2021 08:55:14 +0000 swh-lister (1.5.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.5.0 - (tagged by Antoine R.
Dumont (@ardumont) on 2021-07-23 16:28:50 +0200) * Upstream changes: - v1.5.0 - gitlab: Handle HTTP status code 500 when listing projects - gitlab: Update requests query parameters - gitlab: Adapt requests retry policy to consider HTTP 50x status codes - opam: Directly use the --root flag instead of using an env variable - pattern: Use URL network location as instance name when not provided -- Software Heritage autobuilder (on jenkins-debian1) Fri, 23 Jul 2021 14:32:51 +0000 swh-lister (1.4.0-1~swh2) unstable-swh; urgency=medium * Bump new release -- Antoine R. Dumont (@ardumont) Fri, 09 Jul 2021 13:17:00 +0200 swh-lister (1.4.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.4.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-07-09 13:01:04 +0200) * Upstream changes: - v1.4.0 - New Tuleap lister - New Opam lister - Make PyPI lister incremental - Make PyPI lister complete the information on origins -- Software Heritage autobuilder (on jenkins-debian1) Fri, 09 Jul 2021 11:06:37 +0000 swh-lister (1.3.6-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.6 - (tagged by Antoine R. Dumont (@ardumont) on 2021-06-04 11:59:24 +0200) * Upstream changes: - v1.3.6 - sourceforge: use http:// for Mercurial (as workaround) -- Software Heritage autobuilder (on jenkins-debian1) Fri, 04 Jun 2021 10:03:14 +0000 swh-lister (1.3.5-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.5 - (tagged by Antoine R. Dumont (@ardumont) on 2021-06-03 10:22:17 +0200) * Upstream changes: - v1.3.5 - sourceforge: set the protocol for origin urls -- Software Heritage autobuilder (on jenkins-debian1) Thu, 03 Jun 2021 08:26:13 +0000 swh-lister (1.3.4-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.4 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-05-31 16:54:37 +0200) * Upstream changes: - v1.3.4 - Disable the sourceforge lister origins (so they can be listed) -- Software Heritage autobuilder (on jenkins-debian1) Mon, 31 May 2021 15:08:17 +0000 swh-lister (1.3.3-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.3 - (tagged by Antoine R. Dumont (@ardumont) on 2021-05-28 14:18:53 +0200) * Upstream changes: - v1.3.3 - cgit/lister: Fix error when a missing version is not provided -- Software Heritage autobuilder (on jenkins-debian1) Fri, 28 May 2021 12:39:52 +0000 swh-lister (1.3.2-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.2 - (tagged by Antoine R. Dumont (@ardumont) on 2021-05-26 12:43:45 +0200) * Upstream changes: - v1.3.2 - sourceforge: retry for all retryable exceptions -- Software Heritage autobuilder (on jenkins-debian1) Wed, 26 May 2021 10:48:22 +0000 swh-lister (1.3.1-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.1 - (tagged by Antoine R. Dumont (@ardumont) on 2021-05-19 11:25:59 +0200) * Upstream changes: - v1.3.1 - sourceforge: don't abort on error for project -- Software Heritage autobuilder (on jenkins-debian1) Wed, 19 May 2021 09:30:14 +0000 swh-lister (1.3.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.3.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-05-07 17:17:50 +0200) * Upstream changes: - v1.3.0 - sourceforge/tasks: Allow incremental listing - sourceforge/lister: Add credentials parameter -- Software Heritage autobuilder (on jenkins-debian1) Fri, 07 May 2021 15:24:27 +0000 swh-lister (1.2.2-1~swh1) unstable-swh; urgency=medium * New upstream release 1.2.2 - (tagged by Antoine Lambert on 2021-05-07 14:43:24 +0200) * Upstream changes: - version 1.2.2 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 07 May 2021 12:50:12 +0000 swh-lister (1.2.1-1~swh1) unstable-swh; urgency=medium * New upstream release 1.2.1 - (tagged by Antoine Lambert on 2021-05-07 14:10:36 +0200) * Upstream changes: - version 1.2.1 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 07 May 2021 12:17:16 +0000 swh-lister (1.2.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.2.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-05-06 15:17:51 +0200) * Upstream changes: - v1.2.0 - Make the SourceForge lister incremental -- Software Heritage autobuilder (on jenkins-debian1) Fri, 07 May 2021 10:43:11 +0000 swh-lister (1.1.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.1.0 - (tagged by Antoine Lambert on 2021-04-29 14:29:27 +0200) * Upstream changes: - version 1.1.0 -- Software Heritage autobuilder (on jenkins-debian1) Thu, 29 Apr 2021 12:33:59 +0000 swh-lister (1.0.0-1~swh1) unstable-swh; urgency=medium * New upstream release 1.0.0 - (tagged by Nicolas Dandrimont on 2021-03-22 10:56:04 +0100) * Upstream changes: - Release swh.lister v1.0.0 - All listers have been rewritten and are ready to be used in production - with the most recent version of the swh.scheduler APIs. -- Software Heritage autobuilder (on jenkins-debian1) Mon, 22 Mar 2021 10:13:35 +0000 swh-lister (0.10.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.10.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-03-01 09:59:16 +0100) * Upstream changes: - v0.10.0 - docs: Add new "howto write a lister tutorial" with unified lister api -- Software Heritage autobuilder (on jenkins-debian1) Mon, 01 Mar 2021 09:01:54 +0000 swh-lister (0.9.1-1~swh1) unstable-swh; urgency=medium * New upstream release 0.9.1 - (tagged by Antoine R. Dumont (@ardumont) on 2021-02-08 14:09:27 +0100) * Upstream changes: - v0.9.1 - debian: Update archive mirror URL templates to process -- Software Heritage autobuilder (on jenkins-debian1) Mon, 08 Feb 2021 13:12:05 +0000 swh-lister (0.9.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.9.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-02-08 08:50:07 +0100) * Upstream changes: - v0.9.0 - docs: Update listers execution instructions - cran: Prevent multiple listing of an origin - cran: Add support for parsing date with milliseconds - pypi: Use BeautifulSoup for parsing HTML instead of xmltodict -- Software Heritage autobuilder (on jenkins-debian1) Mon, 08 Feb 2021 07:52:57 +0000 swh-lister (0.8.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.8.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-02-03 11:12:52 +0100) * Upstream changes: - v0.8.0 - packagist: Reimplement lister using new Lister API - gnu: Remove dependency on pytz - Remove no longer used models field in dict returned by register - Remove no longer used legacy Lister API and update CLI options -- Software Heritage autobuilder (on jenkins-debian1) Wed, 03 Feb 2021 10:15:54 +0000 swh-lister (0.7.1-1~swh1) unstable-swh; urgency=medium * New upstream release 0.7.1 - (tagged by Vincent SELLIER on 2021-02-01 17:52:33 +0100) * Upstream changes: - v0.7.1 - * cgit: remove the repository urls's trailing / -- Software Heritage autobuilder (on jenkins-debian1) Mon, 01 Feb 2021 16:56:35 +0000 swh-lister (0.7.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.7.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-02-01 09:31:30 +0100) * Upstream changes: - v0.7.0 - pattern: Bump packet split to chunk of 1000 records - cgit: Compute origin urls out of a base git url when provided. - gnu: Reimplement lister using new Lister API -- Software Heritage autobuilder (on jenkins-debian1) Mon, 01 Feb 2021 08:35:14 +0000 swh-lister (0.6.1-1~swh1) unstable-swh; urgency=medium * New upstream release 0.6.1 - (tagged by Antoine R. Dumont (@ardumont) on 2021-01-29 09:07:21 +0100) * Upstream changes: - v0.6.1 - launchpad: Remove call to dataclasses.asdict on lister state - launchpad: Prevent error due to origin listed twice - Make debian lister constructors compatible with credentials - launchpad/tasks: Fix ping task function name - pattern: Make lister flush regularly origins to scheduler -- Software Heritage autobuilder (on jenkins-debian1) Fri, 29 Jan 2021 08:11:13 +0000 swh-lister (0.6.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.6.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-01-28 15:48:32 +0100) * Upstream changes: - v0.6.0 - launchpad: Reimplement lister using new Lister API - Make stateless lister constructors compatible with credentials -- Software Heritage autobuilder (on jenkins-debian1) Thu, 28 Jan 2021 14:52:49 +0000 swh-lister (0.5.4-1~swh1) unstable-swh; urgency=medium * New upstream release 0.5.4 - (tagged by Antoine R. Dumont (@ardumont) on 2021-01-28 11:23:29 +0100) * Upstream changes: - v0.5.4 - gitlab: Deal with missing or trailing / in url input - tox.ini: Work around build failure due to upstream release -- Software Heritage autobuilder (on jenkins-debian1) Thu, 28 Jan 2021 10:27:59 +0000 swh-lister (0.5.2-1~swh1) unstable-swh; urgency=medium * New upstream release 0.5.2 - (tagged by Antoine R. 
Dumont (@ardumont) on 2021-01-27 17:19:10 +0100) * Upstream changes: - v0.5.2 - test_cli: Drop launchpad lister from the test_get_lister -- Software Heritage autobuilder (on jenkins-debian1) Wed, 27 Jan 2021 16:25:31 +0000 swh-lister (0.5.1-1~swh1) unstable-swh; urgency=medium * New upstream release 0.5.1 - (tagged by Antoine R. Dumont (@ardumont) on 2021-01-27 16:39:20 +0100) * Upstream changes: - v0.5.1 - launchpad: Actually mock the anonymous login to launchpad - Drop no longer swh.lister.core.{indexing,page_by_page}_lister - tests: Drop unneeded reset instruction - cgit: Don't stop the listing when a repository page is not available -- Software Heritage autobuilder (on jenkins-debian1) Wed, 27 Jan 2021 15:47:39 +0000 swh-lister (0.5.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.5.0 - (tagged by Antoine R. Dumont (@ardumont) on 2021-01-27 14:33:24 +0100) * Upstream changes: - v0.5.0 - cgit: Add support for last_update information during listing - Port Debian lister to new lister api - gitlab: Implement keyset-based pagination listing - cran: Retrieve last update date for each listed package - Port CRAN lister to new lister api - gitlab: Add support for last_update information during listing - Port Gitea lister to new lister api - Port cgit lister to the new lister api - bitbucket: Pick random credentials in configuration and improve logging - Port Gitlab lister to the new lister api - Port Npm lister to new lister api - Port PyPI lister to new lister api - Port Bitbucket lister to new lister api - Port Phabricator lister to new lister api - Port GitHub lister to new lister api - Introduce a simpler base pattern for lister implementations -- Software Heritage autobuilder (on jenkins-debian1) Wed, 27 Jan 2021 13:40:34 +0000 swh-lister (0.4.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.4.0 - (tagged by Antoine R. 
Dumont (@ardumont) on 2020-11-23 15:47:05 +0100) * Upstream changes: - v0.4.0 - requirements: Rework dependencies - tests: Reduce db initialization fixtures to a minimum - Create listing task with a default of 3 if unspecified - lister.pytest_plugin: Simplify fixture setup - tests: Clarify listers test configuration -- Software Heritage autobuilder (on jenkins-debian1) Mon, 23 Nov 2020 14:52:03 +0000 swh-lister (0.3.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.3.0 - (tagged by Antoine R. Dumont (@ardumont) on 2020-10-19 09:50:43 +0200) * Upstream changes: - v0.3.0 - lister.config: Adapt scheduler configuration structure - drop mock_get_scheduler which creates indirection for no good reason -- Software Heritage autobuilder (on jenkins-debian1) Mon, 19 Oct 2020 07:56:17 +0000 swh-lister (0.2.1-1~swh1) unstable-swh; urgency=medium * New upstream release 0.2.1 - (tagged by Antoine R. Dumont (@ardumont) on 2020-10-07 14:02:42 +0200) * Upstream changes: - v0.2.1 - lister_base: Drop leftover mixin SWHConfig which is no longer used -- Software Heritage autobuilder (on jenkins-debian1) Wed, 07 Oct 2020 12:07:43 +0000 swh-lister (0.2.0-1~swh1) unstable-swh; urgency=medium * New upstream release 0.2.0 - (tagged by Antoine R. Dumont (@ardumont) on 2020-10-06 09:33:33 +0200) * Upstream changes: - v0.2.0 - lister*: Migrate away from SWHConfig mixin - tox.ini: pin black to the pre-commit version (19.10b0) to avoid flip-flops - Run isort after the CLI import changes -- Software Heritage autobuilder (on jenkins-debian1) Tue, 06 Oct 2020 07:36:07 +0000 swh-lister (0.1.5-1~swh1) unstable-swh; urgency=medium * New upstream release 0.1.5 - (tagged by David Douard on 2020-09-25 11:51:57 +0200) * Upstream changes: - v0.1.5 -- Software Heritage autobuilder (on jenkins-debian1) Fri, 25 Sep 2020 09:55:44 +0000 swh-lister (0.1.4-1~swh1) unstable-swh; urgency=medium * New upstream release 0.1.4 - (tagged by Antoine R. 
Dumont (@ardumont) on 2020-09-10 11:32:46 +0200)
  * Upstream changes:
    - v0.1.4
    - gitea.lister: Fix uid to be unique across instance
    - utils.split_range: Split into not overlapping ranges
    - gitea.tasks: Fix parameter name from 'sort' to 'order'

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 10 Sep 2020 09:35:53 +0000

swh-lister (0.1.3-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.1.3 - (tagged by Vincent SELLIER on 2020-09-08 14:48:08 +0200)
  * Upstream changes:
    - v0.1.3
    - Launchpad: rename task name to match conventions
    - tests: Separate lister instantiations

 -- Software Heritage autobuilder (on jenkins-debian1)  Tue, 08 Sep 2020 12:53:22 +0000

swh-lister (0.1.2-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.1.2 - (tagged by Antoine R. Dumont (@ardumont) on 2020-09-02 13:07:30 +0200)
  * Upstream changes:
    - v0.1.2
    - pytest_plugin: Instantiate only lister with no particular setup
    - pytest: Define plugin and declare it in the root conftest

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 02 Sep 2020 11:10:14 +0000

swh-lister (0.1.1-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.1.1 - (tagged by Antoine R. Dumont (@ardumont) on 2020-09-01 16:08:48 +0200)
  * Upstream changes:
    - v0.1.1
    - test_cli: Exclude launchpad lister from the check

 -- Software Heritage autobuilder (on jenkins-debian1)  Tue, 01 Sep 2020 14:11:46 +0000

swh-lister (0.1.0-1~swh2) unstable-swh; urgency=medium

  * Update dependencies

 -- Antoine R. Dumont (@ardumont)  Wed, 26 Aug 2020 16:05:03 +0000

swh-lister (0.1.0-1~swh1) unstable-swh; urgency=medium

  [ Nicolas Dandrimont ]
  * Use setuptools-scm instead of vcversioner

  [ Software Heritage autobuilder (on jenkins-debian1) ]
  * New upstream release 0.1.0 - (tagged by David Douard on 2020-08-25 18:33:55 +0200)
  * Upstream changes:
    - v0.1.0

 -- Software Heritage autobuilder (on jenkins-debian1)  Tue, 25 Aug 2020 16:39:28 +0000

swh-lister (0.0.50-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.50 - (tagged by Antoine R. Dumont (@ardumont) on 2020-01-20 10:44:57 +0100)
  * Upstream changes:
    - v0.0.50
    - github.lister: Filter out partial repositories which break listing
    - docs: Fix sphinx warnings
    - core.lister_base: Improve slightly docs and types

 -- Software Heritage autobuilder (on jenkins-debian1)  Mon, 20 Jan 2020 09:51:23 +0000

swh-lister (0.0.49-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.49 - (tagged by Antoine R. Dumont (@ardumont) on 2020-01-17 14:20:35 +0100)
  * Upstream changes:
    - v0.0.49
    - github.lister: Use Retry-After header when rate limit reached

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 17 Jan 2020 13:27:56 +0000

swh-lister (0.0.48-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.48 - (tagged by Antoine R. Dumont (@ardumont) on 2020-01-16 13:56:12 +0100)
  * Upstream changes:
    - v0.0.48
    - cran.lister: Use cran's canonical url for origin url
    - cran.lister: Version uid so we can list new package versions
    - cran.lister: Adapt docstring sample accordingly

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 16 Jan 2020 13:03:54 +0000

swh-lister (0.0.47-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.47 - (tagged by Antoine R. Dumont (@ardumont) on 2020-01-09 10:26:18 +0100)
  * Upstream changes:
    - v0.0.47
    - cran.lister: Align loading tasks' with loader's expectation

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 09 Jan 2020 09:34:26 +0000

swh-lister (0.0.46-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.46 - (tagged by Antoine R. Dumont (@ardumont) on 2019-12-19 14:09:45 +0100)
  * Upstream changes:
    - v0.0.46
    - lister.debian: Make debian init step idempotent and up-to-date
    - lister_base: Split into chunks the tasks prior to creation

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 19 Dec 2019 13:16:45 +0000

swh-lister (0.0.45-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.45 - (tagged by Antoine R. Dumont (@ardumont) on 2019-12-10 11:27:17 +0100)
  * Upstream changes:
    - v0.0.45
    - core: Align listers' task output (hg/git tasks) with expected format
    - npm: Align lister's loader output tasks with expected format
    - lister/tasks: Standardize return statements

 -- Software Heritage autobuilder (on jenkins-debian1)  Tue, 10 Dec 2019 10:32:45 +0000

swh-lister (0.0.44-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.44 - (tagged by Nicolas Dandrimont on 2019-11-22 16:15:54 +0100)
  * Upstream changes:
    - Release swh.lister v0.0.44
    - Define proper User Agents everywhere

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 22 Nov 2019 15:31:33 +0000

swh-lister (0.0.43-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.43 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-21 18:46:35 +0100)
  * Upstream changes:
    - v0.0.43
    - lister.pypi: Align lister with pypi package loader
    - lister.npm: Align lister with npm package loader
    - lister.tests: Avoid duplication setup step
    - Fix typos (and trailing ws) reported by codespell
    - Add a pre-commit config file

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 21 Nov 2019 17:56:34 +0000

swh-lister (0.0.42-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.42 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-21 13:52:16 +0100)
  * Upstream changes:
    - v0.0.42
    - cran/gnu: Rename task_type to load-archive-files
    - lister.tests: Add missing task_type for package listers
    - Migrate tox.ini to extras = xxx instead of deps = .[testing]
    - Merge tox environments
    - Include all requirements in MANIFEST.in
    - lister.cli: Remove task type register cli

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 21 Nov 2019 13:00:29 +0000

swh-lister (0.0.41-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.41 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-15 12:02:13 +0100)
  * Upstream changes:
    - v0.0.41
    - simple_lister: Flush to db more frequently
    - gnu.lister: Use url as primary key
    - gnu.lister.tests: Add missing assertion
    - gnu.lister: Add missing retries_left parameter
    - debian.models: Migrate tests from storage to debian lister model

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 15 Nov 2019 11:06:35 +0000

swh-lister (0.0.40-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.40 - (tagged by Nicolas Dandrimont on 2019-11-13 13:54:38 +0100)
  * Upstream changes:
    - Release swh.lister 0.0.40
    - Fix bogus NotImplementedError on Area.index_uris

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 13 Nov 2019 13:02:08 +0000

swh-lister (0.0.39-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.39 - (tagged by Nicolas Dandrimont on 2019-11-13 13:23:31 +0100)
  * Upstream changes:
    - Release swh.lister 0.0.39
    - Properly register all tasks
    - Fix up db_partition_indices to avoid expensive scans

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 13 Nov 2019 12:28:33 +0000

swh-lister (0.0.38-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.38 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-06 15:55:46 +0100)
  * Upstream changes:
    - v0.0.38
    - Remove swh.storage.schemata remnants

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 06 Nov 2019 15:00:16 +0000

swh-lister (0.0.37-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.37 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-06 15:06:51 +0100)
  * Upstream changes:
    - v0.0.37
    - Update swh-core dependency

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 06 Nov 2019 14:18:31 +0000

swh-lister (0.0.36-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.36 - (tagged by Antoine R. Dumont (@ardumont) on 2019-11-06 11:33:33 +0100)
  * Upstream changes:
    - v0.0.36
    - lister.*.tests: Add at least one integration test
    - gnu.lister: Move gnu listers specifity within the lister's scope
    - debian/lister: Use url parameter name instead of origin
    - debian/model: Install lister model within the lister repository
    - lister.*.tasks: Stop binding tasks to a specific instance of the celery app
    - cran.lister: Refactor and fix cran lister
    - github/lister: Prevent erroneous scheduler tasks disabling
    - phabricator/lister: Fix lister
    - setup.py: Kill deprecated swh-lister command
    - Bootstrap typing annotations

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 06 Nov 2019 10:55:41 +0000

swh-lister (0.0.35-1~swh4) unstable-swh; urgency=medium

  * Fix runtime dependencies

 -- Antoine R. Dumont (@ardumont)  Wed, 11 Sep 2019 10:58:01 +0200

swh-lister (0.0.35-1~swh3) unstable-swh; urgency=medium

  * Bump dh-python to >= 3 for pybuild.testfiles.

 -- Nicolas Dandrimont  Tue, 10 Sep 2019 14:58:11 +0200

swh-lister (0.0.35-1~swh2) unstable-swh; urgency=medium

  * Add egg-info to pybuild.testfiles. Close T1995.

 -- Nicolas Dandrimont  Tue, 10 Sep 2019 14:36:22 +0200

swh-lister (0.0.35-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.35 - (tagged by Antoine R. Dumont (@ardumont) on 2019-09-09 12:14:42 +0200)
  * Upstream changes:
    - v0.0.35
    - Fix debian package

 -- Software Heritage autobuilder (on jenkins-debian1)  Mon, 09 Sep 2019 10:19:02 +0000

swh-lister (0.0.34-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.34 - (tagged by Antoine R. Dumont (@ardumont) on 2019-09-06 14:03:39 +0200)
  * Upstream changes:
    - v0.0.34
    - listers: Implement listers as plugins
    - cgit: rewrite the CGit lister (and add more tests)
    - listers: simplify and unify constructor use
    - phabricator: randomly select the API token in the provided list
    - docs: Fix toc

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 06 Sep 2019 12:09:13 +0000

swh-lister (0.0.33-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.33 - (tagged by Antoine R. Dumont (@ardumont) on 2019-08-29 10:23:20 +0200)
  * Upstream changes:
    - v0.0.33
    - lister.cli: Allow to list forges with policy and priority
    - listers: Add New packagist lister
    - listers: Allow to override policy and priority for scheduled tasks
    - tests: Add tests to cli, pypi and improve lister core's
    - docs: Add code of conduct document

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 29 Aug 2019 08:28:23 +0000

swh-lister (0.0.32-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.32 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-28 18:21:50 +0200)
  * Upstream changes:
    - v0.0.32
    - Clean up dead code
    - Add missing *.html sample for tests to run in packaging

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 28 Jun 2019 16:42:05 +0000

swh-lister (0.0.31-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.31 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-28 17:57:48 +0200)
  * Upstream changes:
    - v0.0.31
    - Add cgit instance lister
    - Add back description in cran lister
    - Update contributors

 -- Software Heritage autobuilder (on jenkins-debian1)  Fri, 28 Jun 2019 16:06:25 +0000

swh-lister (0.0.30-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.30 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-26 14:52:13 +0200)
  * Upstream changes:
    - v0.0.30
    - Drop last description mentions for gitlab and cran listers.

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 26 Jun 2019 13:02:11 +0000

swh-lister (0.0.29-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.29 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-26 12:37:14 +0200)
  * Upstream changes:
    - v0.0.29
    - lister: Fix bitbucket lister

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 26 Jun 2019 10:47:20 +0000

swh-lister (0.0.28-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.28 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-20 12:00:09 +0200)
  * Upstream changes:
    - v0.0.28
    - listers: Remove unused columns `origin_id` / `description`
    - gnu-lister: Use origin-type as 'tar' (and not 'gnu')
    - phabricator: Remove unused code

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 20 Jun 2019 10:07:48 +0000

swh-lister (0.0.27-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.27 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-18 10:27:09 +0200)
  * Upstream changes:
    - v0.0.27
    - Unify lister tablenames to use consistently singular
    - Add missing instance field to phabricator repository model

 -- Software Heritage autobuilder (on jenkins-debian1)  Tue, 18 Jun 2019 08:44:38 +0000

swh-lister (0.0.26-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.26 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-17 17:53:33 +0200)
  * Upstream changes:
    - v0.0.26
    - phabricator.lister: Use credentials setup from configuration file
    - gitlab.lister: Remove request_params method override

 -- Software Heritage autobuilder (on jenkins-debian1)  Mon, 17 Jun 2019 16:05:05 +0000

swh-lister (0.0.25-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.25 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-13 15:54:42 +0200)
  * Upstream changes:
    - v0.0.25
    - Add new cran lister
    - listers: Stop creating origins when scheduling new tasks

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 13 Jun 2019 13:59:30 +0000

swh-lister (0.0.24-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.24 - (tagged by Antoine R. Dumont (@ardumont) on 2019-06-12 12:02:54 +0200)
  * Upstream changes:
    - v0.0.24
    - swh.lister.gnu: Add new gnu lister

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 12 Jun 2019 10:10:56 +0000

swh-lister (0.0.23-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.23 - (tagged by Antoine R. Dumont (@ardumont) on 2019-05-29 14:04:22 +0200)
  * Upstream changes:
    - v0.0.23
    - lister: Unify credentials structure between listers

 -- Software Heritage autobuilder (on jenkins-debian1)  Wed, 29 May 2019 12:10:51 +0000

swh-lister (0.0.22-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.22 - (tagged by Antoine Lambert on 2019-05-23 10:59:39 +0200)
  * Upstream changes:
    - version 0.0.22

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 23 May 2019 09:05:34 +0000

swh-lister (0.0.21-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.21 - (tagged by Antoine Lambert on 2019-04-11 11:00:55 +0200)
  * Upstream changes:
    - version 0.0.21

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 11 Apr 2019 09:05:30 +0000

swh-lister (0.0.20-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.20 - (tagged by Antoine R. Dumont (@ardumont) on 2019-02-14 10:50:06 +0100)
  * Upstream changes:
    - v0.0.20
    - d/*: debian packaging files migrated to separated branches
    - lister.cli: Fix spelling typo

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 14 Feb 2019 09:59:29 +0000

swh-lister (0.0.19-1~swh1) unstable-swh; urgency=medium

  * New upstream release 0.0.19 - (tagged by David Douard on 2019-02-07 17:36:33 +0100)
  * Upstream changes:
    - v0.0.19

 -- Software Heritage autobuilder (on jenkins-debian1)  Thu, 07 Feb 2019 16:42:39 +0000

swh-lister (0.0.18-1~swh1) unstable-swh; urgency=medium

  * v0.0.18
  * docs: add title and brief module description
  * gitlab.lister: Break asap when problem exists during fetch info
  * gitlab.lister: Do not expect gitlab instances to have credentials
  * setup: prepare for pypi upload
  * gitlab/models.py: drop unused import

 -- Antoine R. Dumont (@ardumont)  Mon, 08 Oct 2018 15:54:12 +0200

swh-lister (0.0.17-1~swh1) unstable-swh; urgency=medium

  * v0.0.17
  * Change pypi project url to use the /project api

 -- Antoine R. Dumont (@ardumont)  Tue, 18 Sep 2018 11:35:25 +0200

swh-lister (0.0.16-1~swh1) unstable-swh; urgency=medium

  * v0.0.16
  * Normalize PyPI name

 -- Antoine R. Dumont (@ardumont)  Fri, 14 Sep 2018 13:25:56 +0200

swh-lister (0.0.15-1~swh1) unstable-swh; urgency=medium

  * v0.0.15
  * Add pypi lister

 -- Antoine R. Dumont (@ardumont)  Thu, 06 Sep 2018 17:09:25 +0200

swh-lister (0.0.14-1~swh1) unstable-swh; urgency=medium

  * v0.0.14
  * core.lister_base: Batch create origins (storage) & tasks (scheduler)
  * swh.lister.cli: Add debian lister to the list of supported listers
  * README.md: Update to demo the lister debian run

 -- Antoine R. Dumont (@ardumont)  Tue, 31 Jul 2018 15:46:12 +0200

swh-lister (0.0.13-1~swh1) unstable-swh; urgency=medium

  * v0.0.13
  * Fix missing use cases when unable to retrieve information from the api server
  * gitlab/lister: Allow specifying the number of elements to read (default is 20, same as the current gitlab api)

 -- Antoine R. Dumont (@ardumont)  Fri, 20 Jul 2018 13:46:04 +0200

swh-lister (0.0.12-1~swh1) unstable-swh; urgency=medium

  * v0.0.12
  * swh.lister.gitlab.tasks: Use gitlab as instance name for gitlab.com
  * README.md: Add gitlab to the lister implementations referenced
  * core/lister_base: Remove unused import

 -- Antoine R. Dumont (@ardumont)  Thu, 19 Jul 2018 11:29:14 +0200

swh-lister (0.0.11-1~swh1) unstable-swh; urgency=medium

  * v0.0.11
  * lister/gitlab: Add gitlab lister
  * docs: Update documentation to demonstrate how to run a lister locally
  * core/lister: Make the listers' scheduler configuration adaptable
  * debian/*: Fix debian packaging tests

 -- Antoine R. Dumont (@ardumont)  Wed, 18 Jul 2018 14:16:56 +0200

swh-lister (0.0.10-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister v0.0.10
  * Add missing task_queue attribute for debian listing tasks
  * Make sure tests run during build
  * Clean up runtime dependencies

 -- Nicolas Dandrimont  Mon, 30 Oct 2017 17:37:25 +0100

swh-lister (0.0.9-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister v0.0.9
  * Add tasks for the Debian lister

 -- Nicolas Dandrimont  Mon, 30 Oct 2017 14:20:58 +0100

swh-lister (0.0.8-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister v0.0.8
  * Add versioned dependency on sqlalchemy

 -- Nicolas Dandrimont  Fri, 13 Oct 2017 12:15:38 +0200

swh-lister (0.0.7-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister version 0.0.7
  * Update packaging runes

 -- Nicolas Dandrimont  Thu, 12 Oct 2017 18:07:52 +0200

swh-lister (0.0.6-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister v0.0.6
  * Add new debian lister

 -- Nicolas Dandrimont  Wed, 11 Oct 2017 17:59:47 +0200

swh-lister (0.0.5-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister 0.0.5
  * Make the lister more generic
  * Add bitbucket lister
  * Update tasks to new swh.scheduler API

 -- Nicolas Dandrimont  Mon, 12 Jun 2017 18:22:13 +0200

swh-lister (0.0.4-1~swh1) unstable-swh; urgency=medium

  * v0.0.4
  * Update storage configuration reading

 -- Antoine R. Dumont (@ardumont)  Thu, 15 Dec 2016 19:07:24 +0100

swh-lister (0.0.3-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister.github v0.0.3
  * Generate swh.scheduler tasks and swh.storage origins on the fly
  * Use celery tasks to schedule own work

 -- Nicolas Dandrimont  Thu, 20 Oct 2016 17:30:39 +0200

swh-lister (0.0.2-1~swh1) unstable-swh; urgency=medium

  * Release swh.lister.github 0.0.2
  * Move constants to a constants module to avoid circular imports

 -- Nicolas Dandrimont  Thu, 17 Mar 2016 20:35:11 +0100

swh-lister (0.0.1-1~swh1) unstable-swh; urgency=medium

  * Initial release
  * Release swh.lister.github v0.0.1

 -- Nicolas Dandrimont  Thu, 17 Mar 2016 19:01:20 +0100

diff --git a/swh.lister.egg-info/PKG-INFO b/swh.lister.egg-info/PKG-INFO
index 09fe8d7..a1e696e 100644
--- a/swh.lister.egg-info/PKG-INFO
+++ b/swh.lister.egg-info/PKG-INFO
@@ -1,125 +1,125 @@
 Metadata-Version: 2.1
 Name: swh.lister
-Version: 4.0.1
+Version: 4.1.0
diff --git a/swh/lister/__init__.py b/swh/lister/__init__.py
index eaa5efd..28e81ef 100644
--- a/swh/lister/__init__.py
+++ b/swh/lister/__init__.py
@@ -1,84 +1,91 @@
 # Copyright (C) 2018-2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 import logging

 import pkg_resources

 logger = logging.getLogger(__name__)

 try:
     __version__ = pkg_resources.get_distribution("swh.lister").version
 except pkg_resources.DistributionNotFound:
     __version__ = "devel"

 USER_AGENT_TEMPLATE = (
     f"Software Heritage %s lister v{__version__}"
     " (+https://www.softwareheritage.org/contact)"
 )

 LISTERS = {
     entry_point.name.split(".", 1)[1]: entry_point
     for entry_point in pkg_resources.iter_entry_points("swh.workers")
     if entry_point.name.split(".", 1)[0] == "lister"
 }

 SUPPORTED_LISTERS = list(LISTERS)

 TARBALL_EXTENSIONS = [
     "crate",
     "gem",
     "jar",
+    "love",  # zip
     "zip",
     "tar",
     "gz",
     "tgz",
     "tbz",
     "bz2",
     "bzip2",
     "lzma",
     "lz",
     "txz",
     "xz",
     "z",
     "Z",
     "7z",
+    "oxt",  # zip
+    "pak",  # zip
+    "war",  # zip
+    "whl",  # zip
+    "vsix",  # zip
+    "VSIXPackage",  # zip
     "zst",
 ]
 """Tarball recognition pattern"""


 def get_lister(lister_name, db_url=None, **conf):
     """Instantiate a lister given its name.

     Args:
         lister_name (str): Lister's name
         conf (dict): Configuration dict (lister db cnx, policy, priority...)

     Returns:
         Tuple (instantiated lister, drop_tables function, init schema function,
         insert minimum data function)

     """
     if lister_name not in LISTERS:
         raise ValueError(
             "Invalid lister %s: only supported listers are %s"
             % (lister_name, SUPPORTED_LISTERS)
         )
     if db_url:
         conf["lister"] = {"cls": "local", "args": {"db": db_url}}

     registry_entry = LISTERS[lister_name].load()()
     lister_cls = registry_entry["lister"]

     from swh.lister import pattern

     if issubclass(lister_cls, pattern.Lister):
         return lister_cls.from_config(**conf)
     else:
         # Old-style lister
         return lister_cls(override_config=conf)
diff --git a/swh/lister/conda/lister.py b/swh/lister/conda/lister.py
index eddc15d..ab0190f 100644
--- a/swh/lister/conda/lister.py
+++ b/swh/lister/conda/lister.py
@@ -1,123 +1,123 @@
 # Copyright (C) 2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 import bz2
 from collections import defaultdict
 import datetime
 import json
 import logging
 from typing import Any, Dict, Iterator, List, Optional, Tuple

 import iso8601

 from swh.scheduler.interface import SchedulerInterface
 from swh.scheduler.model import ListedOrigin

 from ..pattern import CredentialsType, StatelessLister

 logger = logging.getLogger(__name__)

 # Aliasing the page results returned by `get_pages` method from the lister.
 CondaListerPage = Tuple[str, Dict[str, Dict[str, Any]]]


 class CondaLister(StatelessLister[CondaListerPage]):
     """List Conda (anaconda.com) origins."""

     LISTER_NAME = "conda"
     VISIT_TYPE = "conda"
     INSTANCE = "conda"
     BASE_REPO_URL = "https://repo.anaconda.com/pkgs"
     REPO_URL_PATTERN = "{url}/{channel}/{arch}/repodata.json.bz2"
     ORIGIN_URL_PATTERN = "https://anaconda.org/{channel}/{pkgname}"
     ARCHIVE_URL_PATTERN = "{url}/{channel}/{arch}/(unknown)"

     def __init__(
         self,
         scheduler: SchedulerInterface,
         credentials: Optional[CredentialsType] = None,
         url: str = BASE_REPO_URL,
         channel: str = "",
         archs: List = [],
     ):
         super().__init__(
             scheduler=scheduler,
             credentials=credentials,
             instance=self.INSTANCE,
             url=url,
         )

         self.channel: str = channel
         self.archs: List[str] = archs
         self.packages: Dict[str, Any] = defaultdict(dict)
         self.package_dates: Dict[str, Any] = defaultdict(list)

     def get_pages(self) -> Iterator[CondaListerPage]:
         """Yield an iterator which returns 'page'"""
         for arch in self.archs:
             repodata_url = self.REPO_URL_PATTERN.format(
                 url=self.url, channel=self.channel, arch=arch
             )
             response = self.http_request(url=repodata_url)
             packages: Dict[str, Any] = json.loads(bz2.decompress(response.content))[
                 "packages"
             ]
             yield (arch, packages)

     def get_origins_from_page(self, page: CondaListerPage) -> Iterator[ListedOrigin]:
         """Iterate on all pages and yield ListedOrigin instances."""
         assert self.lister_obj.id is not None

         arch, packages = page

+        package_names = set()
+
         for filename, package_metadata in packages.items():
+            package_names.add(package_metadata["name"])
+
             version_key = (
                 f"{arch}/{package_metadata['version']}-{package_metadata['build']}"
             )

             artifact: Dict[str, Any] = {
                 "filename": filename,
                 "url": self.ARCHIVE_URL_PATTERN.format(
                     url=self.url,
                     channel=self.channel,
                     filename=filename,
                     arch=arch,
                 ),
                 "version": version_key,
                 "checksums": {},
             }

             for checksum in ("md5", "sha256"):
                 if checksum in package_metadata:
                     artifact["checksums"][checksum] = package_metadata[checksum]

             self.packages[package_metadata["name"]][version_key] = artifact

             package_date = None
             if "timestamp" in package_metadata:
                 package_date = datetime.datetime.fromtimestamp(
                     package_metadata["timestamp"] / 1e3, datetime.timezone.utc
                 )
             elif "date" in package_metadata:
                 package_date = iso8601.parse_date(package_metadata["date"])

-            last_update = None
             if package_date:
                 artifact["date"] = package_date.isoformat()
                 self.package_dates[package_metadata["name"]].append(package_date)
-                last_update = max(self.package_dates[package_metadata["name"]])

+        for package_name in package_names:
+            package_dates = self.package_dates[package_name]
             yield ListedOrigin(
                 lister_id=self.lister_obj.id,
                 visit_type=self.VISIT_TYPE,
                 url=self.ORIGIN_URL_PATTERN.format(
-                    channel=self.channel, pkgname=package_metadata["name"]
+                    channel=self.channel, pkgname=package_name
                 ),
-                last_update=last_update,
+                last_update=max(package_dates, default=None),
                 extra_loader_arguments={
-                    "artifacts": [
-                        v for k, v in self.packages[package_metadata["name"]].items()
-                    ],
+                    "artifacts": list(self.packages[package_name].values())
                 },
             )
diff --git a/swh/lister/conda/tests/test_lister.py b/swh/lister/conda/tests/test_lister.py
index dd01064..49580ab 100644
--- a/swh/lister/conda/tests/test_lister.py
+++ b/swh/lister/conda/tests/test_lister.py
@@ -1,94 +1,119 @@
 # Copyright (C) 2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

-from swh.lister.conda.lister import CondaLister
-
-
-def test_conda_lister_free_channel(datadir, requests_mock_datadir, swh_scheduler):
-    lister = CondaLister(
-        scheduler=swh_scheduler, channel="free", archs=["linux-64", "osx-64", "win-64"]
-    )
-    res = lister.run()
-
-    assert res.pages == 3
-    assert res.origins == 11
-
-
-def test_conda_lister_conda_forge_channel(
-    datadir, requests_mock_datadir, swh_scheduler
-):
-    lister = CondaLister(
-        scheduler=swh_scheduler,
-        url="https://conda.anaconda.org",
-        channel="conda-forge",
-        archs=["linux-64"],
-    )
-    res = lister.run()
+import pytest

-    assert res.pages == 1
-    assert res.origins == 2
+from swh.lister.conda.lister import CondaLister

-    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results

-    expected_origins = [
+@pytest.fixture
+def expected_origins():
+    return [
         {
             "url": "https://anaconda.org/conda-forge/21cmfast",
             "artifacts": [
                 {
                     "url": "https://conda.anaconda.org/conda-forge/linux-64/21cmfast-3.0.2-py36h1af98f8_1.tar.bz2",  # noqa: B950
                     "date": "2020-11-11T16:04:49.658000+00:00",
                     "version": "linux-64/3.0.2-py36h1af98f8_1",
                     "filename": "21cmfast-3.0.2-py36h1af98f8_1.tar.bz2",
                     "checksums": {
                         "md5": "d65ab674acf3b7294ebacaec05fc5b54",
                         "sha256": "1154fceeb5c4ee9bb97d245713ac21eb1910237c724d2b7103747215663273c2",  # noqa: B950
                     },
                 }
             ],
         },
         {
             "url": "https://anaconda.org/conda-forge/lifetimes",
             "artifacts": [
                 {
                     "url": "https://conda.anaconda.org/conda-forge/linux-64/lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2",  # noqa: B950
                     "date": "2020-07-06T12:19:36.425000+00:00",
                     "version": "linux-64/0.11.1-py36h9f0ad1d_1",
                     "filename": "lifetimes-0.11.1-py36h9f0ad1d_1.tar.bz2",
                     "checksums": {
                         "md5": "faa398f7ba0d60ce44aa6eeded490cee",
                         "sha256": "f82a352dfae8abceeeaa538b220fd9c5e4aa4e59092a6a6cea70b9ec0581ea03",  # noqa: B950
                     },
                 },
                 {
                     "url": "https://conda.anaconda.org/conda-forge/linux-64/lifetimes-0.11.1-py36hc560c46_1.tar.bz2",  # noqa: B950
                     "date": "2020-07-06T12:19:37.032000+00:00",
                     "version": "linux-64/0.11.1-py36hc560c46_1",
                     "filename": "lifetimes-0.11.1-py36hc560c46_1.tar.bz2",
                     "checksums": {
                         "md5": "c53a689a4c5948e84211bdfc23e3fe68",
                         "sha256": "76146c2ebd6e3b65928bde53a2585287759d77beba785c0eeb889ee565c0035d",  # noqa: B950
                     },
                 },
             ],
         },
     ]

+
+def test_conda_lister_free_channel(datadir, requests_mock_datadir, swh_scheduler):
+    lister = CondaLister(
+        scheduler=swh_scheduler, channel="free", archs=["linux-64", "osx-64", "win-64"]
+    )
+    res = lister.run()
+
+    assert res.pages == 3
+    assert res.origins == 11
+
+
+def test_conda_lister_conda_forge_channel(
+    requests_mock_datadir, swh_scheduler, expected_origins
+):
+    lister = CondaLister(
+        scheduler=swh_scheduler,
+        url="https://conda.anaconda.org",
+        channel="conda-forge",
+        archs=["linux-64"],
+    )
+    res = lister.run()
+
+    assert res.pages == 1
+    assert res.origins == 2
+
+    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
+
+    assert len(scheduler_origins) == len(expected_origins)

     assert [
         (
             scheduled.visit_type,
             scheduled.url,
             scheduled.extra_loader_arguments["artifacts"],
         )
         for scheduled in sorted(scheduler_origins, key=lambda scheduled: scheduled.url)
     ] == [
         (
             "conda",
             expected["url"],
             expected["artifacts"],
         )
         for expected in sorted(expected_origins, key=lambda expected: expected["url"])
     ]
+
+
+def test_conda_lister_number_of_yielded_origins(
+    requests_mock_datadir, swh_scheduler, expected_origins
+):
+    """Check that a single ListedOrigin instance is sent per expected origin."""
+    lister = CondaLister(
+        scheduler=swh_scheduler,
+        url="https://conda.anaconda.org",
+        channel="conda-forge",
+        archs=["linux-64"],
+    )
+
+    listed_origins = []
+    for page in lister.get_pages():
+        listed_origins += list(lister.get_origins_from_page(page))
+
+    assert sorted([listed_origin.url for listed_origin in listed_origins]) == sorted(
+        [origin["url"] for origin in expected_origins]
+    )
diff --git a/swh/lister/nixguix/lister.py b/swh/lister/nixguix/lister.py
index 1dbd4de..0b8e8be 100644
--- a/swh/lister/nixguix/lister.py
+++ b/swh/lister/nixguix/lister.py
@@ -1,490 +1,576 @@
 # Copyright (C) 2020-2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information

 """NixGuix lister definition.

 This lists artifacts out of manifest for Guix or Nixpkgs manifests.
Artifacts can be of types: - upstream git repository (NixOS/nixpkgs, Guix) - VCS repositories (svn, git, hg, ...) - unique file - unique tarball """ import base64 import binascii from dataclasses import dataclass from enum import Enum import logging from pathlib import Path import random +import re from typing import Any, Dict, Iterator, List, Optional, Tuple, Union from urllib.parse import parse_qsl, urlparse import requests from requests.exceptions import ConnectionError, InvalidSchema, SSLError from swh.core.github.utils import GitHubSession from swh.core.tarball import MIMETYPE_TO_ARCHIVE_FORMAT from swh.lister import TARBALL_EXTENSIONS from swh.lister.pattern import CredentialsType, StatelessLister from swh.scheduler.model import ListedOrigin logger = logging.getLogger(__name__) +# By default, ignore binary files and archives containing binaries +DEFAULT_EXTENSIONS_TO_IGNORE = [ + "AppImage", + "bin", + "exe", + "iso", + "linux64", + "msi", + "png", + "dic", + "deb", + "rpm", +] + + class ArtifactNatureUndetected(ValueError): """Raised when a remote artifact's nature (tarball, file) cannot be detected.""" pass class ArtifactNatureMistyped(ValueError): """Raised when a remote artifact is neither a tarball nor a file. Error of this type are' probably a misconfiguration in the manifest generation that badly typed a vcs repository. """ pass class ArtifactWithoutExtension(ValueError): - """Raised when an artifact nature cannot be determined by its name. - - This exception is solely for internal use of the :meth:`is_tarball` method. - - """ + """Raised when an artifact nature cannot be determined by its name.""" pass class ChecksumsComputation(Enum): """The possible artifact types listed out of the manifest.""" STANDARD = "standard" """Standard checksums (e.g. sha1, sha256, ...) on the tarball or file.""" NAR = "nar" """The hash is computed over the NAR archive dump of the output (e.g. 
uncompressed directory.)""" MAPPING_CHECKSUMS_COMPUTATION = { "flat": ChecksumsComputation.STANDARD, "recursive": ChecksumsComputation.NAR, } """Mapping between the outputHashMode from the manifest and how to compute checksums.""" @dataclass class Artifact: """Metadata information on Remote Artifact with url (tarball or file).""" origin: str """Canonical url to retrieve the tarball artifact.""" visit_type: str """Either 'directory' (tarball) or 'content' (file)""" fallback_urls: List[str] """List of urls to retrieve tarball artifact if canonical url no longer works.""" checksums: Dict[str, str] """Integrity hash converted into a checksum dict.""" checksums_computation: ChecksumsComputation """Checksums computation mode to provide to loaders (e.g. nar, standard, ...)""" @dataclass class VCS: """Metadata information on VCS.""" origin: str """Origin url of the vcs""" type: str """Type of (d)vcs, e.g. svn, git, hg, ...""" ref: Optional[str] = None """Reference, either a svn commit id, a git commit, ...""" class ArtifactType(Enum): """The possible artifact types listed out of the manifest.""" ARTIFACT = "artifact" VCS = "vcs" PageResult = Tuple[ArtifactType, Union[Artifact, VCS]] VCS_SUPPORTED = ("git", "svn", "hg") # Rough approximation of what we can find of mimetypes for tarballs "out there" POSSIBLE_TARBALL_MIMETYPES = tuple(MIMETYPE_TO_ARCHIVE_FORMAT.keys()) +PATTERN_VERSION = re.compile(r"(v*[0-9]+[.])([0-9]+[.]*)+") + + +def url_endswith( + urlparsed, extensions: List[str], raise_when_no_extension: bool = True +) -> bool: + """Determine whether urlparsed ends with one of the extensions passed as parameter. + + This also accounts for the edge case of a filename with only a version as name (so no + extension at the end).
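The behaviour specified here (check the url path itself plus every query-string value, so `download.php?file=one.tar.gz` is still recognized) can be condensed into a small sketch. This is a simplification of `url_endswith`, not the implementation itself, and the extension tuple is illustrative:

```python
from pathlib import Path
from urllib.parse import parse_qsl, urlparse


def ends_with(url: str, extensions: tuple) -> bool:
    # Inspect the url path itself plus every query-string value,
    # so "download.php?file=one.tar.gz" is still caught.
    parsed = urlparse(url)
    candidates = [parsed.path] + [value for _, value in parse_qsl(parsed.query)]
    return any(Path(c).suffix.endswith(extensions) for c in candidates)


print(ends_with("https://example.org/one.tar.gz", (".gz",)))           # True
print(ends_with("https://example.org/get?file=one.tar.gz", (".gz",)))  # True
print(ends_with("https://example.org/one.lisp", (".gz",)))             # False
```

The real `url_endswith` below adds the version-only-filename edge case (`PATTERN_VERSION`) and the optional `ArtifactWithoutExtension` raise on top of this core idea.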
+ + Raises: + ArtifactWithoutExtension in case no extension is available and + raise_when_no_extension is True (the default) + + """ + paths = [Path(p) for (_, p) in [("_", urlparsed.path)] + parse_qsl(urlparsed.query)] + if raise_when_no_extension and not any(path.suffix != "" for path in paths): + raise ArtifactWithoutExtension + match = any(path.suffix.endswith(tuple(extensions)) for path in paths) + if match: + return match + # Some false negatives can happen (e.g. https:///path/0.1.5), so make sure + # to catch those + name = Path(urlparsed.path).name + if not PATTERN_VERSION.match(name): + return match + if raise_when_no_extension: + raise ArtifactWithoutExtension + return False + + def is_tarball(urls: List[str], request: Optional[Any] = None) -> Tuple[bool, str]: """Determine whether a list of files actually are tarballs or simple files. When this cannot be answered simply out of the url and a request object is provided, this executes a HTTP `HEAD` query on the url to determine the information. If request is not provided, this raises an ArtifactNatureUndetected exception. Args: urls: name of the remote files for which the extension needs to be checked. Raises: ArtifactNatureUndetected when the artifact's nature cannot be detected out of its url ArtifactNatureMistyped when the artifact is neither a tarball nor a file. It's up to the caller to do what's right with it. Returns: A tuple (bool, url). The boolean represents whether the url is an archive or not. The second element is the actual url, possibly resolved through a HEAD request when the extension alone was not conclusive. """ def _is_tarball(url): """Determine out of an extension whether url is a tarball.
Raises: ArtifactWithoutExtension in case no extension is available """ urlparsed = urlparse(url) if urlparsed.scheme not in ("http", "https", "ftp"): raise ArtifactNatureMistyped(f"Mistyped artifact '{url}'") - - paths = [ - Path(p) for (_, p) in [("_", urlparsed.path)] + parse_qsl(urlparsed.query) - ] - if not any(path.suffix != "" for path in paths): - raise ArtifactWithoutExtension - return any(path.suffix.endswith(tuple(TARBALL_EXTENSIONS)) for path in paths) + return url_endswith(urlparsed, TARBALL_EXTENSIONS) index = random.randrange(len(urls)) url = urls[index] try: return _is_tarball(url), urls[0] except ArtifactWithoutExtension: if request is None: raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) logger.warning( "Cannot detect extension for <%s>. Fallback to http head query", url, ) try: response = request.head(url) except (InvalidSchema, SSLError, ConnectionError): raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) if not response.ok or response.status_code == 404: raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) location = response.headers.get("Location") if location: # It's not always present logger.debug("Location: %s", location) try: # FIXME: location is also returned as it's considered the true origin, # true enough? 
return _is_tarball(location), location except ArtifactWithoutExtension: logger.warning( "Still cannot detect extension through location <%s>...", url, ) + origin = urls[0] + content_type = response.headers.get("Content-Type") if content_type: logger.debug("Content-Type: %s", content_type) if content_type == "application/json": - return False, urls[0] - return content_type.startswith(POSSIBLE_TARBALL_MIMETYPES), urls[0] + return False, origin + return content_type.startswith(POSSIBLE_TARBALL_MIMETYPES), origin + + content_disposition = response.headers.get("Content-Disposition") + if content_disposition: + logger.debug("Content-Disposition: %s", content_disposition) + if "filename=" in content_disposition: + fields = content_disposition.split("; ") + for field in fields: + if "filename=" in field: + _, filename = field.split("filename=") + break + + return ( + url_endswith( + urlparse(filename), + TARBALL_EXTENSIONS, + raise_when_no_extension=False, + ), + origin, + ) raise ArtifactNatureUndetected( f"Cannot determine artifact type from url <{url}>" ) VCS_KEYS_MAPPING = { "git": { "ref": "git_ref", "url": "git_url", }, "svn": { "ref": "svn_revision", "url": "svn_url", }, "hg": { "ref": "hg_changeset", "url": "hg_url", }, } class NixGuixLister(StatelessLister[PageResult]): """List Guix or Nix sources out of a public json manifest. This lister can output: - unique tarball (.tar.gz, .tbz2, ...) - vcs repositories (e.g. git, hg, svn) - unique file (.lisp, .py, ...) Note that no `last_update` is available in either manifest. For `url` type artifacts, this tries to determine the artifact's nature, tarball or file. It first tries to determine it out of the "url" extension. In case there is no extension, it falls back to querying (HEAD) the url to retrieve the origin out of the `Location` response header, and then checks the extension again.
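The detection order described here (url extension first, then an HTTP HEAD whose `Location`, `Content-Type` and `Content-Disposition` headers are tried in turn) can be summarized with a small, hypothetical helper; the real logic lives in `is_tarball` above, and the suffix/mimetype lists here are an illustrative subset:

```python
TARBALL_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip")  # illustrative subset


def looks_like_tarball(url: str, headers: dict) -> bool:
    # 1. Cheap check: does the url itself carry a tarball extension?
    if url.endswith(TARBALL_SUFFIXES):
        return True
    # 2. A redirect target (Location) may expose the real filename.
    if headers.get("Location", "").endswith(TARBALL_SUFFIXES):
        return True
    # 3. The Content-Type mimetype, e.g. application/x-tar.
    if headers.get("Content-Type", "").startswith(("application/x-tar", "application/zip")):
        return True
    # 4. Content-Disposition may carry a filename=... field.
    disposition = headers.get("Content-Disposition", "")
    return "filename=" in disposition and disposition.split("filename=")[-1].endswith(
        TARBALL_SUFFIXES
    )


print(looks_like_tarball("https://example.org/dl", {"Location": "https://example.org/pkg-1.0.tar.gz"}))  # True
```

Unlike this sketch, the production code also reuses `url_endswith` on the `Content-Disposition` filename, so query-string filenames are handled there too.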
+ Optionally, when the `extensions_to_ignore` parameter is provided, it extends the + default extensions to ignore (`DEFAULT_EXTENSIONS_TO_IGNORE`) with those passed. + This can be used to drop further binary files detected in the wild. + """ LISTER_NAME = "nixguix" def __init__( self, scheduler, url: str, origin_upstream: str, instance: Optional[str] = None, credentials: Optional[CredentialsType] = None, # canonicalize urls, can be turned off during docker runs canonicalize: bool = True, + extensions_to_ignore: List[str] = [], **kwargs: Any, ): super().__init__( scheduler=scheduler, url=url.rstrip("/"), instance=instance, credentials=credentials, ) # either full fqdn NixOS/nixpkgs or guix repository urls # maybe add an assert on those specific urls? self.origin_upstream = origin_upstream + self.extensions_to_ignore = DEFAULT_EXTENSIONS_TO_IGNORE + extensions_to_ignore self.session = requests.Session() # for testing purposes, we may want to skip this step (e.g. docker run and rate # limit) self.github_session = ( GitHubSession( credentials=self.credentials, user_agent=str(self.session.headers["User-Agent"]), ) if canonicalize else None ) def build_artifact( self, artifact_url: str, artifact_type: str, artifact_ref: Optional[str] = None ) -> Optional[Tuple[ArtifactType, VCS]]: """Build a canonicalized vcs artifact when possible.""" origin = ( self.github_session.get_canonical_url(artifact_url) if self.github_session else artifact_url ) if not origin: return None return ArtifactType.VCS, VCS( origin=origin, type=artifact_type, ref=artifact_ref ) def get_pages(self) -> Iterator[PageResult]: """Yield one page per "typed" origin referenced in manifest.""" # fetch and parse the manifest... response = self.http_request(self.url) # ...
if any raw_data = response.json() yield ArtifactType.VCS, VCS(origin=self.origin_upstream, type="git") # grep '"type"' guix-sources.json | sort | uniq # "type": false <<<<<<<<< noise # "type": "git", # "type": "hg", # "type": "no-origin", <<<<<<<<< noise # "type": "svn", # "type": "url", # grep '"type"' nixpkgs-sources-unstable.json | sort | uniq # "type": "url", sources = raw_data["sources"] random.shuffle(sources) for artifact in sources: artifact_type = artifact["type"] if artifact_type in VCS_SUPPORTED: plain_url = artifact[VCS_KEYS_MAPPING[artifact_type]["url"]] plain_ref = artifact[VCS_KEYS_MAPPING[artifact_type]["ref"]] built_artifact = self.build_artifact( plain_url, artifact_type, plain_ref ) if not built_artifact: continue yield built_artifact elif artifact_type == "url": # It's either a tarball or a file origin_urls = artifact.get("urls") if not origin_urls: # Nothing to fetch logger.warning("Skipping url <%s>: empty artifact", artifact) continue assert origin_urls is not None # Deal with urls with empty scheme (basic fallback to http) urls = [] for url in origin_urls: urlparsed = urlparse(url) if urlparsed.scheme == "": logger.warning("Missing scheme for <%s>: fallback to http", url) fixed_url = f"http://{url}" else: fixed_url = url urls.append(fixed_url) origin, *fallback_urls = urls if origin.endswith(".git"): built_artifact = self.build_artifact(origin, "git") if not built_artifact: continue yield built_artifact continue outputHash = artifact.get("outputHash") integrity = artifact.get("integrity") if integrity is None and outputHash is None: logger.warning( "Skipping url <%s>: missing integrity and outputHash field", origin, ) continue # Falls back to outputHash field if integrity is missing if integrity is None and outputHash: # We'll deal with outputHash as integrity field integrity = outputHash try: is_tar, origin = is_tarball(urls, self.session) except ArtifactNatureMistyped: logger.warning( "Mistyped url <%s>: trying to deal with it properly", 
origin ) urlparsed = urlparse(origin) artifact_type = urlparsed.scheme if artifact_type in VCS_SUPPORTED: built_artifact = self.build_artifact(origin, artifact_type) if not built_artifact: continue yield built_artifact else: logger.warning( "Skipping url <%s>: undetected remote artifact type", origin ) continue except ArtifactNatureUndetected: logger.warning( "Skipping url <%s>: undetected remote artifact type", origin ) continue # Determine the content checksum stored in the integrity field and # convert into a dict of checksums. This only parses the # `hash-expression` (hash-) as defined in # https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute try: chksum_algo, chksum_b64 = integrity.split("-") checksums: Dict[str, str] = { chksum_algo: base64.decodebytes(chksum_b64.encode()).hex() } except binascii.Error: logger.exception( "Skipping url: <%s>: integrity computation failure for <%s>", url, artifact, ) continue # The 'outputHashMode' attribute determines how the hash is computed. It # must be one of the following two values: # - "flat": (default) The output must be a non-executable regular file. # If it isn’t, the build fails. The hash is simply computed over the # contents of that file (so it’s equal to what Unix commands like # `sha256sum` or `sha1sum` produce). # - "recursive": The hash is computed over the NAR archive dump of the # output (i.e., the result of `nix-store --dump`). In this case, # the output can be anything, including a directory tree. outputHashMode = artifact.get("outputHashMode", "flat") if not is_tar and outputHashMode == "recursive": # T4608: Cannot deal with those properly yet as some can be missing # 'critical' information about how to recompute the hash (e.g. fs # layout, executable bit, ...) 
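The integrity handling above splits the SRI-style `<algo>-<base64>` string and stores the decoded digest as hex. In isolation, the conversion looks like this (a slight variation on the code above, using `maxsplit=1` defensively so the split cannot eat into the digest):

```python
import base64


def integrity_to_checksums(integrity: str) -> dict:
    # SRI "hash-expression": "<algo>-<base64 digest>", e.g. "sha256-...=".
    algo, b64_digest = integrity.split("-", 1)
    return {algo: base64.decodebytes(b64_digest.encode()).hex()}


# Digest bytes chosen for a readable result; real manifests carry sha256/sha512
# digests of the artifact's content.
digest = bytes(range(4))
integrity = "sha256-" + base64.b64encode(digest).decode()
print(integrity_to_checksums(integrity))  # {'sha256': '00010203'}
```

A malformed base64 payload raises `binascii.Error`, which is exactly what the lister catches above to skip the artifact rather than crash.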
logger.warning( - "Skipping artifact <%s>: 'file' artifact of type <%s> is " + "Skipping artifact <%s>: 'file' artifact of type <%s> is" " missing information to properly check its integrity", artifact, artifact_type, ) continue + # At this point plenty of heuristics happened and we should have found + # the right origin and its nature. + + # Let's check and filter it out if it is to be ignored (if possible). + # Some origin urls may not have extension at this point (e.g + # http://git.marmaro.de/?p=mmh;a=snp;h=;sf=tgz), let them through. + if url_endswith( + urlparse(origin), + self.extensions_to_ignore, + raise_when_no_extension=False, + ): + logger.warning( + "Skipping artifact <%s>: 'file' artifact of type <%s> is" + " ignored due to lister configuration. It should ignore" + " origins with extension [%s]", + origin, + artifact_type, + ",".join(self.extensions_to_ignore), + ) + continue + logger.debug("%s: %s", "dir" if is_tar else "cnt", origin) yield ArtifactType.ARTIFACT, Artifact( origin=origin, fallback_urls=fallback_urls, checksums=checksums, checksums_computation=MAPPING_CHECKSUMS_COMPUTATION[outputHashMode], visit_type="directory" if is_tar else "content", ) else: logger.warning( "Skipping artifact <%s>: unsupported type %s", artifact, artifact_type, ) def vcs_to_listed_origin(self, artifact: VCS) -> Iterator[ListedOrigin]: """Given a vcs repository, yield a ListedOrigin.""" assert self.lister_obj.id is not None # FIXME: What to do with the "ref" (e.g. git/hg/svn commit, ...) 
yield ListedOrigin( lister_id=self.lister_obj.id, url=artifact.origin, visit_type=artifact.type, ) def artifact_to_listed_origin(self, artifact: Artifact) -> Iterator[ListedOrigin]: """Given an artifact (tarball, file), yield one ListedOrigin.""" assert self.lister_obj.id is not None yield ListedOrigin( lister_id=self.lister_obj.id, url=artifact.origin, visit_type=artifact.visit_type, extra_loader_arguments={ "checksums": artifact.checksums, "checksums_computation": artifact.checksums_computation.value, "fallback_urls": artifact.fallback_urls, }, ) def get_origins_from_page( self, artifact_tuple: PageResult ) -> Iterator[ListedOrigin]: """Given an artifact tuple (type, artifact), yield a ListedOrigin.""" artifact_type, artifact = artifact_tuple mapping_type_fn = getattr(self, f"{artifact_type.value}_to_listed_origin") yield from mapping_type_fn(artifact) diff --git a/swh/lister/nixguix/tests/data/sources-failure.json b/swh/lister/nixguix/tests/data/sources-failure.json index e0844af..237a018 100644 --- a/swh/lister/nixguix/tests/data/sources-failure.json +++ b/swh/lister/nixguix/tests/data/sources-failure.json @@ -1,64 +1,181 @@ { "sources": [ {"type": "git", "git_url": "", "git_ref": ""}, {"type": false}, {"type": "no-origin"}, {"type": "url", "urls": []}, { "type": "url", "urls": ["https://crates.io/api/v1/0.1.5/no-extension-and-head-404-so-skipped"], "integrity": "sha256-HW6jxFlbljY8E5Q0l9s0r0Rg+0dKlcQ/REatNBuMl4U=" }, { "type": "url", "urls": [ "https://example.org/another-file-no-integrity-so-skipped.txt" ] }, { "type": "url", "urls": [ "ftp://ftp.ourproject.org/file-with-no-extension" ], "integrity": "sha256-bss09x9yOnuW+Q5BHHjf8nNcCNxCKMdl9/2/jKSFcrQ=" }, { "type": "url", "urls": [ "https://git-tails.immerda.ch/onioncircuits" ], "integrity": "sha256-lV3xiWUZmSnt4LW0ni/sUyC/bbtaxkTzvFLFtJKLuI4=" }, { "outputHash": "sha256-9uF0fYl4Zz/Ia2UKx7CBi8ZU8jfWoBfy2QSgTSwXo5A", "outputHashAlgo": null, "outputHashMode": "recursive", "type": "url", "urls": [ 
"https://github.com/figiel/hosts/archive/v1.0.0.tar.gz" ], "inferredFetcher": "fetchzip" }, { "outputHash": "0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522", "outputHashAlgo": "sha256", "outputHashMode": "recursive", "type": "url", "urls": [ "https://www.unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt" ], "integrity": "sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg=", "inferredFetcher": "unclassified" }, { "type": "url", "urls": [ "unknown://example.org/wrong-scheme-so-skipped.txt" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" }, { "type": "url", "urls": [ "https://code.9front.org/hg/plan9front" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" + }, + { + "outputHash": "sha256-IgPqUEDpaIuGoaGoH2GCEzh3KxF3pkJC3VjTYXwSiQE=", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://github.com/KSP-CKAN/CKAN/releases/download/v1.30.4/ckan.exe" + ], + "integrity": "sha256-IgPqUEDpaIuGoaGoH2GCEzh3KxF3pkJC3VjTYXwSiQE=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "sha256-ezJN/t0iNk0haMLPioEQSNXU4ugVeJe44GNVGd+cOF4=", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://github.com/johannesjo/super-productivity/releases/download/v7.5.1/superProductivity-7.5.1.AppImage" + ], + "integrity": "sha256-ezJN/t0iNk0haMLPioEQSNXU4ugVeJe44GNVGd+cOF4=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "19ir6x4c01825hpx2wbbcxkk70ymwbw4j03v8b2xc13ayylwzx0r", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "http://gorilla.dp100.com/downloads/gorilla1537_64.bin" + ], + "integrity": "sha256-GfTPqfdqBNbFQnsASfji1YMzZ2drcdEvLAIFwEg3OaY=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "1zj53xybygps66m3v5kzi61vqy987zp6bfgk0qin9pja68qq75vx", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + 
"https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/virtio-win-0.1.196-1/virtio-win.iso" + ], + "integrity": "sha256-fZeDMTJK3mQjBvO5Ze4/KHm8g4l/lj2qMfo+v3wfRf4=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "02qgsj4h4zrjxkcclx7clsqbqd699kg0dq1xxa9hbj3vfnddjv1f", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://www.pjrc.com/teensy/td_153/TeensyduinoInstall.linux64" + ], + "integrity": "sha256-LmzZmnV7yAWT6j3gBt5MyTS8sKbsdMrY7DJ/AonUDws=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "sha256-24uF87kQWQ9hrb+gAFqZXWE+KZocxz0AVT1w3IEBDjY=", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://dl.winehq.org/wine/wine-mono/6.4.0/wine-mono-6.4.0-x86.msi" + ], + "integrity": "sha256-24uF87kQWQ9hrb+gAFqZXWE+KZocxz0AVT1w3IEBDjY=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "00y96w9shbbrdbf6xcjlahqd08154kkrxmqraik7qshiwcqpw7p4", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://raw.githubusercontent.com/webtorrent/webtorrent-desktop/v0.21.0/static/linux/share/icons/hicolor/48x48/apps/webtorrent-desktop.png" + ], + "integrity": "sha256-5B5+MeMRanxmVBnXnuckJSDQMFRUsm7canktqBM3yQM=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "0lw193jr7ldvln5x5z9p21rz1by46h0say9whfcw2kxs9vprd5b3", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "http://xuxen.eus/static/hunspell/eu_ES.dic" + ], + "integrity": "sha256-Y5WW7066T8GZgzx5pQE0xK/wcxA3/dKLpbvRk+VIgVM=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "0wbhvypdr96a5ddg6kj41dn9sbl49n7pfi2vs762ij82hm2gvwcm", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://www.openprinting.org/download/printdriver/components/lsb3.2/main/RPMS/noarch/openprinting-ppds-postscript-lexmark-20160218-1lsb3.2.noarch.rpm" + ], + 
"integrity": "sha256-lfH9RIUCySjM0VtEd49NhC6dbAtETvNaK8qk3K7fcHE=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "01gy84gr0gw5ap7hpy72azaf6hlzac7vxkn5cgad5sfbyzxgjgc9", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://wire-app.wire.com/linux/debian/pool/main/Wire-3.26.2941_amd64.deb" + ], + "integrity": "sha256-iT35+vfL6dLUY8XOvg9Tn0Lj1Ffi+AvPVYU/kB9B/gU=", + "inferredFetcher": "unclassified" + }, + { + "type": "url", + "urls": [ + "https://elpa.gnu.org/packages/zones.foobar" + ], + "integrity": "sha256-YRZc7dI3DjUzoSIp4fIshUyhMXIQ/fPKaKnjeYVa4WI=" } ], "version":"1", "revision":"ab59155c5a38dda7efaceb47c7528578fcf0def4" } diff --git a/swh/lister/nixguix/tests/data/sources-success.json b/swh/lister/nixguix/tests/data/sources-success.json index bb1943c..05fdd79 100644 --- a/swh/lister/nixguix/tests/data/sources-success.json +++ b/swh/lister/nixguix/tests/data/sources-success.json @@ -1,107 +1,293 @@ { "sources": [ { "type": "url", "urls": [ "https://github.com/owner-1/repository-1/revision-1.tgz" ], "integrity": "sha256-3vm2Nt+O4zHf3Ovd/qsv1gKTEUwodX9FLxlrQdry0zs=" }, { "type": "url", - "urls": [ "https://github.com/owner-3/repository-1/revision-1.tgz" ], + "urls": [ "https://github.com/owner-3/repository-1/revision-1.tar" ], "integrity": "sha256-3vm2Nt+O4zHf3Ovd/qsv1gKTEUwodX9FLxlrQdry0zs=" }, { "type": "url", "urls": [ "https://example.com/file.txt" ], "integrity": "sha256-Q0copBCnj1b8G1iZw1k0NuYasMcx6QctleltspAgXlM=" }, { "type": "url", "urls": [ "https://releases.wildfiregames.com/0ad-0.0.25b-alpha-unix-build.tar.xz" ], "integrity": "sha256-1w3NdfRzp9XIFDLD2SYJJr+Nnf9c1UF5YWlJfRxSLt0=" }, { "type": "url", "urls": [ "ftp://ftp.ourproject.org/pub/ytalk/ytalk-3.3.0.tar.gz" ], "integrity": "sha256-bss09x9yOnuW+Q5BHHjf8nNcCNxCKMdl9/2/jKSFcrQ=" }, { "type": "url", "urls": [ "www.roudoudou.com/export/cpc/rasm/rasm_v0117_src.zip" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" }, { 
"type": "url", "outputHashMode": "flat", "urls": [ "http://downloads.sourceforge.net/project/nmon/lmon16n.c", "http://ufpr.dl.sourceforge.net/project/nmon/lmon16n.c", "http://netassist.dl.sourceforge.net/project/nmon/lmon16n.c" ], "integrity": "sha256-wAEswtkl3ulAw3zq4perrGS6Wlww5XXnQYsEAoYT9fI=" }, { "outputHash": "0s7p9swjqjsqddylmgid6cv263ggq7pmb734z4k84yfcrgb6kg4g", "outputHashAlgo": "sha256", "outputHashMode": "recursive", "type": "url", "urls": [ - "https://github.com/kandu/trie/archive/1.0.0.tar.gz" + "https://github.com/kandu/trie/archive/1.0.0.txz" ], "integrity": "sha256-j7xp1svMeYIm+WScVe/B7w0jNjMtvkp9a1hLLLlO92g=", "inferredFetcher": "fetchzip" }, { "type": "url", "urls": [ "https://github.com/trie/trie.git" ], "integrity": "sha256-j7xp1svMeYIm+WScVe/B7w0jNjMtvkp9a1hLLLlO92g=" }, { "type": "git", "git_url": "https://example.org/pali/0xffff", "git_ref": "0.9" }, { "type": "hg", "hg_url": "https://example.org/vityok/cl-string-match", "hg_changeset": "5048480a61243e6f1b02884012c8f25cdbee6d97" }, { "type": "svn", "svn_url": "https://code.call-cc.org/svn/chicken-eggs/release/5/iset/tags/2.2", "svn_revision": 39057 }, { "outputHash": "sha256-LxVcYj2WKHbhNu5x/DFkxQPOYrVkNvwiE/qcODq52Lc=", "outputHashAlgo": null, "outputHashMode": "recursive", "type": "url", "urls": [ - "https://github.com/julian-klode/triehash/archive/debian/0.3-3.tar.gz" + "https://github.com/julian-klode/triehash/archive/debian/0.3-3.tbz" ], "inferredFetcher": "fetchzip" }, { "type": "url", "urls": [ "http://git.marmaro.de/?p=mmh;a=snapshot;h=431604647f89d5aac7b199a7883e98e56e4ccf9e;sf=tgz" ], "integrity": "sha256-G/7oY5qdCSJ59VlwHtIbvMdT6+mriXhMqQIHNx65J+E=" }, { "type": "url", "urls": ["svn://svn.code.sf.net/p/acme-crossass/code-0/trunk"], "integrity": "sha256-VifIQ+UEVMKJ+cNS+Xxusazinr5Cgu1lmGuhqj/5Mpk=" + }, + { + "outputHash": "0w2qkrrkzfy4h4jld18apypmbi8a8r89y2l11axlv808i2rg68fk", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + 
"https://github.com/josefnpat/vapor/releases/download/0.2.3/vapor_dbf509f.love" + ], + "integrity": "sha256-0yHzsogIoE27CoEKn1BGCsVVr78KhUYlgcS7P3OeWHA=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "0rf06axz1hxssg942w2g66avak30jy6rfdwxynhriqv3vrf17bja", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "http://mirrors.jenkins.io/war-stable/2.303.1/jenkins.war" + ], + "integrity": "sha256-Sq4TXN5j45ih9Z03l42XYEy1lTFPcEHS07rD8LsywGU=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "1filqm050ixy53kdv81bd4n80vjvfapnmzizy7jg8a6pilv17gfc", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://files.pythonhosted.org/packages/py2.py3/g/geojson/geojson-2.5.0-py2.py3-none-any.whl" + ], + "integrity": "sha256-zL0TNo3XKPTk8T/+aq9yW26ALGkroN3mKL5HUEDFNLo=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "sha256:0i1cw0nfg24b0sg2yc3q7315ng5vc5245nvh0l1cndkn2c9z4978", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://stavekontrolden.dk/dictionaries/da_DK/da_DK-2.5.189.oxt" + ], + "integrity": "sha256-6CTyExN2NssCBXDbQkRhuzxbwjh4MC+eBouI5yzgLEQ=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "0y2HN4WGYUUXBfqp8Xb4oaA0hbLZmE3kDUXMBAOjvPQ=", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://github.com/microsoft/vscode-python/releases/download/2021.5.829140558/ms-python-release.vsix" + ], + "integrity": "sha256-0y2HN4WGYUUXBfqp8Xb4oaA0hbLZmE3kDUXMBAOjvPQ=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "08dfl5h1k6s542qw5qx2czm1wb37ck9w2vpjz44kp2az352nmksb", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://zxh404.gallery.vsassets.io/_apis/public/gallery/publisher/zxh404/extension/vscode-proto3/0.5.4/assetbyname/Microsoft.VisualStudio.Services.VSIXPackage" + ], + "integrity": 
"sha256-S89qRRlfiTsJ+fJuwdNkZywe6mei48KxIEWbGWChriE=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "0kaz8j85wjjnf18z0lz69xr1z8makg30jn2dzdyicd1asrj0q1jm", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://github.com/yvt/openspades/releases/download/v0.1.1b/NotoFonts.pak" + ], + "integrity": "sha256-VQYMZNYqNBZ9+01YCcabqqIfck/mU/BRcFZKXpBEX00=", + "inferredFetcher": "unclassified" + }, + { + "type": "url", + "urls": [ + "https://crates.io/api/v1/crates/syntect/4.6.0/download" + ], + "integrity": "sha256-iyCBW76A7gvgbmlXRQqEEYX89pD+AXjxTXegXOLKoDE=" + }, + { + "outputHash": "0x5l2pn4x92734k6i2wcjbn2klmwgkiqaajvxadh35k74dgnyh18", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://rubygems.org/gems/wdm-0.1.1.gem" + ], + "integrity": "sha256-KEBvXyNnlgGb6lsqheN8vNIp7JKMi2gmGUekTuwVtHQ=", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "2al10188nwrdmi9zk3bid4ijjfsa8ymh6m9hin5jsja7hx7anbvs3i2y7kall56h4qn7j1rj73f8499x3i2k6x53kszmksvd2a1pkd4", + "outputHashAlgo": "sha512", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://repo1.maven.org/maven2/org/codehaus/plexus/plexus-compiler-manager/2.4/plexus-compiler-manager-2.4.jar" + ], + "integrity": "sha512-pM0blGhbz/r1HKWbKeLoKRHkxpE5yGMxgaZQqubxIg69l1Wnw6OklsVGmKqB1SOlnZSRtLjG/CnWlrlFKIBAlQ==", + "inferredFetcher": "unclassified" + }, + { + "outputHash": "19mnq9a1yr16srqs8n6hddahr4f9d2gbpmld62pvlw1ps7nfrp9w", + "outputHashAlgo": "sha256", + "outputHashMode": "recursive", + "type": "url", + "urls": [ + "https://bitbucket.org/zandoye/charinfo_width/get/1.1.0.tar.bz2" + ], + "integrity": "sha256-PN3s7NE3cLqvMI3Wu55oyZEMVWvQWKRx1iZkH1TCtqY=", + "inferredFetcher": "fetchzip" + }, + { + "type": "url", + "urls": [ + "https://ftpmirror.gnu.org/gnu/texinfo/texinfo-4.13a.tar.lzma", + "ftp://ftp.cs.tu-berlin.de/pub/gnu/texinfo/texinfo-4.13a.tar.lzma" + ], + "integrity": 
"sha256-bSiwzq6GbjU2FC/FUuejvJ+EyDAxGcJXMbJHju9kyeU=" + }, + { + "type": "url", + "urls": [ + "https://download.savannah.gnu.org/releases/zutils/zutils-1.10.tar.lz", + "https://nongnu.freemirror.org/nongnu/zutils/zutils-1.10.tar.lz" + ], + "integrity": "sha256-DdRBOCktV1dkgDcZW2lFw99wsxYiG0KFUgrTjy6usZU=" + }, + { + "type": "url", + "urls": [ + "http://www.rle.mit.edu/cpg/codes/fasthenry-3.0-12Nov96.tar.z" + ], + "integrity": "sha256-8V9YKMP4A50xYvmFlzh5sbQv6L39hD+znfAD0rzvBqg=" + }, + { + "type": "url", + "urls": [ + "http://ftp.x.org/contrib/utilities/unclutter-8.tar.Z" + ], + "integrity": "sha256-uFWnjURlqy+GKH6srGOnPxUEsIUihAqjdxh3bn7JGSo=" + }, + { + "outputHash": "sha256-Y40oLjddunrd7ZF1JbCcgjSCn8jFTubq69jhAVxInXw=", + "outputHashAlgo": "sha256", + "outputHashMode": "flat", + "type": "url", + "urls": [ + "https://github.com/vk-cli/vk/releases/download/0.7.6/vk-0.7.6-64-bin.7z" + ], + "integrity": "sha256-Y40oLjddunrd7ZF1JbCcgjSCn8jFTubq69jhAVxInXw=", + "inferredFetcher": "unclassified" + }, + { + "type": "url", + "urls": [ + "https://github.com/Doom-Utils/deutex/releases/download/v5.2.2/deutex-5.2.2.tar.zst" + ], + "integrity": "sha256-EO0OelM+yXy20DVI1CWPvsiIUqRbXqTPVDQ3atQXS18=" + }, + { + "type": "url", + "urls": [ + "https://codeload.github.com/fifengine/fifechan/tar.gz/0.1.5" + ], + "integrity": "sha256-Kb5f9LN54vxPiO99i8FyNCEw3T53owYfZMinXv5OunM=" + }, + { + "type": "url", + "urls": [ + "https://codeload.github.com/unknown-horizons/unknown-horizons/tar.gz/2019.1" + ], + "integrity": "sha256-pBf9PTQiEv0ZDk8hvoLvE8EOHtfCiPu+RuRiAM895Ng=" + }, + { + "type": "url", + "urls": [ + "https://codeload.github.com/fifengine/fifengine/tar.gz/0.4.2" + ], + "integrity": "sha256-6IK1W++jauLxqJraFq8PgUobePfL5gIexbFgVgTPj/g=" } ], "version": "1", "revision": "cc4e04c26672dd74e5fd0fecb78b435fb55368f7" } diff --git a/swh/lister/nixguix/tests/test_lister.py b/swh/lister/nixguix/tests/test_lister.py index cadb65e..fdb7210 100644 --- a/swh/lister/nixguix/tests/test_lister.py 
+++ b/swh/lister/nixguix/tests/test_lister.py @@ -1,309 +1,381 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from collections import defaultdict import json import logging from pathlib import Path from typing import Dict, List +from urllib.parse import urlparse import pytest import requests from requests.exceptions import ConnectionError, InvalidSchema, SSLError from swh.lister import TARBALL_EXTENSIONS from swh.lister.nixguix.lister import ( + DEFAULT_EXTENSIONS_TO_IGNORE, POSSIBLE_TARBALL_MIMETYPES, ArtifactNatureMistyped, ArtifactNatureUndetected, + ArtifactWithoutExtension, NixGuixLister, is_tarball, + url_endswith, ) from swh.lister.pattern import ListerStats logger = logging.getLogger(__name__) SOURCES = { "guix": { "repo": "https://git.savannah.gnu.org/cgit/guix.git/", "manifest": "https://guix.gnu.org/sources.json", }, "nixpkgs": { "repo": "https://github.com/NixOS/nixpkgs", "manifest": "https://nix-community.github.io/nixpkgs-swh/sources-unstable.json", }, } def page_response(datadir, instance: str = "success") -> List[Dict]: """Return list of repositories (out of test dataset)""" datapath = Path(datadir, f"sources-{instance}.json") return json.loads(datapath.read_text()) if datapath.exists() else [] +@pytest.mark.parametrize( + "name,expected_result", + [(f"one.{ext}", True) for ext in TARBALL_EXTENSIONS] + + [(f"one.{ext}?foo=bar", True) for ext in TARBALL_EXTENSIONS] + + [(f"one?p0=1&foo=bar.{ext}", True) for ext in DEFAULT_EXTENSIONS_TO_IGNORE] + + [ + ("two?file=something.el", False), + ("foo?two=two&three=three", False), + ("v1.2.3", False), # with raise_when_no_extension is False + ("2048-game-20151026.1233", False), + ("v2048-game-20151026.1233", False), + ], +) +def test_url_endswith(name, expected_result): + """It should detect whether url or query
params of the url end with one of the extensions""" + urlparsed = urlparse(f"https://example.org/{name}") + assert ( + url_endswith( + urlparsed, + TARBALL_EXTENSIONS + DEFAULT_EXTENSIONS_TO_IGNORE, + raise_when_no_extension=False, + ) + is expected_result + ) + + +@pytest.mark.parametrize( + "name", ["foo?two=two&three=three", "tar.gz/0.1.5", "tar.gz/v10.3.1"] +) +def test_url_endswith_raise(name): + """It should raise when the tested url has no extension""" + urlparsed = urlparse(f"https://example.org/{name}") + with pytest.raises(ArtifactWithoutExtension): + url_endswith(urlparsed, ["unimportant"]) + + @pytest.mark.parametrize( "tarballs", [[f"one.{ext}", f"two.{ext}"] for ext in TARBALL_EXTENSIONS] + [[f"one.{ext}?foo=bar"] for ext in TARBALL_EXTENSIONS], ) def test_is_tarball_simple(tarballs): """Simple check on tarball should discriminate between tarball and file""" urls = [f"https://example.org/{tarball}" for tarball in tarballs] is_tar, origin = is_tarball(urls) assert is_tar is True assert origin == urls[0] @pytest.mark.parametrize( "query_param", ["file", "f", "url", "name", "anykeyreally"], ) def test_is_tarball_not_so_simple(query_param): """More involved check on tarball should discriminate between tarball and file""" url = f"https://example.org/download.php?foo=bar&{query_param}=one.tar.gz" is_tar, origin = is_tarball([url]) assert is_tar is True assert origin == url @pytest.mark.parametrize( "files", [ ["abc.lisp"], ["one.abc", "two.bcd"], ["abc.c", "other.c"], ["one.scm?foo=bar", "two.scm?foo=bar"], ["config.nix", "flakes.nix"], ], ) def test_is_tarball_simple_not_tarball(files): """Simple check on tarball should discriminate between tarball and file""" urls = [f"http://example.org/{file}" for file in files] is_tar, origin = is_tarball(urls) assert is_tar is False assert origin == urls[0] def test_is_tarball_complex_with_no_result(requests_mock): """Complex tarball detection without proper information should fail.""" # No extension, this won't detect
immediately the nature of the url url = "https://example.org/crates/package/download" urls = [url] with pytest.raises(ArtifactNatureUndetected): is_tarball(urls) # no request parameter, so no fallback is possible: it raises with pytest.raises(ArtifactNatureUndetected): requests_mock.head( url, status_code=404, # not found so cannot detect anything ) is_tarball(urls, requests) with pytest.raises(ArtifactNatureUndetected): requests_mock.head( url, headers={} ) # response ok without headers, cannot detect anything is_tarball(urls, requests) with pytest.raises(ArtifactNatureUndetected): fallback_url = "https://example.org/mirror/crates/package/download" requests_mock.head( url, headers={"location": fallback_url} # still no extension, cannot detect ) is_tarball(urls, requests) with pytest.raises(ArtifactNatureMistyped): is_tarball(["foo://example.org/unsupported-scheme"]) with pytest.raises(ArtifactNatureMistyped): fallback_url = "foo://example.org/unsupported-scheme" requests_mock.head( url, headers={"location": fallback_url} # fallback location with unsupported scheme ) is_tarball(urls, requests) @pytest.mark.parametrize( "fallback_url, expected_result", [ ("https://example.org/mirror/crates/package/download.tar.gz", True), ("https://example.org/mirror/package/download.lisp", False), ], ) def test_is_tarball_complex_with_location_result( requests_mock, fallback_url, expected_result ): """Complex tarball detection with information should detect artifact nature""" # No extension, this won't detect immediately the nature of the url url = "https://example.org/crates/package/download" urls = [url] # One scenario where the url renders a location with a proper extension requests_mock.head(url, headers={"location": fallback_url}) is_tar, origin = is_tarball(urls, requests) assert is_tar == expected_result if is_tar: assert origin == fallback_url @pytest.mark.parametrize( "content_type, expected_result", [("application/json", False), ("application/something", False)] + [(ext, True) for ext in
POSSIBLE_TARBALL_MIMETYPES], ) def test_is_tarball_complex_with_content_type_result( requests_mock, content_type, expected_result ): """Complex tarball detection with information should detect artifact nature""" # No extension, this won't detect immediately the nature of the url url = "https://example.org/crates/package/download" urls = [url] # One scenario where the response content-type header reveals the artifact nature requests_mock.head(url, headers={"Content-Type": content_type}) is_tar, origin = is_tarball(urls, requests) assert is_tar == expected_result if is_tar: assert origin == url def test_lister_nixguix_ok(datadir, swh_scheduler, requests_mock): """NixGuixLister should list all origins per visit type""" url = SOURCES["guix"]["manifest"] origin_upstream = SOURCES["guix"]["repo"] lister = NixGuixLister(swh_scheduler, url=url, origin_upstream=origin_upstream) response = page_response(datadir, "success") requests_mock.get( url, [{"json": response}], ) requests_mock.get( "https://api.github.com/repos/trie/trie", [{"json": {"html_url": "https://github.com/trie/trie.git"}}], ) requests_mock.head( "http://git.marmaro.de/?p=mmh;a=snapshot;h=431604647f89d5aac7b199a7883e98e56e4ccf9e;sf=tgz", headers={"Content-Type": "application/gzip; charset=ISO-8859-1"}, ) + requests_mock.head( + "https://crates.io/api/v1/crates/syntect/4.6.0/download", + headers={ + "Location": "https://static.crates.io/crates/syntect/syntect-4.6.0.crate" + }, + ) + requests_mock.head( + "https://codeload.github.com/fifengine/fifechan/tar.gz/0.1.5", + headers={ + "Content-Type": "application/x-gzip", + }, + ) + requests_mock.head( + "https://codeload.github.com/unknown-horizons/unknown-horizons/tar.gz/2019.1", + headers={ + "Content-Disposition": "attachment; filename=unknown-horizons-2019.1.tar.gz", + }, + ) + requests_mock.head( + "https://codeload.github.com/fifengine/fifengine/tar.gz/0.4.2", + headers={ + "Content-Disposition": "attachment; name=fieldName; " + "filename=fifengine-0.4.2.tar.gz;
other=stuff", + }, + ) expected_visit_types = defaultdict(int) # origin upstream is added as origin expected_nb_origins = 1 expected_visit_types["git"] += 1 for artifact in response["sources"]: # Each artifact is considered an origin (even "url" artifacts with mirror urls) expected_nb_origins += 1 artifact_type = artifact["type"] if artifact_type in [ "git", "svn", "hg", ]: expected_visit_types[artifact_type] += 1 elif artifact_type == "url": url = artifact["urls"][0] if url.endswith(".git"): expected_visit_types["git"] += 1 elif url.endswith(".c") or url.endswith(".txt"): expected_visit_types["content"] += 1 elif url.startswith("svn"): # mistyped artifact rendered as vcs nonetheless expected_visit_types["svn"] += 1 - else: + elif "crates.io" in url or "codeload.github.com" in url: + expected_visit_types["directory"] += 1 + else: # tarball artifacts expected_visit_types["directory"] += 1 assert set(expected_visit_types.keys()) == { "content", "git", "svn", "hg", "directory", } listed_result = lister.run() # 1 page read is 1 origin nb_pages = expected_nb_origins assert listed_result == ListerStats(pages=nb_pages, origins=expected_nb_origins) scheduler_origins = lister.scheduler.get_listed_origins( lister.lister_obj.id ).results assert len(scheduler_origins) == expected_nb_origins mapping_visit_types = defaultdict(int) for listed_origin in scheduler_origins: assert listed_origin.visit_type in expected_visit_types # no last update is listed on those manifests assert listed_origin.last_update is None mapping_visit_types[listed_origin.visit_type] += 1 assert dict(mapping_visit_types) == expected_visit_types def test_lister_nixguix_mostly_noop(datadir, swh_scheduler, requests_mock): - """NixGuixLister should ignore unsupported or incomplete origins""" + """NixGuixLister should skip unsupported, incomplete, or explicitly ignored origins""" url = SOURCES["nixpkgs"]["manifest"] origin_upstream = SOURCES["nixpkgs"]["repo"] - lister = NixGuixLister(swh_scheduler, url=url,
origin_upstream=origin_upstream) + lister = NixGuixLister( + swh_scheduler, + url=url, + origin_upstream=origin_upstream, + extensions_to_ignore=["foobar"], + ) response = page_response(datadir, "failure") requests_mock.get( url, [{"json": response}], ) # Amongst artifacts, this url does not allow determining its nature (tarball, file) # It ends up doing an http head query which returns a 404, so it is skipped. requests_mock.head( "https://crates.io/api/v1/0.1.5/no-extension-and-head-404-so-skipped", status_code=404, ) # Invalid schema for that origin (and no extension), so skip origin # from its name requests_mock.head( "ftp://ftp.ourproject.org/file-with-no-extension", exc=InvalidSchema, ) # Cannot communicate with an expired cert, so skip origin requests_mock.head( "https://code.9front.org/hg/plan9front", exc=SSLError, ) # Cannot connect to the site, so skip origin requests_mock.head( "https://git-tails.immerda.ch/onioncircuits", exc=ConnectionError, ) listed_result = lister.run() # only the origin upstream is listed, all other entries are unsupported or incomplete assert listed_result == ListerStats(pages=1, origins=1) scheduler_origins = lister.scheduler.get_listed_origins( lister.lister_obj.id ).results assert len(scheduler_origins) == 1 assert scheduler_origins[0].visit_type == "git" def test_lister_nixguix_fail(datadir, swh_scheduler, requests_mock): url = SOURCES["nixpkgs"]["manifest"] origin_upstream = SOURCES["nixpkgs"]["repo"] lister = NixGuixLister(swh_scheduler, url=url, origin_upstream=origin_upstream) requests_mock.get( url, status_code=404, ) with pytest.raises(requests.HTTPError): # listing cannot continue, so stop lister.run() scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == 0 diff --git a/swh/lister/puppet/__init__.py b/swh/lister/puppet/__init__.py index e56cee6..3e5e28d 100644 --- a/swh/lister/puppet/__init__.py +++ b/swh/lister/puppet/__init__.py @@ -1,101 +1,108 @@ #
Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information """ Puppet lister ============= The Puppet lister lists origins from `Puppet Forge`_. Puppet Forge is a package manager for Puppet modules. As of September 2022, `Puppet Forge`_ lists 6917 package names. Origins retrieving strategy --------------------------- To get a list of all package names we call an `http api endpoint`_ which has a `getModules`_ endpoint. It returns a paginated list of results and a `next` url. The api follows the `OpenApi 3.0 specification`. Page listing ------------ Each page returns a list of ``results`` which are raw data from the api response. The page size is 100, the maximum limit allowed by the api. Origins from page ----------------- The lister yields one hundred origin urls per page. The origin url is the html page corresponding to a package name on the forge, following this pattern:: "https://forge.puppet.com/modules/{owner}/{pkgname}" -For each origin `last_update`is set via the module "updated_at" value. +For each origin `last_update` is set via the module "updated_at" value. As the api also returns all existing versions for a package, we build an `artifacts` list in `extra_loader_arguments` with the archive tarball corresponding to each existing version.
Example for ``file_concat`` module located at https://forge.puppet.com/modules/electrical/file_concat:: { - "artifacts": { - "1.0.0": { - "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.0.tar.gz", # noqa: B950 - "version": "1.0.0", - "filename": "electrical-file_concat-1.0.0.tar.gz", - "last_update": "2015-04-09T12:03:13-07:00", - }, - "1.0.1": { + "artifacts": [ + { "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.1.tar.gz", # noqa: B950 "version": "1.0.1", "filename": "electrical-file_concat-1.0.1.tar.gz", "last_update": "2015-04-17T01:03:46-07:00", + "checksums": { + "md5": "74901a89544134478c2dfde5efbb7f14", + "sha256": "15e973613ea038d8a4f60bafe2d678f88f53f3624c02df3157c0043f4a400de6", # noqa: B950 + }, + }, + { + "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.0.tar.gz", # noqa: B950 + "version": "1.0.0", + "filename": "electrical-file_concat-1.0.0.tar.gz", + "last_update": "2015-04-09T12:03:13-07:00", + "checksums": { + "length": 13289, + }, }, - } + ], } Running tests ------------- Activate the virtualenv and run from within swh-lister directory:: pytest -s -vv --log-cli-level=DEBUG swh/lister/puppet/tests Testing with Docker ------------------- Change directory to swh/docker then launch the docker environment:: docker compose up -d Then schedule a Puppet listing task:: docker compose exec swh-scheduler swh scheduler task add -p oneshot list-puppet You can follow lister execution by displaying logs of swh-lister service:: docker compose logs -f swh-lister .. _Puppet Forge: https://forge.puppet.com/ .. _http api endpoint: https://forgeapi.puppet.com/ .. 
_getModules: https://forgeapi.puppet.com/#tag/Module-Operations/operation/getModules """ def register(): from .lister import PuppetLister return { "lister": PuppetLister, "task_modules": ["%s.tasks" % __name__], } diff --git a/swh/lister/puppet/lister.py b/swh/lister/puppet/lister.py index 4982e92..80ac3da 100644 --- a/swh/lister/puppet/lister.py +++ b/swh/lister/puppet/lister.py @@ -1,111 +1,113 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from datetime import datetime import logging from typing import Any, Dict, Iterator, List, Optional from urllib.parse import urljoin from swh.scheduler.interface import SchedulerInterface from swh.scheduler.model import ListedOrigin from ..pattern import CredentialsType, StatelessLister logger = logging.getLogger(__name__) # Aliasing the page results returned by `get_pages` method from the lister. PuppetListerPage = List[Dict[str, Any]] class PuppetLister(StatelessLister[PuppetListerPage]): """The Puppet lister lists origins from 'Puppet Forge'""" LISTER_NAME = "puppet" VISIT_TYPE = "puppet" INSTANCE = "puppet" BASE_URL = "https://forgeapi.puppet.com/" def __init__( self, scheduler: SchedulerInterface, credentials: Optional[CredentialsType] = None, ): super().__init__( scheduler=scheduler, credentials=credentials, instance=self.INSTANCE, url=self.BASE_URL, ) def get_pages(self) -> Iterator[PuppetListerPage]: """Yield pages of module results. It requests the http api endpoint to get paginated results of modules, and retrieves the `next` url. It ends when the `next` json value is `null`.
Open Api specification for getModules endpoint: https://forgeapi.puppet.com/#tag/Module-Operations/operation/getModules """ # limit = 100 is the max value for pagination limit: int = 100 response = self.http_request( f"{self.BASE_URL}v3/modules", params={"limit": limit} ) data: Dict[str, Any] = response.json() yield data["results"] while data["pagination"]["next"]: response = self.http_request( urljoin(self.BASE_URL, data["pagination"]["next"]) ) data = response.json() yield data["results"] def get_origins_from_page(self, page: PuppetListerPage) -> Iterator[ListedOrigin]: """Iterate on all pages and yield ListedOrigin instances.""" assert self.lister_obj.id is not None dt_parse_pattern = "%Y-%m-%d %H:%M:%S %z" for entry in page: last_update = datetime.strptime(entry["updated_at"], dt_parse_pattern) pkgname = entry["name"] owner = entry["owner"]["slug"] url = f"https://forge.puppet.com/modules/{owner}/{pkgname}" - artifacts = {} + artifacts = [] for release in entry["releases"]: # Build an artifact entry following original-artifacts-json specification # https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json # noqa: B950 checksums = {} if release["version"] == entry["current_release"]["version"]: # checksums are only available for current release for checksum in ("md5", "sha256"): checksums[checksum] = entry["current_release"][ f"file_{checksum}" ] else: # use file length as basic content check instead checksums["length"] = release["file_size"] - artifacts[release["version"]] = { - "filename": release["file_uri"].split("/")[-1], - "url": urljoin(self.BASE_URL, release["file_uri"]), - "version": release["version"], - "last_update": datetime.strptime( - release["created_at"], dt_parse_pattern - ).isoformat(), - "checksums": checksums, - } + artifacts.append( + { + "filename": release["file_uri"].split("/")[-1], + "url": urljoin(self.BASE_URL, release["file_uri"]), + "version": release["version"], + 
"last_update": datetime.strptime( + release["created_at"], dt_parse_pattern + ).isoformat(), + "checksums": checksums, + } + ) yield ListedOrigin( lister_id=self.lister_obj.id, visit_type=self.VISIT_TYPE, url=url, last_update=last_update, extra_loader_arguments={"artifacts": artifacts}, ) diff --git a/swh/lister/puppet/tests/test_lister.py b/swh/lister/puppet/tests/test_lister.py index 5dbfd89..80e5a63 100644 --- a/swh/lister/puppet/tests/test_lister.py +++ b/swh/lister/puppet/tests/test_lister.py @@ -1,106 +1,120 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from swh.lister.puppet.lister import PuppetLister # flake8: noqa: B950 -expected_origins = { - "https://forge.puppet.com/modules/electrical/file_concat": { - "artifacts": { - "1.0.0": { - "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.0.tar.gz", - "version": "1.0.0", - "filename": "electrical-file_concat-1.0.0.tar.gz", - "last_update": "2015-04-09T12:03:13-07:00", - "checksums": { - "length": 13289, - }, - }, - "1.0.1": { +expected_origins = [ + { + "url": "https://forge.puppet.com/modules/electrical/file_concat", + "artifacts": [ + { "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.1.tar.gz", "version": "1.0.1", "filename": "electrical-file_concat-1.0.1.tar.gz", "last_update": "2015-04-17T01:03:46-07:00", "checksums": { "md5": "74901a89544134478c2dfde5efbb7f14", "sha256": "15e973613ea038d8a4f60bafe2d678f88f53f3624c02df3157c0043f4a400de6", }, }, - } - }, - "https://forge.puppet.com/modules/puppetlabs/puppetdb": { - "artifacts": { - "1.0.0": { - "url": "https://forgeapi.puppet.com/v3/files/puppetlabs-puppetdb-1.0.0.tar.gz", + { + "url": "https://forgeapi.puppet.com/v3/files/electrical-file_concat-1.0.0.tar.gz", "version": "1.0.0", - "filename": 
"puppetlabs-puppetdb-1.0.0.tar.gz", - "last_update": "2012-09-19T16:51:22-07:00", - "checksums": { - "length": 16336, - }, - }, - "7.9.0": { - "url": "https://forgeapi.puppet.com/v3/files/puppetlabs-puppetdb-7.9.0.tar.gz", - "version": "7.9.0", - "filename": "puppetlabs-puppetdb-7.9.0.tar.gz", - "last_update": "2021-06-24T07:48:54-07:00", + "filename": "electrical-file_concat-1.0.0.tar.gz", + "last_update": "2015-04-09T12:03:13-07:00", "checksums": { - "length": 42773, + "length": 13289, }, }, - "7.10.0": { + ], + }, + { + "url": "https://forge.puppet.com/modules/puppetlabs/puppetdb", + "artifacts": [ + { "url": "https://forgeapi.puppet.com/v3/files/puppetlabs-puppetdb-7.10.0.tar.gz", "version": "7.10.0", "filename": "puppetlabs-puppetdb-7.10.0.tar.gz", "last_update": "2021-12-16T14:57:46-08:00", "checksums": { "md5": "e91a2074ca8d94a8b3ff7f6c8bbf12bc", "sha256": "49b1a542fbd2a1378c16cb04809e0f88bf4f3e45979532294fb1f03f56c97fbb", }, }, - } - }, - "https://forge.puppet.com/modules/saz/memcached": { - "artifacts": { - "1.0.0": { - "url": "https://forgeapi.puppet.com/v3/files/saz-memcached-1.0.0.tar.gz", + { + "url": "https://forgeapi.puppet.com/v3/files/puppetlabs-puppetdb-7.9.0.tar.gz", + "version": "7.9.0", + "filename": "puppetlabs-puppetdb-7.9.0.tar.gz", + "last_update": "2021-06-24T07:48:54-07:00", + "checksums": { + "length": 42773, + }, + }, + { + "url": "https://forgeapi.puppet.com/v3/files/puppetlabs-puppetdb-1.0.0.tar.gz", "version": "1.0.0", - "filename": "saz-memcached-1.0.0.tar.gz", - "last_update": "2011-11-20T13:40:30-08:00", + "filename": "puppetlabs-puppetdb-1.0.0.tar.gz", + "last_update": "2012-09-19T16:51:22-07:00", "checksums": { - "length": 2472, + "length": 16336, }, }, - "8.1.0": { + ], + }, + { + "url": "https://forge.puppet.com/modules/saz/memcached", + "artifacts": [ + { "url": "https://forgeapi.puppet.com/v3/files/saz-memcached-8.1.0.tar.gz", "version": "8.1.0", "filename": "saz-memcached-8.1.0.tar.gz", "last_update": 
"2022-07-11T03:34:55-07:00", "checksums": { "md5": "aadf80fba5848909429eb002ee1927ea", "sha256": "883d6186e91c2c3fed13ae2009c3aa596657f6707b76f1f7efc6203c6e4ae986", }, }, - } + { + "url": "https://forgeapi.puppet.com/v3/files/saz-memcached-1.0.0.tar.gz", + "version": "1.0.0", + "filename": "saz-memcached-1.0.0.tar.gz", + "last_update": "2011-11-20T13:40:30-08:00", + "checksums": { + "length": 2472, + }, + }, + ], }, -} +] def test_puppet_lister(datadir, requests_mock_datadir, swh_scheduler): lister = PuppetLister(scheduler=swh_scheduler) res = lister.run() assert res.pages == 2 assert res.origins == 1 + 1 + 1 scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results assert len(scheduler_origins) == len(expected_origins) - for origin in scheduler_origins: - assert origin.visit_type == "puppet" - assert origin.url in expected_origins - assert origin.extra_loader_arguments == expected_origins[origin.url] + assert [ + ( + scheduled.visit_type, + scheduled.url, + scheduled.extra_loader_arguments["artifacts"], + ) + for scheduled in sorted(scheduler_origins, key=lambda scheduled: scheduled.url) + ] == [ + ( + "puppet", + expected["url"], + expected["artifacts"], + ) + for expected in sorted(expected_origins, key=lambda expected: expected["url"]) + ]
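The `url_endswith` helper exercised by the new nixguix tests is defined in `swh.lister.nixguix.lister` and is not part of this diff. As a rough, illustrative sketch of the behavior the parametrized cases pin down (the function and exception names mirror the real API, but the body below is an assumption, not the actual implementation):

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlparse


class ArtifactWithoutExtension(ValueError):
    """Raised when neither the URL path nor its query values carry an extension."""


def url_endswith(urlparsed, extensions, raise_when_no_extension=True):
    """Check whether the parsed URL's path, or any of its query parameter
    values, ends with one of ``extensions``. A purely numeric suffix such as
    ``0.1.5`` is not considered an extension."""
    # Candidate strings to inspect: the last path segment, then query values
    candidates = [PurePosixPath(urlparsed.path).name] + [
        value
        for values in parse_qs(urlparsed.query).values()
        for value in values
    ]
    has_extension = False
    for candidate in candidates:
        _, dot, suffix = candidate.rpartition(".")
        if dot and suffix and not suffix.isdigit():
            has_extension = True
            if any(candidate.endswith(f".{ext}") for ext in extensions):
                return True
    if not has_extension and raise_when_no_extension:
        raise ArtifactWithoutExtension(urlparsed.geturl())
    return False


# Matches via a query parameter value rather than the path:
print(url_endswith(urlparse("https://example.org/one?p0=1&foo=bar.sig"), ["sig"]))  # True
```

Under this sketch, `tar.gz/0.1.5`-style paths raise `ArtifactWithoutExtension` because their only suffix is numeric, which is consistent with the `test_url_endswith_raise` cases above.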