diff --git a/swh/lister/arch/__init__.py b/swh/lister/arch/__init__.py
index 276e4d2..30d7ae3 100644
--- a/swh/lister/arch/__init__.py
+++ b/swh/lister/arch/__init__.py
@@ -1,226 +1,226 @@
 # Copyright (C) 2022 the Software Heritage developers
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 Arch Linux lister
 =================

 The Arch lister list origins from `archlinux.org`_, the official Arch Linux packages,
 and from `archlinuxarm.org`_, the Arch Linux ARM packages, an unofficial port for arm.

 Packages are put in three different repositories, `core`, `extra` and `community`.

 To manage listing those origins, this lister must be instantiated with a `flavours` dict.

 `flavours` default values::

     "official": {
         "archs": ["x86_64"],
         "repos": ["core", "extra", "community"],
         "base_info_url": "https://archlinux.org",
         "base_archive_url": "https://archive.archlinux.org",
         "base_mirror_url": "",
         "base_api_url": "https://archlinux.org",
     },
     "arm": {
         "archs": ["armv7h", "aarch64"],
         "repos": ["core", "extra", "community"],
         "base_info_url": "https://archlinuxarm.org",
         "base_archive_url": "",
         "base_mirror_url": "https://uk.mirror.archlinuxarm.org",
         "base_api_url": "",
     }

 From official Arch Linux repositories we can list all packages and all released versions.
 They provide an api and archives.

 From Arch Linux ARM repositories we can list all packages at their latest versions, they
 do not provide api or archives.

 As of August 2022 `archlinux.org`_ list 12592 packages and `archlinuxarm.org` 24044
 packages. Please note that those amounts are the total of `regular`_ and `split`_ packages.

 Origins retrieving strategy
 ---------------------------

 Download repositories archives as tar.gz files from
 https://archive.archlinux.org/repos/last/, extract to a temp directory and then walks
 through each 'desc' files.

 Repository archive index url example for Arch Linux `core repository`_ and Arch Linux ARM
 `extra repository`_.

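As an illustration of the 'desc' walk described above, here is a minimal sketch of how one such file could be read. It assumes the usual pacman database layout (a ``%FIELD%`` header line followed by one value per line); the helper name ``parse_desc`` and the sample content are hypothetical and are not taken from the lister code.

```python
def parse_desc(text):
    # Parse the text of a 'desc' file into a dict mapping each lowercased
    # field name to the list of values found under it.
    # Assumption: fields are introduced by a "%FIELD%" line and values
    # follow one per line, with blank lines between fields.
    fields = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("%") and line.endswith("%") and len(line) > 2:
            current = line[1:-1].lower()
            fields[current] = []
        elif line and current is not None:
            fields[current].append(line)
    return fields


# Hypothetical sample, shaped like an Arch Linux ARM 'desc' entry.
SAMPLE_DESC = "\n".join(
    [
        "%FILENAME%",
        "mercurial-6.1.3-1-armv7h.pkg.tar.xz",
        "",
        "%NAME%",
        "mercurial",
        "",
        "%VERSION%",
        "6.1.3-1",
        "",
        "%ARCH%",
        "armv7h",
        "",
    ]
)

desc = parse_desc(SAMPLE_DESC)

# Origin url pattern documented in this module for Arch Linux ARM.
origin_url = "https://archlinuxarm.org/packages/{arch}/{name}".format(
    arch=desc["arch"][0], name=desc["name"][0]
)
print(origin_url)
```

Keeping every value in a list is a deliberate choice in this sketch, since some fields may carry several values; single-valued fields are read with ``[0]``.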
 Each 'desc' file describe the latest released version of a package and helps
 to build an origin url and `package versions url`_ from where scrapping artifacts
 metadata and get a list of versions.

 For Arch Linux ARM it follow the same discovery process parsing 'desc' files.
 The main difference is that we can't get existing versions of an arm package
 because https://archlinuxarm.org does not have an 'archive' website or api.

 Page listing
 ------------

 Each page is a list of package belonging to a flavour ('official', 'arm'),
 and a repo ('core', 'extra', 'community').

 Each line of a page represents an origin url for a package name with related metadata
 and versions.

 Origin url examples:

 * **Arch Linux**: https://archlinux.org/packages/extra/x86_64/mercurial
 * **Arch Linux ARM**: https://archlinuxarm.org/packages/armv7h/mercurial

 The data schema for each line is:

 * **name**: Package name
 * **version**: Last released package version
 * **last_modified**: Iso8601 last modified date from timestamp
 * **url**: Origin url
 * **data**: Package metadata dict
 * **versions**: A list of dict with artifacts metadata for each versions

 The data schema for `versions` within a line:

 * **name**: Package name
 * **version**: Package version
 * **repo**: One of core, extra, community
 * **arch**: Processor architecture targeted
 * **filename**: Filename of the archive to download
 * **url**: Package download url
 * **last_modified**: Iso8601 last modified date from timestamp, used as publication date
   for this version
 * **length**: Length of the archive to download

 Origins from page
 -----------------

 The origin url corresponds to:

 * **Arch Linux**: https://archlinux.org/packages/{repo}/{arch}/{name}
 * **Arch Linux ARM**: https://archlinuxarm.org/packages/{arch}/{name}

 Additionally we add some data set to "extra_loader_arguments":

 * **artifacts**: Represent data about the Arch Linux package archive to download,
   following :ref:`original-artifacts-json specification `
 * **arch_metadata**: To store all other interesting attributes that do not belongs
   to artifacts.

 Origin data example Arch Linux official::

     {
         "url": "https://archlinux.org/packages/extra/x86_64/mercurial",
         "visit_type": "arch",
         "extra_loader_arguments": {
             "artifacts": [
                 {
                     "url": "https://archive.archlinux.org/packages/m/mercurial/mercurial-4.8.2-1-x86_64.pkg.tar.xz",  # noqa: B950
                     "version": "4.8.2-1",
                     "length": 4000000,
                     "filename": "mercurial-4.8.2-1-x86_64.pkg.tar.xz",
                 },
                 {
                     "url": "https://archive.archlinux.org/packages/m/mercurial/mercurial-4.9-1-x86_64.pkg.tar.xz",  # noqa: B950
                     "version": "4.9-1",
                     "length": 4000000,
                     "filename": "mercurial-4.9-1-x86_64.pkg.tar.xz",
                 },
                 {
                     "url": "https://archive.archlinux.org/packages/m/mercurial/mercurial-4.9.1-1-x86_64.pkg.tar.xz",  # noqa: B950
                     "version": "4.9.1-1",
                     "length": 4000000,
                     "filename": "mercurial-4.9.1-1-x86_64.pkg.tar.xz",
                 },
                 ...
             ],
             "arch_metadata": [
                 {
                     "arch": "x86_64",
                     "repo": "extra",
                     "name": "mercurial",
                     "version": "4.8.2-1",
                     "last_modified": "2019-01-15T20:31:00",
                 },
                 {
                     "arch": "x86_64",
                     "repo": "extra",
                     "name": "mercurial",
                     "version": "4.9-1",
                     "last_modified": "2019-02-12T06:15:00",
                 },
                 {
                     "arch": "x86_64",
                     "repo": "extra",
                     "name": "mercurial",
                     "version": "4.9.1-1",
                     "last_modified": "2019-03-30T17:40:00",
                 },
             ],
         },
     },

 Origin data example Arch Linux ARM::

     {
         "url": "https://archlinuxarm.org/packages/armv7h/mercurial",
         "visit_type": "arch",
         "extra_loader_arguments": {
             "artifacts": [
                 {
                     "url": "https://uk.mirror.archlinuxarm.org/armv7h/extra/mercurial-6.1.3-1-armv7h.pkg.tar.xz",  # noqa: B950
                     "length": 4897816,
                     "version": "6.1.3-1",
                     "filename": "mercurial-6.1.3-1-armv7h.pkg.tar.xz",
                 }
             ],
             "arch_metadata": [
                 {
                     "arch": "armv7h",
                     "name": "mercurial",
                     "repo": "extra",
                     "version": "6.1.3-1",
                     "last_modified": "2022-06-02T22:13:08",
                 }
             ],
         },
     },

 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory::

     pytest -s -vv --log-cli-level=DEBUG swh/lister/arch/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment::

-    docker-compose up -d
+    docker compose up -d

-Then connect to the lister::
+Then schedule an arch listing task::

-    docker exec -it docker_swh-lister_1 bash
+    docker compose exec swh-scheduler swh scheduler task add -p oneshot list-arch

-And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::
+You can follow lister execution by displaying logs of swh-lister service::

-    swh lister run -l arch
+    docker compose logs -f swh-lister

 .. _archlinux.org: https://archlinux.org/packages/
 .. _archlinuxarm.org: https://archlinuxarm.org/packages/
 .. _core repository: https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz
 .. _extra repository: https://uk.mirror.archlinuxarm.org/armv7h/extra/extra.files.tar.gz
 .. _package versions url: https://archive.archlinux.org/packages/m/mercurial/
 .. _regular: https://wiki.archlinux.org/title/PKGBUILD#Package_name
 .. _split: https://man.archlinux.org/man/PKGBUILD.5#PACKAGE_SPLITTING
 """


 def register():
     from .lister import ArchLister

     return {
         "lister": ArchLister,
         "task_modules": ["%s.tasks" % __name__],
     }
diff --git a/swh/lister/aur/__init__.py b/swh/lister/aur/__init__.py
index 833c72b..b4ded88 100644
--- a/swh/lister/aur/__init__.py
+++ b/swh/lister/aur/__init__.py
@@ -1,135 +1,135 @@
 # Copyright (C) 2022 the Software Heritage developers
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 AUR (Arch User Repository) lister
 =================================

 The AUR lister list origins from `aur.archlinux.org`_, the Arch User Repository.
 For each package, there is a git repository, we use the git url as origin and the
 snapshot url as the artifact for the loader to download.

 Each git repository consist of a directory (for which name corresponds to the package
 name), and at least two files, .SRCINFO and PKGBUILD which are recipes for building
 the package.

 Each package has a version, the latest one. There isn't any archives of previous
 versions, so the lister will always list one version per package.

 As of August 2022 `aur.archlinux.org`_ list 84438 packages. Please note that this amount
 is the total of `regular`_ and `split`_ packages.
 We will archive `regular` and `split` packages but only their `pkgbase` because that is
 the only one that actually has source code.
 The packages amount is 78554 after removing the split ones.

 Origins retrieving strategy
 ---------------------------

 An rpc api exists but it is recommended to save bandwidth so it's not used. See
 `New AUR Metadata Archives`_ for more on this topic.

 To get an index of all AUR existing packages we download a `packages-meta-v1.json.gz`_
 which contains a json file listing all existing packages definitions.

 Each entry describes the latest released version of a package. The origin url
 for a package is built using `pkgbase` and corresponds to a git repository.

 Note that we list only standard package (when pkgbase equal pkgname), not the ones
 belonging to split packages.

 It takes only a couple of minutes to download the 7 MB index archive and parses its
 content.

 Page listing
 ------------

 Each page is related to one package.
 As its not possible to get all previous versions, it will always returns one line.

 Each page corresponds to a package with a `version`, an `url` for a Git
 repository, a `project_url` which represents the upstream project url and
 a canonical `snapshot_url` from which a tar.gz archive of the package can
 be downloaded.

 The data schema for each line is:

 * **pkgname**: Package name
 * **version**: Package version
 * **url**: Git repository url for a package
 * **snapshot_url**: Package download url
 * **project_url**: Upstream project url if any
 * **last_modified**: Iso8601 last update date

 Origins from page
 -----------------

 The lister yields one origin per page.
 The origin url corresponds to the git url of a package, for example
 ``https://aur.archlinux.org/{package}.git``.

 Additionally we add some data set to "extra_loader_arguments":

 * **artifacts**: Represent data about the Aur package snapshot to download,
   following :ref:`original-artifacts-json specification `
 * **aur_metadata**: To store all other interesting attributes that do not belongs
   to artifacts.

 Origin data example::

     {
         "visit_type": "aur",
         "url": "https://aur.archlinux.org/hg-evolve.git",
         "extra_loader_arguments": {
             "artifacts": [
                 {
                     "filename": "hg-evolve.tar.gz",
                     "url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz",  # noqa: B950
                     "version": "10.5.1-1",
                 }
             ],
             "aur_metadata": [
                 {
                     "version": "10.5.1-1",
                     "project_url": "https://www.mercurial-scm.org/doc/evolution/",
                     "last_update": "2022-04-27T20:02:56+00:00",
                     "pkgname": "hg-evolve",
                 }
             ],
         },

 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory::

     pytest -s -vv --log-cli-level=DEBUG swh/lister/aur/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment::

-    docker-compose up -d
+    docker compose up -d

-Then connect to the lister::
+Then schedule an aur listing task::

-    docker exec -it docker_swh-lister_1 bash
+    docker compose exec swh-scheduler swh scheduler task add -p oneshot list-aur

-And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::
+You can follow lister execution by displaying logs of swh-lister service::

-    swh lister run -l aur
+    docker compose logs -f swh-lister

 .. _aur.archlinux.org: https://aur.archlinux.org
 .. _New AUR Metadata Archives: https://lists.archlinux.org/pipermail/aur-general/2021-November/036659.html
 .. _packages-meta-v1.json.gz: https://aur.archlinux.org/packages-meta-v1.json.gz
 .. _regular: https://wiki.archlinux.org/title/PKGBUILD#Package_name
 .. _split: https://man.archlinux.org/man/PKGBUILD.5#PACKAGE_SPLITTING
 """


 def register():
     from .lister import AurLister

     return {
         "lister": AurLister,
         "task_modules": ["%s.tasks" % __name__],
     }
diff --git a/swh/lister/bower/__init__.py b/swh/lister/bower/__init__.py
index 1f1c017..cdf11f2 100644
--- a/swh/lister/bower/__init__.py
+++ b/swh/lister/bower/__init__.py
@@ -1,76 +1,76 @@
 # Copyright (C) 2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 Bower lister
 ============

 The `Bower`_ lister list origins from its packages registry `registry.bower.io`_.

 Bower is a tool to manage Javascript packages.

 The registry provide an `http api`_ from where the lister retrieve package names
 and url.

 As of August 2022 `registry.bower.io`_ list 71028 package names.

 Note that even if the project is still maintained(security fixes, no new features), it is
 recommended to not use it anymore and prefer Yarn as a replacement since 2018.

 Origins retrieving strategy
 ---------------------------

 To get a list of all package names we call `https://registry.bower.io/packages` endpoint.
 There is no other way for discovery (no archive index, no database dump, no dvcs
 repository).

 Page listing
 ------------

 There is only one page that list all origins url.

 Origins from page
 -----------------

 The lister yields all origins url from one page. It is a list of package name and url.
 Origins url corresponds to Git repository url.
 Bower is supposed to support Svn repository too but on +/- 71000 urls I have only found
 35 urls that may not be Git repository.

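The single-page listing described above can be sketched as follows. This is only an illustration: it assumes the registry response decodes to a list of objects carrying ``name`` and ``url`` keys, and the helper ``origins_from_page`` and the sample entries are hypothetical.

```python
import json


def origins_from_page(page):
    # Yield (package name, origin url) for every entry of the single page.
    # Entries without a usable url are skipped rather than guessed.
    for entry in page:
        url = entry.get("url")
        if url:
            yield entry.get("name"), url


# Hypothetical payload standing in for the registry response body.
SAMPLE_BODY = json.dumps(
    [
        {"name": "widget", "url": "https://example.org/alice/widget.git"},
        {"name": "gadget", "url": "https://example.org/bob/gadget.git"},
        {"name": "broken-entry"},
    ]
)

origins = list(origins_from_page(json.loads(SAMPLE_BODY)))
for name, url in origins:
    print(name, url)
```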
 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory::

     pytest -s -vv --log-cli-level=DEBUG swh/lister/bower/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment::

-    docker-compose up -d
+    docker compose up -d

-Then connect to the lister::
+Then schedule a bower listing task::

-    docker exec -it docker_swh-lister_1 bash
+    docker compose exec swh-scheduler swh scheduler task add -p oneshot list-bower

-And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::
+You can follow lister execution by displaying logs of swh-lister service::

-    swh lister run -l bower
+    docker compose logs -f swh-lister

 .. _Bower: https://bower.io
 .. _registry.bower.io: https://registry.bower.io
 .. _http api: https://registry.bower.io/packages
 """


 def register():
     from .lister import BowerLister

     return {
         "lister": BowerLister,
         "task_modules": ["%s.tasks" % __name__],
     }
diff --git a/swh/lister/crates/__init__.py b/swh/lister/crates/__init__.py
index c4ca72c..6fd08e6 100644
--- a/swh/lister/crates/__init__.py
+++ b/swh/lister/crates/__init__.py
@@ -1,142 +1,142 @@
 # Copyright (C) 2022 the Software Heritage developers
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 Crates lister
 =============

 The Crates lister list origins from `Crates.io`_, the Rust community’s crate registry.

 Origins are `packages`_ for the `Rust language`_ ecosystem.
 Package follow a `layout specifications`_ to be usable with the `Cargo`_ package manager
 and have a `Cargo.toml`_ file manifest which consists in metadata to describe and build
 a specific package version.

 As of August 2022 `Crates.io`_ list 89013 packages name for a total of 588215 released
 versions.

 Origins retrieving strategy
 ---------------------------

 A json http api to list packages from crates.io but we choose a `different strategy`_
 in order to reduce to its bare minimum the amount of http call and bandwidth.

 We clone a git repository which contains a tree of directories whose last child folder
 name corresponds to the package name and contains a Cargo.toml file with some json data
 to describe all existing versions of the package.
 It takes a few seconds to clone the repository and browse it to build a full index of
 existing package and related versions.

 The lister is incremental, so the first time it clones and browses the repository as
 previously described then stores the last seen commit id.
 Next time, it retrieves the list of new and changed files since last commit id and
 returns new or changed package with all of their related versions.

 Note that all Git related operations are done with `Dulwich`_, a Python
 implementation of the Git file formats and protocols.

 Page listing
 ------------

 Each page is related to one package.
 Each line of a page corresponds to different versions of this package.

 The data schema for each line is:

 * **name**: Package name
 * **version**: Package version
 * **crate_file**: Package download url
 * **checksum**: Package download checksum
 * **yanked**: Whether the package is yanked or not
 * **last_update**: Iso8601 last update date computed upon git commit date of the
   related Cargo.toml file

 Origins from page
 -----------------

 The lister yields one origin per page.
 The origin url corresponds to the http api url for a package, for example
 "https://crates.io/api/v1/crates/{package}".

 Additionally we add some data set to "extra_loader_arguments":

 * **artifacts**: Represent data about the Crates to download, following
   :ref:`original-artifacts-json specification `
 * **crates_metadata**: To store all other interesting attributes that do not belongs
   to artifacts. For now it mainly indicate when a version is `yanked`_.

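The mapping from a page (one line per version) to a single origin can be sketched as below. This is a hedged illustration rather than the lister's actual code: the helper ``origin_from_page`` is hypothetical, the checksum is assumed to be a sha256 digest, and the filename is assumed to be the last path segment of ``crate_file``. The returned dict mirrors the flat layout of the origin data example given in this docstring.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def origin_from_page(page):
    # Build one origin description from the lines of a page, where every
    # line follows the schema listed above (name, version, crate_file,
    # checksum, yanked, last_update).
    name = page[0]["name"]
    artifacts = []
    crates_metadata = []
    for line in page:
        url = line["crate_file"]
        artifacts.append(
            {
                # Assumption: the checksum field holds a sha256 digest.
                "checksums": {"sha256": line["checksum"]},
                # Assumption: filename is the last segment of the url path.
                "filename": PurePosixPath(urlparse(url).path).name,
                "url": url,
                "version": line["version"],
            }
        )
        crates_metadata.append(
            {"version": line["version"], "yanked": line["yanked"]}
        )
    return {
        "url": "https://crates.io/api/v1/crates/{}".format(name),
        "artifacts": artifacts,
        "crates_metadata": crates_metadata,
    }


# Sample page reusing the values of the 'rand' 0.1.1 entry shown in this
# docstring; the last_update value is a placeholder.
SAMPLE_PAGE = [
    {
        "name": "rand",
        "version": "0.1.1",
        "crate_file": "https://static.crates.io/crates/rand/rand-0.1.1.crate",
        "checksum": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d",
        "yanked": False,
        "last_update": "1970-01-01T00:00:00+00:00",
    }
]

origin = origin_from_page(SAMPLE_PAGE)
print(origin["url"])
```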
 Origin data example::

     {
         "url": "https://crates.io/api/v1/crates/rand",
         "artifacts": [
             {
                 "checksums": {
                     "sha256": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d",  # noqa: B950
                 },
                 "filename": "rand-0.1.1.crate",
                 "url": "https://static.crates.io/crates/rand/rand-0.1.1.crate",
                 "version": "0.1.1",
             },
             {
                 "checksums": {
                     "sha256": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7",  # noqa: B950
                 },
                 "filename": "rand-0.1.2.crate",
                 "url": "https://static.crates.io/crates/rand/rand-0.1.2.crate",
                 "version": "0.1.2",
             },
         ],
         "crates_metadata": [
             {
                 "version": "0.1.1",
                 "yanked": False,
             },
             {
                 "version": "0.1.2",
                 "yanked": False,
             },
         ],
     }

 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory:

     pytest -s -vv --log-cli-level=DEBUG swh/lister/crates/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment:

-    docker-compose up -d
+    docker compose up -d

-Then connect to the lister:
+Then schedule a crates listing task::

-    docker exec -it docker_swh-lister_1 bash
+    docker compose exec swh-scheduler swh scheduler task add -p oneshot list-crates

-And run the lister (The output of this listing results in “oneshot” tasks in the scheduler):
+You can follow lister execution by displaying logs of swh-lister service::

-    swh lister run -l crates
+    docker compose logs -f swh-lister

 .. _Crates.io: https://crates.io
 .. _packages: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html
 .. _Rust language: https://www.rust-lang.org/
 .. _layout specifications: https://doc.rust-lang.org/cargo/guide/project-layout.html
 .. _Cargo: https://doc.rust-lang.org/cargo/guide/why-cargo-exists.html#enter-cargo
 .. _Cargo.toml: https://doc.rust-lang.org/cargo/reference/manifest.html
 .. _different strategy: https://crates.io/data-access
 .. _Dulwich: https://www.dulwich.io/
 .. _yanked: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-yank
 """


 def register():
     from .lister import CratesLister

     return {
         "lister": CratesLister,
         "task_modules": ["%s.tasks" % __name__],
     }
diff --git a/swh/lister/pubdev/__init__.py b/swh/lister/pubdev/__init__.py
index 63bde65..310595f 100644
--- a/swh/lister/pubdev/__init__.py
+++ b/swh/lister/pubdev/__init__.py
@@ -1,71 +1,71 @@
 # Copyright (C) 2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 Pub.dev lister
 ==============

 The Pubdev lister list origins from `pub.dev`_, the `Dart`_ and `Flutter`_ packages
 registry.

 The registry provide an `http api`_ from where the lister retrieve package names.

 As of August 2022 `pub.dev`_ list 33535 package names.

 Origins retrieving strategy
 ---------------------------

 To get a list of all package names we call `https://pub.dev/api/packages` endpoint.
 There is no other way for discovery (no archive index, no database dump, no dvcs
 repository).

 Page listing
 ------------

 There is only one page that list all origins url based
 on `https://pub.dev/api/packages/{pkgname}`.
 The origin url corresponds to the http api endpoint that returns complete information
 about the package versions (name, version, author, description, release date).

 Origins from page
 -----------------

 The lister yields all origins url from one page.

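Building origin urls from the package names, as described above, could look like the sketch below. The helper ``origin_urls`` and the sample names are hypothetical; only the url pattern ``https://pub.dev/api/packages/{pkgname}`` comes from this docstring.

```python
from urllib.parse import quote

# Url pattern documented above for pub.dev origins.
ORIGIN_URL_PATTERN = "https://pub.dev/api/packages/{pkgname}"


def origin_urls(package_names):
    # Return one origin url per package name, keeping the input order.
    # Names are percent-encoded defensively in case one contains
    # characters that are not safe in a url path segment.
    return [
        ORIGIN_URL_PATTERN.format(pkgname=quote(name, safe=""))
        for name in package_names
    ]


# Hypothetical package names standing in for the registry listing.
urls = origin_urls(["some_package", "another_package"])
for url in urls:
    print(url)
```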
 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory::

     pytest -s -vv --log-cli-level=DEBUG swh/lister/pubdev/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment::

     docker-compose up -d

-Then connect to the lister::
+Then schedule a pubdev listing task::

-    docker exec -it docker_swh-lister_1 bash
+    docker compose exec swh-scheduler swh scheduler task add -p oneshot list-pubdev

-And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::
+You can follow lister execution by displaying logs of swh-lister service::

-    swh lister run -l pubdev
+    docker compose logs -f swh-lister

 .. _pub.dev: https://pub.dev
 .. _Dart: https://dart.dev
 .. _Flutter: https://flutter.dev
 .. _http api: https://pub.dev/help/api
 """


 def register():
     from .lister import PubDevLister

     return {
         "lister": PubDevLister,
         "task_modules": ["%s.tasks" % __name__],
     }