diff --git a/swh/lister/aur/__init__.py b/swh/lister/aur/__init__.py
index d6db8a2..833c72b 100644
--- a/swh/lister/aur/__init__.py
+++ b/swh/lister/aur/__init__.py
@@ -1,135 +1,135 @@
 # Copyright (C) 2022 the Software Heritage developers
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information


 """
 AUR (Arch User Repository) lister
 =================================

 The AUR lister list origins from `aur.archlinux.org`_, the Arch User Repository.
 For each package, there is a git repository, we use the git url as origin and the
 snapshot url as the artifact for the loader to download.

 Each git repository consist of a directory (for which name corresponds to the package name),
 and at least two files, .SRCINFO and PKGBUILD which are recipes for building the package.

 Each package has a version, the latest one. There isn't any archives of previous versions,
 so the lister will always list one version per package.

 As of August 2022 `aur.archlinux.org`_ list 84438 packages. Please note that this amount
 is the total of `regular`_ and `split`_ packages.
 We will archive `regular` and `split` packages but only their `pkgbase` because that is
 the only one that actually has source code.
 The packages amount is 78554 after removing the split ones.

 Origins retrieving strategy
 ---------------------------

 An rpc api exists but it is recommended to save bandwidth so it's not used. See
 `New AUR Metadata Archives`_ for more on this topic.

 To get an index of all AUR existing packages we download a `packages-meta-v1.json.gz`_
 which contains a json file listing all existing packages definitions.

 Each entry describes the latest released version of a package. The origin url
 for a package is built using `pkgbase` and corresponds to a git repository.

 Note that we list only standard package (when pkgbase equal pkgname), not the ones
 belonging to split packages.

 It takes only a couple of minutes to download the 7 MB index archive and parses its
 content.

 Page listing
 ------------

 Each page is related to one package.
 As its not possible to get all previous versions, it will always returns one line.

 Each page corresponds to a package with a `version`, an `url` for a Git
 repository, a `project_url` which represents the upstream project url and
 a canonical `snapshot_url` from which a tar.gz archive of the package can
 be downloaded.

 The data schema for each line is:

 * **pkgname**: Package name
 * **version**: Package version
 * **url**: Git repository url for a package
 * **snapshot_url**: Package download url
 * **project_url**: Upstream project url if any
 * **last_modified**: Iso8601 last update date

 Origins from page
 -----------------

 The lister yields one origin per page.
 The origin url corresponds to the git url of a package, for example
 ``https://aur.archlinux.org/{package}.git``.

 Additionally we add some data set to "extra_loader_arguments":

 * **artifacts**: Represent data about the Aur package snapshot to download,
-  following :ref:`original-artifacts-json specification `
+  following :ref:`original-artifacts-json specification `
 * **aur_metadata**: To store all other interesting attributes that do not belongs to artifacts.

 Origin data example::

     {
         "visit_type": "aur",
         "url": "https://aur.archlinux.org/hg-evolve.git",
         "extra_loader_arguments": {
             "artifacts": [
                 {
                     "filename": "hg-evolve.tar.gz",
                     "url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz",  # noqa: B950
                     "version": "10.5.1-1",
                 }
             ],
             "aur_metadata": [
                 {
                     "version": "10.5.1-1",
                     "project_url": "https://www.mercurial-scm.org/doc/evolution/",
                     "last_update": "2022-04-27T20:02:56+00:00",
                     "pkgname": "hg-evolve",
                 }
             ],
         },

 Running tests
 -------------

 Activate the virtualenv and run from within swh-lister directory::

    pytest -s -vv --log-cli-level=DEBUG swh/lister/aur/tests

 Testing with Docker
 -------------------

 Change directory to swh/docker then launch the docker environment::

    docker-compose up -d

 Then connect to the lister::

    docker exec -it docker_swh-lister_1 bash

 And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::

    swh lister run -l aur

 .. _aur.archlinux.org: https://aur.archlinux.org
 .. _New AUR Metadata Archives: https://lists.archlinux.org/pipermail/aur-general/2021-November/036659.html
 .. _packages-meta-v1.json.gz: https://aur.archlinux.org/packages-meta-v1.json.gz
 .. _regular: https://wiki.archlinux.org/title/PKGBUILD#Package_name
 .. _split: https://man.archlinux.org/man/PKGBUILD.5#PACKAGE_SPLITTING
 """


 def register():
     from .lister import AurLister

     return {
         "lister": AurLister,
         "task_modules": ["%s.tasks" % __name__],
     }
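
The listing strategy described in the docstring (read the gzipped JSON index, keep only regular packages, derive the git and snapshot URLs, then emit one origin per page) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `swh/lister/aur/lister.py`: the helper names `iter_pages` and `build_origin` are hypothetical, and the index keys `Name`, `PackageBase`, `Version`, `URL` and `LastModified` are assumed to match the layout of `packages-meta-v1.json.gz`.

```python
import gzip
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterator

BASE_URL = "https://aur.archlinux.org"


def iter_pages(index_gz: bytes) -> Iterator[Dict[str, Any]]:
    """Yield one page per regular package found in the gzipped JSON index.

    Hypothetical helper: the key names read from each entry are assumptions
    about the index layout, not taken from the lister implementation.
    """
    entries = json.loads(gzip.decompress(index_gz))
    for entry in entries:
        # Skip members of split packages: keep only pkgbase == pkgname.
        if entry["PackageBase"] != entry["Name"]:
            continue
        pkgname = entry["Name"]
        last_modified = datetime.fromtimestamp(
            entry["LastModified"], tz=timezone.utc
        ).isoformat()
        yield {
            "pkgname": pkgname,
            "version": entry["Version"],
            "url": f"{BASE_URL}/{pkgname}.git",
            "snapshot_url": f"{BASE_URL}/cgit/aur.git/snapshot/{pkgname}.tar.gz",
            "project_url": entry.get("URL"),
            "last_modified": last_modified,
        }


def build_origin(page: Dict[str, Any]) -> Dict[str, Any]:
    """Shape a page like the "Origin data example" in the docstring."""
    return {
        "visit_type": "aur",
        "url": page["url"],
        "extra_loader_arguments": {
            "artifacts": [
                {
                    "filename": f"{page['pkgname']}.tar.gz",
                    "url": page["snapshot_url"],
                    "version": page["version"],
                }
            ],
            "aur_metadata": [
                {
                    "version": page["version"],
                    "project_url": page["project_url"],
                    "last_update": page["last_modified"],
                    "pkgname": page["pkgname"],
                }
            ],
        },
    }
```

The sketch takes the index as bytes so it can be exercised without network access; in practice the bytes would come from downloading the index archive referenced in the docstring.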