swh/lister/aur/__init__.py
- This file was added.
| # Copyright (C) 2022 the Software Heritage developers | |||||
| # License: GNU General Public License version 3, or any later version | |||||
| # See top-level LICENSE file for more information | |||||
| """ | |||||
| AUR (Arch User Repository) lister | |||||
| ================================= | |||||
| The AUR lister lists origins from `aur.archlinux.org`_, the Arch User Repository. | |||||
| Each package has a git repository; we use the git url as the origin and the | |||||
| snapshot url as the artifact for the loader to download. | |||||
| Each git repository consists of a directory (whose name corresponds to the package name), | |||||
| and at least two files, .SRCINFO and PKGBUILD, which are recipes for building the package. | |||||
| Each package has a single version, the latest one. There are no archives of previous versions, | |||||
| so the lister will always list one version per package. | |||||
| As of August 2022, `aur.archlinux.org`_ lists 84438 packages. Please note that this number | |||||
| is the total of `regular`_ and `split`_ packages. We decided not to list packages | |||||
| belonging to split packages, as they do not have a valid snapshot url. | |||||
| After removing the split ones, 78554 packages remain. | |||||
vlorentz: I don't think that's accurate; we do archive split packages, but only their "pkgbase" because… | |||||
franckbret (author): I rephrased it, does it look OK to you? | |||||
| Origins retrieving strategy | |||||
| --------------------------- | |||||
| An RPC API exists, but it is not used because saving bandwidth is recommended. See | |||||
| `New AUR Metadata Archives`_ for more on this topic. | |||||
| To get an index of all existing AUR packages, we download `packages-meta-v1.json.gz`_, | |||||
| which contains a JSON file listing all existing package definitions. | |||||
| Each entry describes the latest released version of a package. The origin url | |||||
| for a package is built using `pkgbase` and corresponds to a git repository. | |||||
| Note that we list only standard packages (where pkgbase equals pkgname), not the ones | |||||
| belonging to split packages. | |||||
| It takes only a couple of minutes to download the 7 MB index archive and parse its | |||||
| content. | |||||
| Page listing | |||||
| ------------ | |||||
| Each page is related to one package. As it is not possible to get previous | |||||
| versions, a page will always return one line. | |||||
| Each page corresponds to a package with a `version`, an `url` for a Git | |||||
| repository, a `project_url` which represents the upstream project url and | |||||
| a canonical `snapshot_url` from which a tar.gz archive of the package can | |||||
| be downloaded. | |||||
| The data schema for each line is: | |||||
| * **pkgname**: Package name | |||||
| * **version**: Package version | |||||
| * **url**: Git repository url for a package | |||||
| * **snapshot_url**: Package download url | |||||
| * **project_url**: Upstream project url if any | |||||
| * **last_modified**: ISO 8601 last update date | |||||
| Origins from page | |||||
| ----------------- | |||||
| The lister yields one origin per page. | |||||
| The origin url corresponds to the git url of a package, for example ``https://aur.archlinux.org/{package}.git``. | |||||
| Additionally, we add some data to "extra_loader_arguments": | |||||
| * **artifacts**: Represents data about the AUR package snapshot to download, | |||||
| following :ref:`original-artifacts-json specification <original-artifacts-json>` | |||||
| * **aur_metadata**: Stores all other interesting attributes that do not belong to artifacts. | |||||
| Origin data example:: | |||||
| { | |||||
| "visit_type": "aur", | |||||
| "url": "https://aur.archlinux.org/hg-evolve.git", | |||||
| "extra_loader_arguments": { | |||||
| "artifacts": [ | |||||
| { | |||||
| "filename": "hg-evolve.tar.gz", | |||||
| "url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz", # noqa: B950 | |||||
| "version": "10.5.1-1", | |||||
| } | |||||
| ], | |||||
| "aur_metadata": [ | |||||
| { | |||||
| "version": "10.5.1-1", | |||||
| "project_url": "https://www.mercurial-scm.org/doc/evolution/", | |||||
| "last_update": "2022-04-27T20:02:56+00:00", | |||||
| "pkgname": "hg-evolve", | |||||
| } | |||||
| ], | |||||
| }, | |||||
vlorentz: why is `"version"` in both? | |||||
franckbret (author): Because it's the associative key between the lines of both dicts, which the loader needs to get the corresponding information. | |||||
vlorentz: nevermind, it's fine | |||||
| Running tests | |||||
| ------------- | |||||
| Activate the virtualenv and run from within swh-lister directory:: | |||||
| pytest -s -vv --log-cli-level=DEBUG swh/lister/aur/tests | |||||
| Testing with Docker | |||||
| ------------------- | |||||
| Change directory to swh/docker then launch the docker environment:: | |||||
| docker-compose up -d | |||||
| Then connect to the lister:: | |||||
| docker exec -it docker_swh-lister_1 bash | |||||
| And run the lister (The output of this listing results in “oneshot” tasks in the scheduler):: | |||||
| swh lister run -l aur | |||||
| .. _aur.archlinux.org: https://aur.archlinux.org | |||||
| .. _New AUR Metadata Archives: https://lists.archlinux.org/pipermail/aur-general/2021-November/036659.html | |||||
| .. _packages-meta-v1.json.gz: https://aur.archlinux.org/packages-meta-v1.json.gz | |||||
| .. _regular: https://wiki.archlinux.org/title/PKGBUILD#Package_name | |||||
| .. _split: https://man.archlinux.org/man/PKGBUILD.5#PACKAGE_SPLITTING | |||||
| """ | |||||
| def register(): | |||||
| from .lister import AurLister | |||||
| return { | |||||
| "lister": AurLister, | |||||
| "task_modules": ["%s.tasks" % __name__], | |||||
| } | |||||
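For readers of this changeset, the origin data example in the docstring can be reproduced from the documented page schema roughly as follows. This is only a sketch: `build_origin` is a hypothetical helper invented for illustration, not a function from this diff, and it assumes each page is a dict with the keys listed under "Page listing".

```python
from typing import Any, Dict


def build_origin(page: Dict[str, Any]) -> Dict[str, Any]:
    """Map one page (documented schema) to the origin data shown in the docstring.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    return {
        "visit_type": "aur",
        "url": page["url"],
        "extra_loader_arguments": {
            "artifacts": [
                {
                    # File name is the last path segment of the snapshot url.
                    "filename": page["snapshot_url"].rsplit("/", 1)[-1],
                    "url": page["snapshot_url"],
                    "version": page["version"],
                }
            ],
            "aur_metadata": [
                {
                    # "version" appears in both lists so entries can be matched.
                    "version": page["version"],
                    "project_url": page["project_url"],
                    "last_update": page["last_modified"],
                    "pkgname": page["pkgname"],
                }
            ],
        },
    }


# Sample page using the values from the docstring example.
page = {
    "pkgname": "hg-evolve",
    "version": "10.5.1-1",
    "url": "https://aur.archlinux.org/hg-evolve.git",
    "snapshot_url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz",
    "project_url": "https://www.mercurial-scm.org/doc/evolution/",
    "last_modified": "2022-04-27T20:02:56+00:00",
}
origin = build_origin(page)
```

The shared `version` value is what lets a consumer pair an artifact with its metadata entry, which is the point raised in the inline discussion.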
I don't think that's accurate; we do archive split packages, but only their "pkgbase" because that is the only one that actually has source code.
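To make the pkgbase discussion concrete, here is a minimal sketch of the filter the docstring describes, which keeps only index entries where pkgbase equals pkgname. It assumes the decompressed index is a JSON list of objects with `Name` and `PackageBase` keys; those key names, the helper `iter_pkgbase_entries`, and the `example-split` sample entries are assumptions for illustration, and the sample is built in memory rather than downloaded.

```python
import gzip
import json
from typing import Any, Dict, Iterator, List


def iter_pkgbase_entries(archive: bytes) -> Iterator[Dict[str, Any]]:
    """Yield index entries whose package name equals their package base.

    Hypothetical helper; the "Name"/"PackageBase" key names are assumed.
    """
    entries: List[Dict[str, Any]] = json.loads(gzip.decompress(archive))
    for entry in entries:
        if entry["PackageBase"] == entry["Name"]:
            yield entry


# In-memory stand-in for the gzipped index: one regular package and one
# member of a split package (made-up names).
sample = gzip.compress(
    json.dumps(
        [
            {"Name": "hg-evolve", "PackageBase": "hg-evolve"},
            {"Name": "example-split-doc", "PackageBase": "example-split"},
        ]
    ).encode("utf-8")
)

kept = [entry["Name"] for entry in iter_pkgbase_entries(sample)]
```

With this filter, a member of a split package is skipped unless its name matches its base.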