swh/lister/aur/__init__.py
- This file was added.
# Copyright (C) 2022 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

"""
AUR (Arch User Repository) lister
=================================

The AUR lister lists origins from `aur.archlinux.org`_, the Arch User Repository.
For each package there is a git repository; we use the git url as the origin and the
snapshot url as the artifact for the loader to download.

Each git repository consists of a directory (whose name corresponds to the package name)
and at least two files, .SRCINFO and PKGBUILD, which are recipes for building the package.

Each package has a single version, the latest one. There are no archives of previous versions,
so the lister will always list one version per package.

As of August 2022, `aur.archlinux.org`_ lists 84438 packages. Please note that this number
is the total of `regular`_ and `split`_ packages. We decided not to list packages
belonging to split packages, as they do not have a valid snapshot url.
The number of packages is 78554 after removing the split ones.
vlorentz: I don't think that's accurate; we do archive split packages, but only their "pkgbase" because…
franckbret: I rephrase it, does it looks ok for you?

Origins retrieving strategy
---------------------------

An RPC API exists, but since saving bandwidth is recommended, the lister does not use it. See
`New AUR Metadata Archives`_ for more on this topic.

To get an index of all existing AUR packages we download `packages-meta-v1.json.gz`_,
a gzip-compressed JSON file listing all existing package definitions.

Each entry describes the latest released version of a package. The origin url
for a package is built using `pkgbase` and corresponds to a git repository.

Note that we list only standard packages (where pkgbase equals pkgname), not the ones
belonging to split packages.

It takes only a couple of minutes to download the 7 MB index archive and parse its
content.
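
The retrieval step above can be sketched as follows. This is a minimal illustration, not the lister's actual code: the function name ``parse_packages_index`` is hypothetical, the ``Name`` and ``PackageBase`` keys are assumed to match the fields of the metadata archive, and the ``foo-docs`` entry is made up to stand in for a member of a split package. The sketch works on in-memory bytes so it runs without network access.

```python
import gzip
import json
from typing import Any, Dict, List


def parse_packages_index(raw: bytes) -> List[Dict[str, Any]]:
    # Decompress the gzip archive and decode the JSON array it contains.
    packages = json.loads(gzip.decompress(raw).decode("utf-8"))
    # Keep only standard packages, i.e. entries whose package base equals
    # their own name (assumed keys: "PackageBase" and "Name").
    return [pkg for pkg in packages if pkg["PackageBase"] == pkg["Name"]]


# Tiny in-memory index standing in for packages-meta-v1.json.gz.
sample_index = gzip.compress(
    json.dumps(
        [
            {"Name": "hg-evolve", "PackageBase": "hg-evolve", "Version": "10.5.1-1"},
            # Hypothetical member of a split package: name differs from base.
            {"Name": "foo-docs", "PackageBase": "foo", "Version": "1.0-1"},
        ]
    ).encode("utf-8")
)

standard_packages = parse_packages_index(sample_index)
```

In a real run the bytes would come from an HTTP download of the index archive; only the parsing and filtering are shown here.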

Page listing
------------

Each page is related to one package. As it is not possible to get all previous
versions, it will always return one line.

Each page corresponds to a package with a `version`, a `url` for a Git
repository, a `project_url` which represents the upstream project url and
a canonical `snapshot_url` from which a tar.gz archive of the package can
be downloaded.

The data schema for each line is:

* **pkgname**: Package name
* **version**: Package version
* **url**: Git repository url for a package
* **snapshot_url**: Package download url
* **project_url**: Upstream project url if any
* **last_modified**: ISO 8601 last update date
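
As an illustration of the schema above, the following sketch maps one index entry to a page. It rests on several assumptions: ``AurPage`` and ``build_page`` are hypothetical names, the entry keys (``Name``, ``PackageBase``, ``Version``, ``URL``, ``LastModified``) are assumed to be those of the metadata archive, ``LastModified`` is assumed to be a Unix timestamp in seconds, and the URL patterns are inferred from the example origin in this document.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional, TypedDict


class AurPage(TypedDict):
    pkgname: str
    version: str
    url: str
    snapshot_url: str
    project_url: Optional[str]
    last_modified: str


BASE_URL = "https://aur.archlinux.org"


def build_page(entry: Dict[str, Any]) -> AurPage:
    # Convert the (assumed) epoch timestamp to a timezone-aware ISO 8601 string.
    last_modified = datetime.fromtimestamp(
        entry["LastModified"], tz=timezone.utc
    ).isoformat()
    return AurPage(
        pkgname=entry["Name"],
        version=entry["Version"],
        # The git url is built from the package base.
        url=f"{BASE_URL}/{entry['PackageBase']}.git",
        # For listed packages the name and the package base are identical.
        snapshot_url=f"{BASE_URL}/cgit/aur.git/snapshot/{entry['Name']}.tar.gz",
        # The upstream project url is optional.
        project_url=entry.get("URL"),
        last_modified=last_modified,
    )


page = build_page(
    {
        "Name": "hg-evolve",
        "PackageBase": "hg-evolve",
        "Version": "10.5.1-1",
        "URL": "https://www.mercurial-scm.org/doc/evolution/",
        "LastModified": 1651089776,
    }
)
```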

Origins from page
-----------------

The lister yields one origin per page.
The origin url corresponds to the git url of a package, for example ``https://aur.archlinux.org/{package}.git``.

Additionally we add some data to "extra_loader_arguments":

* **artifacts**: Represents data about the AUR package snapshot to download,
  following the :ref:`original-artifacts-json specification <original-artifacts-json>`
* **aur_metadata**: Stores all other interesting attributes that do not belong to artifacts.
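
A minimal sketch of how a page could be turned into origin data with these two extra arguments. The helper name ``build_origin`` is hypothetical, and a plain dict is used for simplicity; the actual lister may use a dedicated origin class instead.

```python
from typing import Any, Dict


def build_origin(page: Dict[str, Any]) -> Dict[str, Any]:
    pkgname = page["pkgname"]
    return {
        "visit_type": "aur",
        "url": page["url"],
        "extra_loader_arguments": {
            # Snapshot archive the loader should download.
            "artifacts": [
                {
                    "filename": f"{pkgname}.tar.gz",
                    "url": page["snapshot_url"],
                    "version": page["version"],
                }
            ],
            # Remaining attributes; "version" is repeated so the loader can
            # match each metadata entry with its artifact.
            "aur_metadata": [
                {
                    "version": page["version"],
                    "project_url": page["project_url"],
                    "last_update": page["last_modified"],
                    "pkgname": pkgname,
                }
            ],
        },
    }


origin = build_origin(
    {
        "pkgname": "hg-evolve",
        "version": "10.5.1-1",
        "url": "https://aur.archlinux.org/hg-evolve.git",
        "snapshot_url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz",
        "project_url": "https://www.mercurial-scm.org/doc/evolution/",
        "last_modified": "2022-04-27T20:02:56+00:00",
    }
)
```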

Origin data example::

    {
        "visit_type": "aur",
        "url": "https://aur.archlinux.org/hg-evolve.git",
        "extra_loader_arguments": {
            "artifacts": [
                {
                    "filename": "hg-evolve.tar.gz",
                    "url": "https://aur.archlinux.org/cgit/aur.git/snapshot/hg-evolve.tar.gz",  # noqa: B950
                    "version": "10.5.1-1",
                }
            ],
            "aur_metadata": [
                {
                    "version": "10.5.1-1",
                    "project_url": "https://www.mercurial-scm.org/doc/evolution/",
                    "last_update": "2022-04-27T20:02:56+00:00",
                    "pkgname": "hg-evolve",
                }
            ],
        },
    }
vlorentz: why is `"version"` in both?
franckbret: Because its the associative key for lines for both dict which is needed for the loader the get the corresponding informations.
vlorentz: nevermind, it's fine

Running tests
-------------

Activate the virtualenv and run from within swh-lister directory::

   pytest -s -vv --log-cli-level=DEBUG swh/lister/aur/tests

Testing with Docker
-------------------

Change directory to swh/docker then launch the docker environment::

   docker-compose up -d

Then connect to the lister::

   docker exec -it docker_swh-lister_1 bash

And run the lister (The output of this listing results in “oneshot” tasks in the scheduler)::

   swh lister run -l aur

.. _aur.archlinux.org: https://aur.archlinux.org
.. _New AUR Metadata Archives: https://lists.archlinux.org/pipermail/aur-general/2021-November/036659.html
.. _packages-meta-v1.json.gz: https://aur.archlinux.org/packages-meta-v1.json.gz
.. _regular: https://wiki.archlinux.org/title/PKGBUILD#Package_name
.. _split: https://man.archlinux.org/man/PKGBUILD.5#PACKAGE_SPLITTING
"""


def register():
    from .lister import AurLister

    return {
        "lister": AurLister,
        "task_modules": ["%s.tasks" % __name__],
    }
I don't think that's accurate; we do archive split packages, but only their "pkgbase" because that is the only one that actually has source code.