[WIP] Add arch lister module.
Needs ReviewPublic

Authored by franckbret on Wed, May 11, 2:37 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T4233: Ingest Arch Linux
Summary

First stab at an Arch Linux lister.

Arch Linux provides several ways to discover packages, but no easy way to get the history of previously released versions of a package.
After some discussion on the Arch Linux forum (https://bbs.archlinux.org/viewtopic.php?id=275574), I've gone the git repository way.

This lister fetches a git repository to list origins, parsing PKGBUILD files.

The Arch Linux distribution is made of the 'core', 'extra' and 'community' repositories.
'core' and 'extra' packages are listed in https://github.com/archlinux/svntogit-packages, and 'community' packages in https://github.com/archlinux/svntogit-community

For now it fetches only 'core' and 'extra' packages from the first repository (421.44 MiB at this time). I'll add the second one (1.58 GiB) if we are OK with the first implementation. Both git repositories receive several commits a day.

PKGBUILD files are bash scripts. The common way to build a package is to use makepkg, which has an internal PKGBUILD parser: https://gitlab.archlinux.org/pacman/pacman/blob/master/scripts/makepkg.sh.in
I did not find a PKGBUILD parser written in Python on PyPI. There is one Python module on GitHub named 'parched': https://github.com/sebnow/parched
I've written a naïve parser, but it is not yet solid enough to handle all special cases.
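
To illustrate what a naïve parser of this kind can look like, here is a minimal sketch. It is not the code from this diff: the function name `parse_pkgbuild` and the choice of fields are assumptions made for the example. It only reads simple top-level scalar assignments and never executes the file, so arrays, split packages (`pkgname=(a b)`) and variable substitutions (`pkgver=${_ver}`) come back unexpanded. Those are exactly the special cases that make PKGBUILD files hard to parse.

```python
import re
from typing import Dict

# Hypothetical field selection for the example: only simple scalar assignments
# at the start of a line are matched; functions and other variables are ignored.
_ASSIGNMENT = re.compile(r"^(pkgname|pkgbase|pkgver|pkgrel|url)=(.*)$", re.MULTILINE)


def parse_pkgbuild(text: str) -> Dict[str, str]:
    """Extract a few scalar fields from PKGBUILD content without executing it."""
    fields: Dict[str, str] = {}
    for key, raw in _ASSIGNMENT.findall(text):
        value = raw.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        # Keep the first assignment seen for each key.
        fields.setdefault(key, value)
    return fields
```

Values that depend on bash evaluation are returned as written in the file, so a caller would still need a fallback (for example running makepkg's own parser) for those packages.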

Examples of PKGBUILD files I've found that can be really hard to parse:

Related to T4233

Test Plan

I will run this in Docker to see how long the first run takes and to evaluate the accuracy of the parsing results.

Diff Detail

Repository
rDLS Listers
Branch
archlinux
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29280
Build 45777: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 45776: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7812 (id=28215)

Rebasing onto aa8c8cb3bc...

Current branch diff-target is up to date.
Changes applied before test
commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/console

Harbormaster returned this revision to the author for changes because remote builds failed. Wed, May 11, 2:38 PM
Harbormaster failed remote builds in B29277: Diff 28215!

Build is green

Patch application report for D7812 (id=28217)

Could not rebase; Attempt merge onto aa8c8cb3bc...

Updating aa8c8cb..5a1dd24
Fast-forward
 CONTRIBUTORS                                       |   1 +
 setup.py                                           |   1 +
 swh/lister/arch/__init__.py                        |  12 +
 swh/lister/arch/lister.py                          | 303 +++++++++++++++++++++
 swh/lister/arch/tasks.py                           |  19 ++
 swh/lister/arch/tests/__init__.py                  |  31 +++
 .../fake-archlinux-svntogit-packages-index.tar.gz  | Bin 0 -> 12173 bytes
 .../tests/data/fake_archlinux_repository_init.sh   | 129 +++++++++
 swh/lister/arch/tests/test_lister.py               | 131 +++++++++
 swh/lister/arch/tests/test_tasks.py                |  19 ++
 10 files changed, 646 insertions(+)
 create mode 100644 swh/lister/arch/__init__.py
 create mode 100644 swh/lister/arch/lister.py
 create mode 100644 swh/lister/arch/tasks.py
 create mode 100644 swh/lister/arch/tests/__init__.py
 create mode 100644 swh/lister/arch/tests/data/fake-archlinux-svntogit-packages-index.tar.gz
 create mode 100755 swh/lister/arch/tests/data/fake_archlinux_repository_init.sh
 create mode 100644 swh/lister/arch/tests/test_lister.py
 create mode 100644 swh/lister/arch/tests/test_tasks.py
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/521/ for more details.

Updating D7812: [WIP] Add arch lister module.

Build is green

Patch application report for D7812 (id=28218)

Rebasing onto aa8c8cb3bc...

Current branch diff-target is up to date.
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/522/ for more details.

I've run several experiments to find a better way to list Arch Linux packages.

The most efficient way I've found is to download tar.gz files that contain one directory per package, each holding a "desc" file with easy-to-parse metadata. It works fine but retrieves only the latest version of each package.

Here are some execution time metrics for downloading the archives and parsing the desc files.

Found 266 packages from https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz in 1.4924319160054438 seconds

Found 3035 packages from https://archive.archlinux.org/repos/last/extra/os/x86_64/extra.files.tar.gz in 5.644616681995103 seconds

Found 9161 packages from https://archive.archlinux.org/repos/last/community/os/x86_64/community.files.tar.gz in 16.14458583202213 seconds

Example of retrieved package data after parsing:

{'arch': 'x86_64',
 'repo': 'core',
 'base': 'acl',
 'builddate': '1643730617',
 'conflicts': 'xfsacl',
 'csize': '138970',
 'desc': 'Access control list utilities, libraries and headers',
 'filename': 'acl-2.3.1-2-x86_64.pkg.tar.zst',
 'isize': '325349',
 'license': 'LGPL',
 'md5sum': '718c93159ce4dfc6f789ffe27ce276e8',
 'name': 'acl',
 'packager': 'Christian Hesse <eworm@archlinux.org>',
 'pgpsig': 'iHUEABYIAB0WIQQEKYl95fO9rFN6MGltQr3RFuAGjwUCYflW2QAKCRBtQr3RFuAGj/waAP9U7gJZ0YRfftuGdc4shJdSIfspuWb3nZK+fj7My5z4zQD/SBpepSM3Cxr8Pw2LU5adq4UI0HWFZFsHrg3179XJqgI=',
 'project_url': 'https://savannah.nongnu.org/projects/acl',
 'replaces': 'xfsacl',
 'sha256sum': '20873a994a0728de5b05857129c290e9a8c9bba2236cc30bcffa7b746ffe9218',
 'url': 'https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst',
 'version': '2.3.1-2'}
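
As a rough sketch of the parsing step, assuming the usual pacman "desc" layout (a `%FIELD%` header line, one or more value lines, then a blank line), something like the following could be used. The function names `parse_desc` and `iter_packages` are hypothetical and this is not the code that produced the dict above. Unlike that dict, the sketch keeps every field as a list, since fields such as dependencies can span several lines.

```python
import tarfile
from typing import BinaryIO, Dict, Iterator, List, Optional


def parse_desc(content: str) -> Dict[str, List[str]]:
    """Parse the text of a "desc" file into a {field: [values]} mapping."""
    data: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current field block.
            current = None
        elif current is None and len(line) > 2 and line[0] == "%" and line[-1] == "%":
            # Header line such as %NAME%: start a new field.
            current = line.strip("%").lower()
            data[current] = []
        elif current is not None:
            data[current].append(line)
    return data


def iter_packages(fileobj: BinaryIO) -> Iterator[Dict[str, List[str]]]:
    """Yield parsed metadata for each ".../desc" member of a gzipped tarball."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield parse_desc(extracted.read().decode("utf-8"))
```

With a downloaded archive opened in binary mode, `list(iter_packages(f))` would give one mapping per package; building the archive URL or adding the `repo` and `arch` keys shown above is left to the caller.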

If we are OK with getting only the latest version, we can go this way.

Nonetheless, it's possible to get other versions of a package through two different strategies, each with some pros and cons:

  1. Download the index https://archive.archlinux.org/packages/.all/index.0.xz, which contains a file that lists several previous versions, for example:
mercurial-4.8.2-1-x86_64
mercurial-4.9-1-x86_64
mercurial-4.9.1-1-x86_64
mercurial-5.0-1-x86_64
mercurial-5.0.1-1-x86_64
mercurial-5.0.2-1-x86_64
mercurial-5.1-1-x86_64
mercurial-5.1.2-1-x86_64
mercurial-5.2-1-x86_64
mercurial-5.2.1-1-x86_64
mercurial-5.2.2-1-x86_64
mercurial-5.2.2-2-x86_64
mercurial-5.3-1-x86_64
mercurial-5.3.1-1-x86_64
mercurial-5.3.2-1-x86_64
mercurial-5.4-1-x86_64
mercurial-5.4.1-1-x86_64
mercurial-5.4-2-x86_64
mercurial-5.4.2-1-x86_64
mercurial-5.5-1-x86_64
mercurial-5.5.1-1-x86_64
mercurial-5.5.2-1-x86_64
mercurial-5.6-1-x86_64
mercurial-5.6.1-1-x86_64
mercurial-5.6-2-x86_64
mercurial-5.6-3-x86_64
mercurial-5.7-1-x86_64
mercurial-5.7.1-1-x86_64
mercurial-5.8-1-x86_64
mercurial-5.8.1-1-x86_64
mercurial-5.8-2-x86_64
mercurial-5.9.1-1-x86_64
mercurial-5.9.1-2-x86_64
mercurial-5.9.2-1-x86_64
mercurial-5.9.3-1-x86_64
mercurial-6.0-1-x86_64
mercurial-6.0.1-1-x86_64
mercurial-6.0-2-x86_64
mercurial-6.0.2-1-x86_64
mercurial-6.0-3-x86_64
mercurial-6.0.3-1-x86_64
mercurial-6.1-1-x86_64
mercurial-6.1.1-1-x86_64
mercurial-6.1-2-x86_64
mercurial-6.1.2-1-x86_64

Pros: one 500 kB file to download, one dynamic regex to find matches
Cons: we only get a filename, no date, no metadata. The file has about 400,000 entries. It took 16 min for the regex to find matches for about 15,000 packages...

  2. Scrape the server directory listing to get previous versions of a package with their release dates, for example https://archive.archlinux.org/packages/m/mercurial/

Pros: easy to scrape, and a release date is associated with each version
Cons: scraping about 15,000 pages can be quite slow, no metadata
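
For the index-based strategy, one possible alternative to a per-package regex is to read the index once and group entries by package name. The sketch below assumes every line has the shape `name-pkgver-pkgrel-arch`, as in the mercurial example, and relies on pkgver, pkgrel and arch not containing hyphens, so splitting from the right is safe even when the package name itself has hyphens. The function name `group_versions` is hypothetical and I have not timed this against the real index.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_versions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group index entries by package name in a single pass."""
    versions: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        # Split from the right: name may contain "-", the other parts may not.
        parts = line.strip().rsplit("-", 3)
        if len(parts) != 4:
            continue  # Skip lines that do not match the expected shape.
        name, pkgver, pkgrel, _arch = parts
        versions[name].append(f"{pkgver}-{pkgrel}")
    return dict(versions)
```

The lines could come from `lzma.open(path, "rt")` on the downloaded index. This still yields only names and versions, so the lack of dates and metadata noted in the cons remains.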

@vlorentz @ardumont @bchauvet what do you think? Which option do you prefer?

Also, should I close this revision and create a new one to continue?