
[WIP] Add arch lister module.
AbandonedPublic

Authored by franckbret on May 11 2022, 2:37 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T4233: Ingest Arch Linux
Summary

First stab at an Arch Linux lister.

Arch Linux provides several ways to discover packages, but no easy way to get the history of previously released versions of a package.
After some discussion on the Arch Linux forum (https://bbs.archlinux.org/viewtopic.php?id=275574), I've gone the git repository way.

This lister fetches a git repository and lists origins by parsing PKGBUILD files.

The Arch Linux distribution is made of the 'core', 'extra' and 'community' repositories.
'core' and 'extra' packages are listed in https://github.com/archlinux/svntogit-packages, and 'community' packages in https://github.com/archlinux/svntogit-community

For now it fetches only 'core' and 'extra' packages from the first repository (421.44 MiB at this time). I'll add the second one (1.58 GiB) if we are OK with the first implementation. Both git repositories receive several commits a day.

PKGBUILD files are bash scripts. The common way to build a package is to use makepkg, which has an internal PKGBUILD parser: https://gitlab.archlinux.org/pacman/pacman/blob/master/scripts/makepkg.sh.in
I did not find a PKGBUILD parser written in Python on PyPI. There is one Python module on GitHub named 'parched': https://github.com/sebnow/parched
I have written a naive parser, but it is not yet robust enough to handle all special cases.
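
To illustrate what a naive parser of this kind can and cannot handle, here is a minimal sketch. It is not the parser from this diff: the function name `parse_simple_pkgbuild` and the sample PKGBUILD are assumptions made for illustration (the `source` URL is a placeholder). It only picks up scalar `key=value` assignments at the start of a line and skips arrays, functions and shell expansions, which is where the hard cases live.

```python
import re

# Matches scalar top-level assignments such as `pkgname=acl` or `pkgver='2.3.1'`.
# Values starting with "(" (bash arrays) are skipped on purpose.
ASSIGNMENT = re.compile(
    r"^(?P<key>[A-Za-z_][A-Za-z0-9_]*)=(?P<value>(?!\().*)$", re.MULTILINE
)


def parse_simple_pkgbuild(text: str) -> dict:
    """Return scalar `key=value` assignments found at the start of a line."""
    fields = {}
    for match in ASSIGNMENT.finditer(text):
        value = match.group("value").strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        fields[match.group("key")] = value
    return fields


# Hypothetical PKGBUILD excerpt; the source URL is a placeholder.
SAMPLE = '''\
pkgname=acl
pkgver=2.3.1
pkgrel=2
pkgdesc="Access control list utilities, libraries and headers"
url="https://savannah.nongnu.org/projects/acl"
source=("https://example.org/$pkgname-$pkgver.tar.gz")
'''

fields = parse_simple_pkgbuild(SAMPLE)
print(fields["pkgname"], fields["pkgver"], fields["pkgrel"])
```

The `source` array is ignored here, and a value such as `pkgver=${_base}.1` would be returned unexpanded, which shows why a regex-only approach breaks down on real-world PKGBUILD files.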

Examples of PKGBUILD files I've found that can be really hard to parse:

Related to T4233

Test Plan

I will run this in Docker to see how long the first run takes and to evaluate the accuracy of the parsing results.

Diff Detail

Repository
rDLS Listers
Branch
archlinux
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29279
Build 45775: Phabricator diff pipeline on jenkins
Build 45774: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7812 (id=28215)

Rebasing onto aa8c8cb3bc...

Current branch diff-target is up to date.
Changes applied before test
commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/console

Harbormaster returned this revision to the author for changes because remote builds failed.May 11 2022, 2:38 PM
Harbormaster failed remote builds in B29277: Diff 28215!

Build is green

Patch application report for D7812 (id=28217)

Could not rebase; Attempt merge onto aa8c8cb3bc...

Updating aa8c8cb..5a1dd24
Fast-forward
 CONTRIBUTORS                                       |   1 +
 setup.py                                           |   1 +
 swh/lister/arch/__init__.py                        |  12 +
 swh/lister/arch/lister.py                          | 303 +++++++++++++++++++++
 swh/lister/arch/tasks.py                           |  19 ++
 swh/lister/arch/tests/__init__.py                  |  31 +++
 .../fake-archlinux-svntogit-packages-index.tar.gz  | Bin 0 -> 12173 bytes
 .../tests/data/fake_archlinux_repository_init.sh   | 129 +++++++++
 swh/lister/arch/tests/test_lister.py               | 131 +++++++++
 swh/lister/arch/tests/test_tasks.py                |  19 ++
 10 files changed, 646 insertions(+)
 create mode 100644 swh/lister/arch/__init__.py
 create mode 100644 swh/lister/arch/lister.py
 create mode 100644 swh/lister/arch/tasks.py
 create mode 100644 swh/lister/arch/tests/__init__.py
 create mode 100644 swh/lister/arch/tests/data/fake-archlinux-svntogit-packages-index.tar.gz
 create mode 100755 swh/lister/arch/tests/data/fake_archlinux_repository_init.sh
 create mode 100644 swh/lister/arch/tests/test_lister.py
 create mode 100644 swh/lister/arch/tests/test_tasks.py
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/521/ for more details.

Updating D7812: [WIP] Add arch lister module.

Build is green

Patch application report for D7812 (id=28218)

Rebasing onto aa8c8cb3bc...

Current branch diff-target is up to date.
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.
    
    For now it fetch a git repository to list origin parsing PKGBUILD files.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/522/ for more details.

I've run several experiments to find a better way to list Arch Linux packages.

The most efficient way I've found is to download the tar.gz files, which contain one directory per package, each with a "desc" file holding easy-to-parse metadata. It works fine but retrieves only the latest version of each package.

Here are some execution times for downloading each archive and parsing its desc files.

Found 266 packages from https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz in 1.4924319160054438 seconds

Found 3035 packages from https://archive.archlinux.org/repos/last/extra/os/x86_64/extra.files.tar.gz in 5.644616681995103 seconds

Found 9161 packages from https://archive.archlinux.org/repos/last/community/os/x86_64/community.files.tar.gz in 16.14458583202213 seconds

Example of retrieved package data after parsing:

{'arch': 'x86_64',
 'repo': 'core',
 'base': 'acl',
 'builddate': '1643730617',
 'conflicts': 'xfsacl',
 'csize': '138970',
 'desc': 'Access control list utilities, libraries and headers',
 'filename': 'acl-2.3.1-2-x86_64.pkg.tar.zst',
 'isize': '325349',
 'license': 'LGPL',
 'md5sum': '718c93159ce4dfc6f789ffe27ce276e8',
 'name': 'acl',
 'packager': 'Christian Hesse <eworm@archlinux.org>',
 'pgpsig': 'iHUEABYIAB0WIQQEKYl95fO9rFN6MGltQr3RFuAGjwUCYflW2QAKCRBtQr3RFuAGj/waAP9U7gJZ0YRfftuGdc4shJdSIfspuWb3nZK+fj7My5z4zQD/SBpepSM3Cxr8Pw2LU5adq4UI0HWFZFsHrg3179XJqgI=',
 'project_url': 'https://savannah.nongnu.org/projects/acl',
 'replaces': 'xfsacl',
 'sha256sum': '20873a994a0728de5b05857129c290e9a8c9bba2236cc30bcffa7b746ffe9218',
 'url': 'https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst',
 'version': '2.3.1-2'}

If we are OK with getting only the latest version, we can go this way.
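
For reference, a minimal sketch of the desc parsing described above. It assumes the usual pacman database layout, where each `desc` file is a series of `%KEY%` headers followed by value lines; the function names and the in-memory test archive are made up for illustration, and fields such as 'repo' or the archive 'url' shown in the example output would have to be added by the lister itself.

```python
import io
import tarfile


def parse_desc(text: str) -> dict:
    """Parse a pacman `desc` file: `%KEY%` headers followed by value lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if len(line) > 2 and line.startswith("%") and line.endswith("%"):
            key = line.strip("%").lower()
            fields[key] = []
        elif line and key is not None:
            fields[key].append(line)
    # Join multi-line values so that every field is a plain string.
    return {k: " ".join(v) for k, v in fields.items()}


def iter_packages(archive_bytes: bytes):
    """Yield one parsed dict per `<name>-<version>/desc` member of the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                yield parse_desc(tar.extractfile(member).read().decode("utf-8"))


def fake_archive() -> bytes:
    """Build a tiny in-memory archive so the sketch runs without network access."""
    desc = b"%NAME%\nacl\n\n%VERSION%\n2.3.1-2\n\n%ARCH%\nx86_64\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("acl-2.3.1-2/desc")
        info.size = len(desc)
        tar.addfile(info, io.BytesIO(desc))
    return buf.getvalue()


packages = list(iter_packages(fake_archive()))
print(packages)
```

In a real run, `archive_bytes` would be the downloaded content of one of the `*.files.tar.gz` URLs listed above.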

Nonetheless, it's possible to get other versions of a package through two different strategies, each with pros and cons:

  1. Download the index https://archive.archlinux.org/packages/.all/index.0.xz, which contains a file listing previous versions, for example:
mercurial-4.8.2-1-x86_64
mercurial-4.9-1-x86_64
mercurial-4.9.1-1-x86_64
mercurial-5.0-1-x86_64
mercurial-5.0.1-1-x86_64
mercurial-5.0.2-1-x86_64
mercurial-5.1-1-x86_64
mercurial-5.1.2-1-x86_64
mercurial-5.2-1-x86_64
mercurial-5.2.1-1-x86_64
mercurial-5.2.2-1-x86_64
mercurial-5.2.2-2-x86_64
mercurial-5.3-1-x86_64
mercurial-5.3.1-1-x86_64
mercurial-5.3.2-1-x86_64
mercurial-5.4-1-x86_64
mercurial-5.4.1-1-x86_64
mercurial-5.4-2-x86_64
mercurial-5.4.2-1-x86_64
mercurial-5.5-1-x86_64
mercurial-5.5.1-1-x86_64
mercurial-5.5.2-1-x86_64
mercurial-5.6-1-x86_64
mercurial-5.6.1-1-x86_64
mercurial-5.6-2-x86_64
mercurial-5.6-3-x86_64
mercurial-5.7-1-x86_64
mercurial-5.7.1-1-x86_64
mercurial-5.8-1-x86_64
mercurial-5.8.1-1-x86_64
mercurial-5.8-2-x86_64
mercurial-5.9.1-1-x86_64
mercurial-5.9.1-2-x86_64
mercurial-5.9.2-1-x86_64
mercurial-5.9.3-1-x86_64
mercurial-6.0-1-x86_64
mercurial-6.0.1-1-x86_64
mercurial-6.0-2-x86_64
mercurial-6.0.2-1-x86_64
mercurial-6.0-3-x86_64
mercurial-6.0.3-1-x86_64
mercurial-6.1-1-x86_64
mercurial-6.1.1-1-x86_64
mercurial-6.1-2-x86_64
mercurial-6.1.2-1-x86_64

Pro: One 500 kB file to download, one dynamic regex to find matches
Cons: We only get a filename: no date, no metadata. The file has roughly 400,000 entries. It took 16 min for the regex to find matches for roughly 15,000 packages...

  2. Scrape the server directory listing to get previous versions of a package with their release dates, for example https://archive.archlinux.org/packages/m/mercurial/

Pro: Easy to scrape, and a release date is associated with each version
Cons: Scraping roughly 15,000 pages can be quite slow; no metadata
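
Both strategies can be sketched in a few lines. This is an illustration under stated assumptions, not the code that produced the timings above. For the index, it assumes every entry has the shape `name-pkgver-pkgrel-arch` and that only the name may contain hyphens, so a single pass with `rsplit` can replace one regex per package. For the directory listing, the row layout (a link followed by a `DD-Mon-YYYY HH:MM` timestamp) is an assumption about the server's HTML, and the sample rows use placeholder dates and sizes.

```python
import re
from collections import defaultdict
from datetime import datetime


def group_versions(index_lines):
    """Strategy 1: group `name-pkgver-pkgrel-arch` entries by name in one pass."""
    versions = defaultdict(list)
    for line in index_lines:
        parts = line.strip().rsplit("-", 3)
        if len(parts) != 4:
            continue  # Not a `name-pkgver-pkgrel-arch` entry; skip it.
        name, pkgver, pkgrel, arch = parts
        versions[name].append((f"{pkgver}-{pkgrel}", arch))
    return dict(versions)


# Assumed row layout of the directory listing: a link followed by a
# `DD-Mon-YYYY HH:MM` timestamp. Signature files (`.sig`) do not match.
ROW = re.compile(
    r'<a href="(?P<href>[^"]+\.pkg\.tar\.(?:xz|zst|gz))">[^<]*</a>\s+'
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2})"
)


def parse_listing(html):
    """Strategy 2: extract (filename, release date) pairs from a listing page."""
    return [
        (m.group("href"), datetime.strptime(m.group("date"), "%d-%b-%Y %H:%M"))
        for m in ROW.finditer(html)
    ]


# Entries taken from the index excerpt above.
index_sample = [
    "mercurial-6.1.1-1-x86_64",
    "mercurial-6.1-2-x86_64",
    "mercurial-6.1.2-1-x86_64",
]
versions = group_versions(index_sample)

# Hypothetical listing rows; the dates and sizes are placeholders.
listing_sample = (
    '<a href="mercurial-6.1.2-1-x86_64.pkg.tar.zst">'
    "mercurial-6.1.2-1-x86_64.pkg.tar.zst</a>  01-Jan-2022 00:00  5M\n"
    '<a href="mercurial-6.1.2-1-x86_64.pkg.tar.zst.sig">'
    "mercurial-6.1.2-1-x86_64.pkg.tar.zst.sig</a>  01-Jan-2022 00:00  310\n"
)
releases = parse_listing(listing_sample)

print(versions["mercurial"])
print(releases)
```

Grouping the whole index once builds a dictionary keyed by package name, so looking up the versions of each of the roughly 15,000 packages no longer requires rescanning the roughly 400,000 entries.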

@vlorentz @ardumont @bchauvet what do you think, which do you prefer?

Also, should I cancel this issue and create a new one to continue?

Note:
There is a snippet that lists the contents of pacman databases as JSON for web applications [1]
[1] https://bbs.archlinux.org/viewtopic.php?pid=1969414#p1969414

If we are OK with getting only the latest version, we can go this way.

(as a data point) That's currently the way we retrieve information for CRAN
packages. CRAN (infra) only exposes the latest version of a package (it exposes archived
versions through a dedicated instance that we are not currently listing).

But our lister lists them every day, so from the moment we started ingesting them, we
should already have several versions of a given package. At some point, we'll have to
attend to the archived ones as well.

So I guess, given the experiments you reported here (through the description and
this very comment), it'd be OK to do the same as for CRAN here.

Nonetheless, it's possible to get other versions of a package through two different
strategies, each with pros and cons:

  1. Download the index https://archive.archlinux.org/packages/.all/index.0.xz, which contains a file listing previous versions, for example:
mercurial-4.8.2-1-x86_64
mercurial-4.9-1-x86_64
mercurial-4.9.1-1-x86_64
mercurial-5.0-1-x86_64
...

Pro: One 500 kB file to download, one dynamic regex to find matches
Cons: We only get a filename: no date, no metadata. The file has roughly 400,000 entries.
It took 16 min for the regex to find matches for roughly 15,000 packages...

  2. Scrape the server directory listing to get previous versions of a package with their
release dates, for example https://archive.archlinux.org/packages/m/mercurial/

Pro: Easy to scrape, and a release date is associated with each version
Cons: Scraping roughly 15,000 pages can be quite slow; no metadata

@vlorentz @ardumont @bchauvet what do you think, which do you prefer?

As mentioned, I'd go for the simplest solution (the first one, which allows simpler
metadata retrieval, for the latest version only).

@vlorentz @bchauvet thoughts?

Also, should I cancel this issue and create a new one to continue?

You can go either way. If you keep that one, it'd be easier to compare with your future
version (and the future review will be simpler, no noisy old comments). If you keep it,
we can still find its initial version through the history tab (within the web ui).

Well, go simple, create a new one? (yeah, the opposite of what I said to you on IRC on
Friday ¯\_(ツ)_/¯ ;)

Cheers,

Some early comments before I forget, in case they are useful. Feel free to ignore them if you are going to change these parts of the code anyway.

swh/lister/arch/lister.py
Line 85

This would also remove both quotes when a string is meant to contain them. Use re.sub("""["'](.*)["']""", ...) instead.
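
A small sketch of the idea, assuming the reviewed line removed quote characters wherever they appeared in the value. The pattern below is a variant of the one suggested above, not the reviewer's exact expression: it is anchored to the whole value and uses a backreference so that only one matching pair of surrounding quotes is removed. The helper name `unquote` is hypothetical.

```python
import re

# A backreference (\1) requires the closing quote to match the opening one,
# and the anchors make sure only the outermost pair is removed.
QUOTED = re.compile(r"""^(["'])(.*)\1$""")


def unquote(value: str) -> str:
    """Remove one pair of matching surrounding quotes, keeping inner quotes."""
    return QUOTED.sub(r"\2", value.strip())


print(unquote('"Access control list utilities"'))
print(unquote("\"it's a test\""))
print(unquote("unquoted"))
```

Values with inner quotes keep them, and a value with mismatched surrounding quotes is returned unchanged rather than being partially stripped.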

Lines 145–148

Shell injection: pkgbuild_path is not trusted, and it is neither escaped nor validated.

Try Dulwich; it's more reliable than parsing Git's output anyway.
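
If shelling out to git is kept rather than switching to Dulwich, one way to avoid the injection is to pass the command as an argument list so that no shell ever parses the path. This is a sketch under assumptions: the helper names and the hostile path are hypothetical, and the git invocation is only built here, not executed. A small Python child process stands in for git to show that the argument arrives untouched.

```python
import subprocess
import sys


def git_log_argv(repo_dir: str, pkgbuild_path: str) -> list:
    """Build a git command as a list; `--` keeps the path from being read as an option."""
    return ["git", "-C", repo_dir, "log", "--format=%H %ct", "--", pkgbuild_path]


def run_without_shell(argv: list) -> str:
    """Run a command given as a list; the arguments are never parsed by a shell."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


# Hypothetical hostile path that would run a second command under `shell=True`.
hostile_path = "core/acl/PKGBUILD; echo injected"

# Stand-in for git: a child process that prints its first argument verbatim.
echo_argv = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile_path]
output = run_without_shell(echo_argv)

print(output, end="")
print(git_log_argv("/tmp/repo", hostile_path))
```

The path is still untrusted data, so validating it against the repository layout remains worthwhile even when no shell is involved.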

Lines 223–225

Why `in`?

Lines 260–264

Pass the list of archs and source packages; the loader can deal with them by creating them as different releases in the same origin (kind of like this: https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/pypi/loader.py$121-126 ).
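
A rough sketch of what that grouping could look like on the lister side, assuming package dicts shaped like the parsed example earlier in this thread. The function name `group_by_origin` and the second sample entry are hypothetical; how the resulting list is handed to the loader is not shown, since it depends on the loader's interface.

```python
from collections import defaultdict


def group_by_origin(packages):
    """Collect every (version, arch) artifact of a package under one origin."""
    origins = defaultdict(list)
    for pkg in packages:
        origins[pkg["name"]].append(
            {
                "version": pkg["version"],
                "arch": pkg["arch"],
                "url": pkg["url"],
                "filename": pkg["filename"],
            }
        )
    return dict(origins)


packages = [
    {  # Values from the parsed example in this thread.
        "name": "acl",
        "version": "2.3.1-2",
        "arch": "x86_64",
        "filename": "acl-2.3.1-2-x86_64.pkg.tar.zst",
        "url": "https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst",
    },
    {  # Hypothetical older build, for illustration only.
        "name": "acl",
        "version": "2.3.1-1",
        "arch": "x86_64",
        "filename": "acl-2.3.1-1-x86_64.pkg.tar.zst",
        "url": "https://archive.archlinux.org/packages/.all/acl-2.3.1-1-x86_64.pkg.tar.zst",
    },
]

origins = group_by_origin(packages)
print(list(origins), len(origins["acl"]))
```

Each origin then carries a list of artifacts, which the loader could turn into separate releases of the same origin.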


The new patch, which fetches archives instead of a git repository, is D7894

Awesome, thanks.

You can close this one then (it won't disappear).