
Add arch lister module (origins from archives).
ClosedPublic

Authored by franckbret on May 25 2022, 3:08 PM.

Details

Summary

After a first attempt with D7812, this one uses a different strategy to
retrieve origins.

Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archive.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from the 'desc' file to build origin URLs.
Scrape the origin URL to get artifact metadata listing all versions of a package.
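
As a rough illustration of the second step, here is a minimal sketch of parsing a pacman-style 'desc' file, which stores each field as a `%FIELD%` header followed by one or more value lines and a blank line. The `parse_desc` helper, the sample content, and the origin URL template are all assumptions made for this example; the lister's actual code and URL pattern may differ.

```python
from typing import Dict, List, Optional


def parse_desc(text: str) -> Dict[str, List[str]]:
    """Parse a pacman-style 'desc' file into {field_name: [values]}."""
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current field.
            current = None
        elif current is None and len(line) > 2 and line[0] == "%" and line[-1] == "%":
            # A "%NAME%"-style header opens a new field.
            current = line.strip("%").lower()
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields


# Made-up sample content, for illustration only.
SAMPLE_DESC = """%FILENAME%
zlib-1:1.2.12-2-x86_64.pkg.tar.zst

%NAME%
zlib

%VERSION%
1:1.2.12-2

%ARCH%
x86_64
"""

meta = parse_desc(SAMPLE_DESC)

# Hypothetical URL template; not necessarily the one used by the lister.
ORIGIN_URL_TEMPLATE = "https://archlinux.org/packages/{repo}/{arch}/{name}"
origin_url = ORIGIN_URL_TEMPLATE.format(
    repo="core", arch=meta["arch"][0], name=meta["name"][0]
)
```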

Related to T4233

Diff Detail

Repository
rDLS Listers
Branch
archlinuxfromarchives
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29584
Build 46234: Phabricator diff pipeline on jenkins
Build 46233: arc lint + arc unit

Event Timeline


Build has FAILED

Patch application report for D7894 (id=28477)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit c6c40f0d61f42a4c4f528a12c038b0f9cbfbb8cf
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    Related: T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/533/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/533/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 25 2022, 3:25 PM
Harbormaster failed remote builds in B29567: Diff 28477!

Build is green

Patch application report for D7894 (id=28481)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit baa9dc5c10f8b2ca52922ee8842b7fa68becd44c
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    Related: T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/535/ for more details.

@ardumont @vlorentz

This one is ready for review.

I'm not sure that getting the size of package versions this way (the size_to_bytes method) is relevant: the value is not precise, so I doubt we can use it to check the size of a downloaded version, as the check would always fail. I can remove that part.
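
To make the precision concern concrete, here is a small sketch. It assumes the index page shows rounded, human-readable sizes such as "1.5M"; the `approx_size_to_bytes` helper is hypothetical and is not the size_to_bytes method from the diff.

```python
# Binary multipliers for the single-letter unit suffixes (assumed format).
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def approx_size_to_bytes(size: str) -> int:
    """Convert a rounded, human-readable size like '1.5M' to a byte count."""
    size = size.strip()
    unit = size[-1].upper()
    if unit in UNITS:
        return int(float(size[:-1]) * UNITS[unit])
    # No unit suffix: the value is already a plain number of bytes.
    return int(size)


# A file of exactly 1,600,000 bytes would be displayed as roughly "1.5M",
# so the round trip loses information and an equality check would fail.
real_size = 1_600_000
estimated = approx_size_to_bytes("1.5M")
matches = estimated == real_size
```

Because the displayed value is rounded before it is parsed, the result can only serve as an estimate, not as an integrity check.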

Updating D7894: [WIP] Add arch lister module (origins from archives).

Add 'name' and 'version' to artifacts dict.
This will be useful for the loader to list versions instead of guessing again through filename parsing.

Build is green

Patch application report for D7894 (id=28494)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit a16da2b6818e6127bf8aec92362706831f768116
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    Related T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/536/ for more details.

LGTM overall, but please do not use f-strings for logging statements, especially debug ones. https://docs.python.org/3/howto/logging.html#logging-variable-data
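
For reference, a minimal sketch of the logging style described in the linked howto. The `log_package` function and the message text are made up for this example.

```python
import logging

logger = logging.getLogger(__name__)


def log_package(name: str, version: str) -> None:
    # Discouraged: the f-string is formatted even when DEBUG is disabled.
    #   logger.debug(f"Found package {name} version {version}")
    # Preferred: the arguments are merged into the message only if the
    # record is actually emitted by a handler.
    logger.debug("Found package %s version %s", name, version)
```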

A bunch of nitpicks below:

swh/lister/arch/lister.py
57
133

Could you assert the href, to make sure we don't accidentally drop an unrelated link if they change the order?

148
148–153
150

Why break instead of continue? and shouldn't the log be an error?

161–166

ditto on all three accounts.

207

why is x86_64 hardcoded here?

208

filename = "foo.tar.gz" should be good enough and will avoid issues with "fun" file names

256–262

values computed here do not seem to be tested

258

I wonder if it would make more sense to use a relative URL here. (relative to the package's root)

280–283

ditto

swh/lister/arch/tests/test_lister.py
983–988

this will make failures more readable, as pytest will display a nice diff between the two lists

swh/lister/arch/tests/test_tasks.py
11–19

wrong file

franckbret marked 11 inline comments as done.

Updating D7894: [WIP] Add arch lister module (origins from archives).
Various code changes after @vlorentz review

franckbret added inline comments.
swh/lister/arch/lister.py
207

Because it's the only available path; see the following main links.

https://archive.archlinux.org/repos/last/core/os/
https://archive.archlinux.org/repos/last/extra/os/
https://archive.archlinux.org/repos/last/community/os/

By the way, only the 'any' and 'x86_64' architectures are available, and 'any' packages can be found under the x86_64 path, for example:

https://archive.archlinux.org/repos/last/core/os/x86_64/amd-ucode-20220509.b19cbdc-1-any.pkg.tar.zst

I also did not find any package that has two different architectures.

258

I'm not sure either. What would it change to do so?

swh/lister/arch/tests/test_lister.py
983–988

Ha, nice! I had a hard time building fixtures because of this.

swh/lister/arch/tests/test_tasks.py
11–19

uh, thx

Build is green

Patch application report for D7894 (id=28562)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit bdf53bec45751355467da8bbc3def22849952ac3
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    Related T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/537/ for more details.

I've skimmed through a bit and this does lgtm from afar so far.

swh/lister/arch/lister.py
207

Arch used to have x86, and might add a new architecture in the future, right? I'd rather not silently ignore many packages in case we miss the news.

And is https://uk.mirror.archlinuxarm.org/ out of scope for this lister? It looks similar, but subtly different...

swh/lister/arch/lister.py
258

Concretely, nothing. Absolute URLs just feel redundant.

franckbret added inline comments.
swh/lister/arch/lister.py
207

I've looked at it and it's almost the same. We can get the list of packages by downloading an index archive that contains all package names and metadata. The main difference is that we can't get a list of previous versions per package, so there is no need to scrape web pages in this case.

An interesting exchange, pasted here because I think it's somewhat relevant:

20:27 <+vlorentz> franckbret: did you check this out? https://wiki.archlinux.org/title/Aurweb_RPC_interface
07:30 <franckbret> vlorentz: yes, but did not play with it yet because its dedicated to aur packages only. For now i'm ending arm repo integration to arch lister

(I suppose ending is actually adding ^)

Updating D7894: [WIP] Add arch lister module (origins from archives).

Add the ability to also list arm packages from the unofficial archlinuxarm.org.

Build has FAILED

Patch application report for D7894 (id=28744)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 43c4dde679634fcc04b881bcecedc778dc909583
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/538/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/538/console

Regenerate data fixtures (jenkins failed on previous commit but tests pass on my machine)

Build has FAILED

Patch application report for D7894 (id=28746)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 9fb7f57e4eee8ab727162df9d8787b135f4682d0
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/539/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/539/console

Updating D7894: [WIP] Add arch lister module (origins from archives).

File size must be an int, not a string. There was an inconsistency in the arm case.

Build has FAILED

Patch application report for D7894 (id=28747)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 76edb2e7dfe37f64396d7e12f83645fea53b6de4
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/540/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/540/console

Build has FAILED

Patch application report for D7894 (id=28747)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 76edb2e7dfe37f64396d7e12f83645fea53b6de4
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/541/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/541/console

You can temporarily replace the pytest command with pytest -vv in tox.ini to get the full diff between the two results on Jenkins

Ok thanks! I suspect it's related to the Python version. I'm currently building a new venv with Python 3.7.3 to see if I can make the tests fail locally. Will try your hack if that's not the case.

Well, works fine too with python3.7..

platform linux -- Python 3.7.3, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/franck/workspace/swh-environment/swh-lister, configfile: pytest.ini
plugins: mock-3.7.0, requests-mock-1.9.3, postgresql-3.1.3, hypothesis-6.47.2, redis-2.4.0, swh.core-2.11, forked-1.4.0, xdist-2.5.0, django-4.5.2, django-test-migrations-1.2.0, dash-2.5.0, flask-1.2.0, asyncio-0.18.3, anyio-3.6.1, swh.journal-1.0.1.dev14+g6b05a6c
asyncio: mode=strict
collected 3 items

swh/lister/arch/tests/test_lister.py .
swh/lister/arch/tests/test_tasks.py ..

3 passed, 4 warnings in 9.65s

Let's try with -vv in tox.ini

Updating D7894: [WIP] Add arch lister module (origins from archives).

Temporarily add the -vv option to pytest in tox.ini in order to see a complete log of errors, as I cannot reproduce the failure in my development environment (tests pass locally with python3.7.3 and python3.9.2).

Build has FAILED

Patch application report for D7894 (id=28749)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 89d3f98cf32dbe6d015819cb0c6401a848616276
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/542/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/542/console

Updating D7894: [WIP] Add arch lister module (origins from archives).

One dash not two for pytest verbose option

Build has FAILED

Patch application report for D7894 (id=28752)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 2ba4b5dcd8151d6b2201982f94ef5d552bb3d715
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/543/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/543/console

Ah, it's because your tests expect the timezone to be UTC+2 to pass on the aarch64 repo!

Updating D7894: [WIP] Add arch lister module (origins from archives).

Ensure last_modified datetime entries are UTC-aware (naive datetimes were the reason the CI failed previously).
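
A minimal sketch of the idea, assuming the timestamps come either as "YYYY-MM-DD HH:MM" strings or as epoch seconds. Both helper names and the input formats are assumptions made for this example, not the code from the diff.

```python
from datetime import datetime, timezone


def parse_last_modified(value: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string and mark it explicitly as UTC."""
    naive = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=timezone.utc)


def from_epoch(seconds: int) -> datetime:
    """Convert epoch seconds to an aware UTC datetime."""
    # Without tz=..., fromtimestamp() uses the machine's local timezone,
    # so the result would differ between a UTC+2 laptop and a UTC CI worker.
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

With an explicit timezone, the values compared in the tests no longer depend on where the tests run.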

Build is green

Patch application report for D7894 (id=28763)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit ba6a39d53b3f56a27ad9c84261f272b13136b202
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    [WIP] Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/544/ for more details.

Updating D7894: Add arch lister module (origins from archives).

Remove the verbose option from tox.ini (temporarily used to identify the test failing on CI).

Build is green

Patch application report for D7894 (id=28764)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 0f95a39b9884c67f06a5ce024865956752f5e2d7
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/545/ for more details.

franckbret retitled this revision from "[WIP] Add arch lister module (origins from archives)." to "Add arch lister module (origins from archives).". Jun 14 2022, 9:38 AM

Updating D7894: Add arch lister module (origins from archives).

Fix an issue with the regex that parses filenames (escaping is needed for filenames like 'dvd+rw-tools').
Add missing variables to some logger calls.
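
To illustrate the escaping problem, here is a small sketch. The regex and the `version_from_filename` helper are hypothetical and simplified, and the filename is illustrative; the pattern used in the diff is not shown here.

```python
import re
from typing import Optional


def version_from_filename(name: str, filename: str) -> Optional[str]:
    """Extract the version part from '<name>-<version>-<arch>.pkg.tar.<ext>'."""
    pattern = re.compile(
        # re.escape() makes characters such as '+' match literally.
        rf"^{re.escape(name)}-(?P<version>.+)-(?P<arch>[^-]+)"
        r"\.pkg\.tar\.(?:xz|zst|gz)$"
    )
    match = pattern.match(filename)
    return match.group("version") if match else None


filename = "dvd+rw-tools-7.1-9-x86_64.pkg.tar.zst"

# Unescaped, "d+" means "one or more 'd'", so the literal '+' never matches.
unescaped_match = re.match("dvd+rw-tools", filename)

version = version_from_filename("dvd+rw-tools", filename)
```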

Updating D7894: Add arch lister module (origins from archives).

Replace the archlinux arm mirror entry value 'mirror.archlinuxarm.org' with 'uk.mirror.archlinuxarm.org'.
It looks like they recently changed the redirection of the main mirror domain, as it was redirecting correctly two days ago.

Build has FAILED

Patch application report for D7894 (id=28786)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 8dab0986fe58ac9cb725a2ee2ac597a4a187c130
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/547/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/547/console

Updating D7894: Add arch lister module (origins from archives).

Fix test data fixtures according to the previous commit (switch the archlinux arm mirror from mirror.archlinuxarm.org to uk.mirror.archlinuxarm.org).

Build is green

Patch application report for D7894 (id=28787)

Rebasing onto 263db667d0...

Current branch diff-target is up to date.
Changes applied before test
commit 1bf11aa26d9274186b927ea431b6b54eda0e9999
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed May 25 14:43:38 2022 +0200

    Add arch lister module (origins from archives).
    
    After a first attempt with D7812 this one use a different strategy to
    retrieve origins.
    
    Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensure that we have a list of "official" packages.
    Parse metadata from 'desc' file to build origins url.
    Scrap the origin url to get artifacts metadata that list all versions of a package.
    
    It also fetch and extract unofficial 'arm' packages from archlinuxarm.org but in this case we can not get all versions of an arm package.
    
    Related T4233

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/548/ for more details.

Running the Archlinux lister in Docker works fine, without any error:

swh@85117fdb1ac4:/$ time swh lister run -l arch

real    50m50.390s
user    9m35.625s
sys     0m32.141s

Listed origins count

swh-scheduler=# select count(*) from listed_origins where visit_type='arch';

count 
-------
31586
(1 row)

swh-scheduler=# select count(*) from listed_origins where visit_type='arch' and url like '%archlinux.org%';
 count 
-------
 12494
(1 row)

swh-scheduler=# select count(*) from listed_origins where visit_type='arch' and url like '%archlinuxarm.org%';
 count 
-------
 19092
(1 row)
This revision is now accepted and ready to land. Jun 16 2022, 4:03 PM