Page MenuHomeSoftware Heritage

nixguix: Deal with edge case url with version instead of extension
ClosedPublic

Authored by ardumont on Oct 25 2022, 5:29 PM.

Details

Summary

Prior to this, some urls were detected as file because their version name were wrongly
detected as extension, hence not matching tarball extensions.

Related to T3781

(Note: Connectivity issues on my side)

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32584
Build 51041: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 51040: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8773 (id=31631)

Rebasing onto 8355fee25f...

First, rewinding head to replay your work on top of it...
Applying: nixguix: Deal with edge case url with version instead of extension
Changes applied before test
commit e165d09ecd16014e00a4d05d3d2a0e99ac725a9b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 25 17:04:29 2022 +0200

    nixguix: Deal with edge case url with version instead of extension
    
    Prior to this, some urls were detected as file because their version name were wrongly
    detected as extension, hence not matching tarball extensions.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/830/ for more details.

swh/lister/nixguix/lister.py
140

probably too open a regexp...

140

opened*

Improve the regexp version detection to be more restrictive.

plus, Nix and Guix already have the package name and version in their metadata, can't we ask them to provide this data to us?

Build is green

Patch application report for D8773 (id=31635)

Rebasing onto 8355fee25f...

First, rewinding head to replay your work on top of it...
Applying: nixguix: Deal with edge case url with version instead of extension
Changes applied before test
commit 4cc46f34fb4a929afc67d560fd10765c94ec1bbd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 25 17:04:29 2022 +0200

    nixguix: Deal with edge case url with version instead of extension
    
    Prior to this, some urls were detected as file because their version name were wrongly
    detected as extension, hence not matching tarball extensions.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/832/ for more details.

plus, Nix and Guix already have the package name and version in their metadata, can't we ask them to provide this data to us?

er actually, it's the file type we need. But couldn't they provide this as well?

I think detecting versions is a lost cause. Here is what I had to do for PyPI:

https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/migrate_extrinsic_metadata.py$234-281

With this and the next diff, the listing will be complete (in guix at least). I've some
local analysis that i've yet to push again (after some more docker runs) tomorrow.

with lots of annoying examples here:

https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/tests/migrate_extrinsic_metadata/test_pypi.py$62-146

Yes, it's complicated and annoying! I'll look in more details tomorrow to your links.

plus, Nix and Guix already have the package name and version in their metadata, can't

we ask them to provide this data to us? er actually, it's the file type we need. But
couldn't they provide this as well?

I'm pretty sure zimoun would not mind adding those indeed. But that will be enough for
Guix but won't be enough for nixpkgs (I don't really like half fixes).

For the nixpkgs part, I'm not sure they will react, as far as the nixpkgs issues I've
opened [1] [2] goes, nothing bulged in 20 days or so ¯\_(ツ)_/¯.

[1] T4608 (upstream issue linked from it)

[2] T4609 (upstream issue linked from it)

anlambert added a subscriber: anlambert.

I am not a big fan either of adding more tarball detection heuristics but after quickly hacking on the code, it seems this is the only way to handle these edge case URLS (plus there is some cases when even analyzing HTTP headers cannot help to detect if the URL targets a tarball P1512). So let's land this and move on in deploying and testing that lister on staging.

This revision is now accepted and ready to land.Oct 26 2022, 11:34 AM

Rebase (commit id mismatch locally and in the diff, don't recall what changed)
-> must be the commit message.

Build is green

Patch application report for D8773 (id=31638)

Rebasing onto 8355fee25f...

Current branch diff-target is up to date.
Changes applied before test
commit 026fea21da49b30fda9c5adbb24c6d7a8c24c8df
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Oct 25 17:04:29 2022 +0200

    nixguix: Deal with edge case url with version instead of extension
    
    Prior to this, some urls were detected as file because their version name were wrongly
    detected as extension, hence not matching tarball extensions.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/835/ for more details.