Page MenuHomeSoftware Heritage

nixguix: Deal with manifest entries without an integrity field
ClosedPublic

Authored by ardumont on Oct 5 2022, 4:01 PM.

Details

Summary

In that case, this fallbacks to use the "outputHash" which is an equivalent field of the
integrity one except it's for "recursive" outputHashMode. This adds the necessary
assertions around this case so correct data is sent to loaders as well.

This got detected by runs in docker again [1].

Related to T3781

[1] Related to P1484

Test Plan

tests + docker runs as usual.

Way less skipped artifacts [1] with this compared to before [2]:

[1]

swh-lister_1                         | [2022-10-05 14:14:03,880: WARNING/ForkPoolWorker-1] Cannot detect extension for <https://sources.debian.net/data/main/g/gpsbabel/1.5.3-2/debian/patches/use_minizip>. Fallback to http head query
swh-lister_1                         | [2022-10-05 14:14:03,984: WARNING/ForkPoolWorker-1] Still cannot detect extension through location <https://sources.debian.net/data/main/g/gpsbabel/1.5.3-2/debian/patches/use_minizip>...
swh-lister_1                         | [2022-10-05 14:14:20,777: WARNING/ForkPoolWorker-1] Cannot detect extension for <https://github.com/mpeterv/luazip>. Fallback to http head query
swh-lister_1                         | [2022-10-05 14:14:21,669: WARNING/ForkPoolWorker-1] url <https://github.com/mpeterv/luazip>: detected as 'file' with 'recursive' outputHashMode <{'outputHash': '1jlqzqlds3aa3hnp737fm2awcx0hzmwyd87klv0cv13ny5v9f2x4', 'outputHashAlgo': 'sha256', 'outputHashMode': 'recursive', 'type': 'url', 'urls': ['https://github.com/mpeterv/luazip'], 'integrity': 'sha256-pAuXdvF2hM3ApvOg5nn9EHTGlajujHMtHEoN3Sj+mMo=', 'inferredFetcher': 'unclassified'}>
swh-lister_1                         | [2022-10-05 14:14:21,942: ERROR/ForkPoolWorker-1] Skipping url: <https://github.com/stedolan/jq/archive/jq-1.6.tar.gz>: integrity computation failure for <{'outputHash': 'sha256-CIE8vumQPGK+TFAncmpBijANpFALLTadOvkob0gVzro', 'outputHashAlgo': None, 'outputHashMode': 'recursive', 'type': 'url', 'urls': ['https://github.com/stedolan/jq/archive/jq-1.6.tar.gz'], 'inferredFetcher': 'fetchzip'}>
swh-lister_1                         | Traceback (most recent call last):
swh-lister_1                         |   File "/tmp/tmp.9bwmNMLi9T/swh-lister/swh/lister/nixguix/lister.py", line 412, in get_pages
swh-lister_1                         |     chksum_algo: base64.decodebytes(chksum_b64.encode()).hex()
swh-lister_1                         |   File "/usr/local/lib/python3.7/base64.py", line 546, in decodebytes
swh-lister_1                         |     return binascii.a2b_base64(s)
swh-lister_1                         | binascii.Error: Incorrect padding
swh-lister_1                         | [2022-10-05 14:14:56,954: ERROR/ForkPoolWorker-1] Skipping url: <https://github.com/figiel/hosts/archive/v1.0.0.tar.gz>: integrity computation failure for <{'outputHash': 'sha256-9uF0fYl4Zz/Ia2UKx7CBi8ZU8jfWoBfy2QSgTSwXo5A', 'outputHashAlgo': None, 'outputHashMode': 'recursive', 'type': 'url', 'urls': ['https://github.com/figiel/hosts/archive/v1.0.0.tar.gz'], 'inferredFetcher': 'fetchzip'}>
swh-lister_1                         | Traceback (most recent call last):
swh-lister_1                         |   File "/tmp/tmp.9bwmNMLi9T/swh-lister/swh/lister/nixguix/lister.py", line 412, in get_pages
swh-lister_1                         |     chksum_algo: base64.decodebytes(chksum_b64.encode()).hex()
swh-lister_1                         |   File "/usr/local/lib/python3.7/base64.py", line 546, in decodebytes
swh-lister_1                         |     return binascii.a2b_base64(s)
swh-lister_1                         | binascii.Error: Incorrect padding
swh-lister_1                         | [2022-10-05 14:15:09,152: WARNING/ForkPoolWorker-1] Cannot detect extension for <http://git.marmaro.de/?p=mmh;a=snapshot;h=431604647f89d5aac7b199a7883e98e56e4ccf9e;sf=tgz>. Fallback to http head query
swh-lister_1                         | [2022-10-05 14:16:05,260: INFO/ForkPoolWorker-1] Task swh.lister.nixguix.tasks.NixGuixListerTask[49b5c683-a124-4e6b-8ac5-3d27b9831f02] succeeded in 165.84545156091917s: {'pages': 31563, 'origins': 31494}

[2] P1485

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8631 (id=31156)

Rebasing onto f2377c283a...

Current branch diff-target is up to date.
Changes applied before test
commit fdac840688587eb4d8d242db0a283cb33aaf5773
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 5 15:58:50 2022 +0200

    nixguix: Deal with manifest entries without an integrity field
    
    In that case, this fallbacks to use the "outputHash" which is an equivalent field of the
    integrity one except it's for "recursive" outputHashMode. This adds the necessary
    assertions around this case so correct data is sent to loaders as well.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/762/ for more details.

isn't this missing a new assertion in tests?

isn't this missing a new assertion in tests?

It should not.
I've added 2 origins, one for which it will be caught and skipped (all errors are added in the guix manifest to simplify writing tests).
It's a real outputHash coming from nixpkgs manifest that fails the decoding step for unknown reason.

Another added in the nixpkgs one which is missing the integrity field but provide an outputHash which works (and match the nar).

That should be enough.

I've dropped a wrong assertion i made in the first implementation of the diff (tested with docker which showed me i was wrong).

Build is green

Patch application report for D8631 (id=31157)

Rebasing onto f2377c283a...

Current branch diff-target is up to date.
Changes applied before test
commit 2e6e282d4464da4668eaf0f99fad8139a5d9a653
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Oct 5 15:58:50 2022 +0200

    nixguix: Deal with manifest entries without an integrity field
    
    In that case, this fallbacks to use the "outputHash" which is an equivalent field of the
    integrity one except it's for "recursive" outputHashMode. This adds the necessary
    assertions around this case so correct data is sent to loaders as well.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/763/ for more details.

anlambert added inline comments.
swh/lister/nixguix/tests/data/guix-swh_sources.json
30–39

Only nix manifests have such fields no ?

I guess this one fails to be parsed due to the missing = char at the end of the base64 encoded hash.

You should move it to nixpkgs-swh_sources.json and maybe change the hash value to something
clearer about the fact it is invalid to decode.

swh/lister/nixguix/tests/data/guix-swh_sources.json
30–39

i've kept all the issues in this manifest (whether that's coming from nix or guix is not important since it's dealt with the same in the lister).
And i'd like to keep that way.

I've got 2 tests one nominal (using the nixpkgs manifest), another with "mostly noop" (be it error or not).
The fact that i used guix as this erratic one is completely unfortunate (as it did not represent the reality at all...).

swh/lister/nixguix/tests/data/guix-swh_sources.json
30–39

Ack, maybe you could rename the data file to clearly indicate it contains data with issues in it ?

This revision is now accepted and ready to land.Oct 5 2022, 4:36 PM
swh/lister/nixguix/tests/data/guix-swh_sources.json
30–39

yeah, sounds like a plan (for another diff), thanks for the idea.

ardumont added inline comments.
swh/lister/nixguix/tests/data/guix-swh_sources.json
30–39

In there D8632