Paste P1489

File with "recursive" outputHashAlgo: cannot do anything about it (missing information in manifest, strange fs layout manipulation)

Authored by ardumont on Oct 6 2022, 10:43 AM.
cat /var/tmp/sources-unstable-full.json | jq . | grep -C6 'https://www.unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt'
{
"outputHash": "0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522",
"outputHashAlgo": "sha256",
"outputHashMode": "recursive",
"type": "url",
"urls": [
"https://www.unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt"
],
"integrity": "sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg=",
"inferredFetcher": "unclassified"
},

Event Timeline

$ file emoji-zwj-sequences.txt
emoji-zwj-sequences.txt: UTF-8 Unicode text

$ ipython
In [1]: import base64

In [2]:  integrity = "sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg="; base64.decodebytes(integrity.split("-")[1].encode()).hex()
Out[2]: '42144dd131d9eee2338764657452727e074fd1d6b4bb9033e8618b6c83df5568'

In [3]:  integrity = "sha256-0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522"; base64.decodebytes(integrity.split(
   ...: "-")[1].encode()).hex()
Out[3]: 'd2cda6bf2d67af6bf5c74aebd5fc65b2ff25cb5bf27fdf7bf2b6f8870af2e6f9ebebbf25b39db6'

$ nix-store --dump emoji-zwj-sequences.txt | sha256sum
8e4da5a445465874d79bd980411320f71385927ff7d767e69ef4ecdf369bafc9  -

$  sha256sum emoji-zwj-sequences.txt
98ff05deef36f30bb16d92f1e470f277d412d8f047c7b4b47943bfcbcf0b3097  emoji-zwj-sequences.txt
$ nix-hash --type sha256 --to-base32 $(python3 -c 'import base64; print(base64.b64decode("QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg=").hex())')
0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522

Awesome, thx. That means that in that case, the outputHash (base32) and the integrity field (base64) match!

What remains is for me to understand how to verify that checksum against the downloaded file, though...
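The base64-to-base32 conversion done with nix-hash above can also be sketched in pure Python. This is a minimal sketch, assuming the Nix base32 scheme as I understand it (a 32-character alphabet without e, o, t and u, with 5-bit groups read from the end of the digest); the helper names nix_base32 and sri_to_nix_base32 are made up for illustration, and nix-hash remains the reference.

```python
import base64

# Nix's base32 alphabet (assumption: digits plus lowercase letters minus e, o, t, u).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def nix_base32(digest: bytes) -> str:
    """Encode raw digest bytes the way `nix-hash --to-base32` prints them."""
    length = (len(digest) * 8 - 1) // 5 + 1
    chars = []
    # Walk the 5-bit groups from the highest bit offset down to offset 0.
    for n in range(length - 1, -1, -1):
        i, j = divmod(n * 5, 8)
        c = digest[i] >> j
        if i + 1 < len(digest):
            # The group may straddle two bytes; pull in the low bits of the next one.
            c |= digest[i + 1] << (8 - j)
        chars.append(NIX_BASE32_ALPHABET[c & 0x1F])
    return "".join(chars)


def sri_to_nix_base32(integrity: str) -> str:
    """Convert an SRI-style value such as 'sha256-<base64>' to Nix base32."""
    _algo, _, b64 = integrity.partition("-")
    return nix_base32(base64.b64decode(b64))
```

Called on the integrity value from the manifest entry, sri_to_nix_base32("sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg=") should give back the outputHash shown in the paste, 0s2mvy1nr2v1x0rr1fxlsv8ly1vyf9978rb4hwry5vnr678ls522.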

That still does not match ¯\_(ツ)_/¯:

$ nix-store --dump emoji-zwj-sequences.txt | sha256sum
8e4da5a445465874d79bd980411320f71385927ff7d767e69ef4ecdf369bafc9  -
$ nix-hash --type sha256 emoji-zwj-sequences.txt
8e4da5a445465874d79bd980411320f71385927ff7d767e69ef4ecdf369bafc9

$ ipython
...
In [2]:  integrity = "sha256-QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg="; base64.decodebytes(integrity.split("-")[1].encode()).hex()
Out[2]: '42144dd131d9eee2338764657452727e074fd1d6b4bb9033e8618b6c83df5568'

I figured it out. Reading the source of the derivation, I noticed that the file is downloaded into share/unicode/emoji. So if I reproduce this FS layout before turning it into a NAR, I can reproduce the hash in the manifest:

$ mkdir -p foobar/share/unicode/emoji
$ wget https://www.unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt -O foobar/share/unicode/emoji/emoji-zwj-sequences.txt -q
$ nix-store --dump foobar | sha256sum
42144dd131d9eee2338764657452727e074fd1d6b4bb9033e8618b6c83df5568  -
$ python3
>>> import base64
>>> base64.b64encode(bytes.fromhex("42144dd131d9eee2338764657452727e074fd1d6b4bb9033e8618b6c83df5568"))
b'QhRN0THZ7uIzh2RldFJyfgdP0da0u5Az6GGLbIPfVWg='
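To see why the directory layout changes the hash, here is a minimal Python sketch of the NAR serialisation that nix-store --dump produces, as I understand the format: every token is a length-prefixed string padded to 8 bytes, and directory entries are written in sorted order with their names. The function names nar_dump and nar_sha256 are hypothetical, the sketch reads whole files into memory, and nix-store --dump remains the authoritative implementation.

```python
import hashlib
import os
import stat


def _nar_str(data: bytes) -> bytes:
    """One NAR string: 64-bit little-endian length, the bytes, zero padding to 8."""
    padding = (8 - len(data) % 8) % 8
    return len(data).to_bytes(8, "little") + data + b"\0" * padding


def _nar_node(path: bytes) -> bytes:
    """Serialise a regular file, symlink or directory (recursively)."""
    mode = os.lstat(path).st_mode
    out = _nar_str(b"(") + _nar_str(b"type")
    if stat.S_ISLNK(mode):
        out += _nar_str(b"symlink") + _nar_str(b"target") + _nar_str(os.readlink(path))
    elif stat.S_ISDIR(mode):
        out += _nar_str(b"directory")
        # Entry names are part of the archive, so the layout affects the hash.
        for name in sorted(os.listdir(path)):
            out += _nar_str(b"entry") + _nar_str(b"(")
            out += _nar_str(b"name") + _nar_str(name)
            out += _nar_str(b"node") + _nar_node(os.path.join(path, name))
            out += _nar_str(b")")
    elif stat.S_ISREG(mode):
        out += _nar_str(b"regular")
        if mode & stat.S_IXUSR:
            out += _nar_str(b"executable") + _nar_str(b"")
        with open(path, "rb") as f:
            out += _nar_str(b"contents") + _nar_str(f.read())
    else:
        raise ValueError(f"unsupported file type: {path!r}")
    return out + _nar_str(b")")


def nar_dump(path) -> bytes:
    """Return the NAR bytes for a path, like `nix-store --dump path`."""
    return _nar_str(b"nix-archive-1") + _nar_node(os.fsencode(path))


def nar_sha256(path) -> str:
    """Hex sha256 of the NAR, like `nix-store --dump path | sha256sum`."""
    return hashlib.sha256(nar_dump(path)).hexdigest()
```

With this sketch, nar_sha256("emoji-zwj-sequences.txt") and nar_sha256("foobar") hash different byte streams even though the file content is identical, because the second archive also encodes the share/unicode/emoji directory entries around the file.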

Nice catch! I did not notice the fs layout manipulation when reading this derivation.

So, in conclusion: since the nixpkgs manifest does not give us the fs layout, we cannot do anything about this case.
The loader will simply report a hash mismatch for it, and ingestion of the origin will fail.

So we must notify upstream about the information missing from the manifest for those origins, and exclude those origins from the listing in the meantime [1]
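As an illustration of what excluding those origins could look like, here is a minimal sketch that splits manifest entries using only the fields visible in the entry above. The selection rule (type "url" combined with outputHashMode "recursive") is my assumption and is deliberately coarse: it may also flag entries whose recursive hash can in fact be reproduced. The function names are hypothetical and this is not the lister's actual code.

```python
def needs_fs_layout(entry: dict) -> bool:
    """Assumed heuristic: a plain "url" entry whose hash is computed in
    "recursive" (NAR) mode, so it cannot be checked from the file alone."""
    return entry.get("type") == "url" and entry.get("outputHashMode") == "recursive"


def split_entries(entries):
    """Return (kept, skipped) lists of manifest entries."""
    kept, skipped = [], []
    for entry in entries:
        (skipped if needs_fs_layout(entry) else kept).append(entry)
    return kept, skipped
```

The skipped list could then be used both to exclude those origins from the listing and to build the report for upstream.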

[1] T4608

ardumont changed the title of this paste from File with "recursive" outputHashAlgo to File with "recursive" outputHashAlgo: cannot do anything about it (missing information in manifest, strange fs layout manipulation). Oct 6 2022, 2:26 PM