Page MenuHomeSoftware Heritage

Consider archiving NAR hashes
Open, NormalPublic

Description

(Brain dump after a discussion with @zimoun at the 10 Years of Guix Event)

Guix (and Nix?) manifests store a checksum of the source/input files downloaded by the manifest. For tarballs, this is a hash of the tarball. For Git repo, it's the commit or tree hash (IIRC). For other VCSs without their own intrinsic hashes (Subversion, CVS, ...), Guix hashes the directory tree of the tip commit by formatting them in the NAR (Nix Archive) format.

We should investigate this format. If it is simple enough, we should be able to recompute this NAR manifest and store its hash as ExtID so it is available to Guix (and Nix?) via the vault.

Event Timeline

vlorentz triaged this task as Normal priority.Sep 18 2022, 6:54 PM
vlorentz created this task.

yes, nix too. As far as i remember, it's NAR for Nix ARchive ;)

Hi,

Thanks for raising the questions.

One of the issue is to lookup (key)----considering the information we have at package time. Today, the only intrinsic information that Guix always keeps is the integrity (computed by NAR serializer and hashed by SHA-256). For tarballs (url-fetch), Disarchive somehow provides a dictionnary from this Guix integrity to SWH.

For Git (git-fetch), it is less clear. I would like to store the Git commit hash at package time in Guix. But there is no consensus inside the community. Well, in this case, Guix fetches back from SWH using the URL + Git tag.

For other VCSs, nothing is implemented. For instance, IIUC about Subversion [1], the way to fallback would be to use snapshots which are not implemented yet.

Well, you will tell me to store swhid at package time. :-) Maybe, but it means that 20k+ packages should be updated. Perhaps, instead, a map from various integrity computations to swhid (something similar to Disarchive, somehow), could help; especially for other VCSs as Subversion, CVS, Bzr, etc. At least, NAR would help, sure! :-)

1: https://issues.guix.gnu.org/issue/43442#11

Related to P1473.

Guix provides guix hash where the option named --recursive is confusing (at least confused me :-)). Considering a random Git repository, say this one, then,

$ git clone https://gitlab.inria.fr/guix-hpc/guix-modules
$ git -C  guix-modules checkout v0.1.0
$ guix hash -S nar -f hex -H sha256 guix-modules
75b5d7448bffb10ecf367e424c0941e059bab1aea0036c5c45e6b337db0e097

and the naive Python equivalent -- closely following Nix ARchive format described in here p.100 and following (or p.92 depending how you are counting ;-) well around section 5.2.2) -- looks like:

import os
import stat
import hashlib
from pathlib import Path

def str_(thing):
    # name 'str' in Figure 5.2 p.93 (page 101 of pdf)
    if isinstance(thing, str):
        byte_sequence = thing.encode()
    else:
        byte_sequence = thing
    l = len(byte_sequence)
    blen = l.to_bytes(8, byteorder='little') # 64-bit little endian
    m = l % 8
    if m == 0:
        offset = 0
    else:
        offset = 8 - m
    return blen + byte_sequence + bytearray(offset)

def serialise_prime_prime(fso):
    mode = os.lstat(fso).st_mode

    if stat.S_ISREG(mode):
        bstr = str_("type") + str_("regular")
        contents = b''
        with open(str(fso), "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                contents += chunk
        if os.access(fso, os.X_OK):
            bstr += str_("executable") + str_("")
        return bstr + str_("contents") + str_(contents)

    elif stat.S_ISLNK(mode):
        bstr = str_("type") + str_("symlink")
        target = os.readlink(fso).encode()
        return bstr + str_("target") + str_(target)

    elif stat.S_ISDIR(mode):
        bstr = str_("type") + str_("directory")
        for path in sorted(Path(fso).iterdir()):
            bstr += serialiseEntry(path.name.encode(), path)
        return bstr

    else:
        raise ValueError("unsupported file type")

def serialise_prime(fso):
    return str_("(") + serialise_prime_prime(fso) + str_(")")

def serialiseEntry(name, fso):
    return str_("entry") + str_("(")         \
        + str_("name") + str_(name)           \
        + str_("node") + serialise_prime(fso) \
        + str_(")")

def serialise(fso):
    return str_("nix-archive-1") + serialise_prime(fso)

# Entry point
def nar_hash(thing):
    h = hashlib.sha256()
    s = serialise(thing)
    h.update(s)
    return h.hexdigest()
Python 3.9.9 (main, Jan  1 1970, 00:00:01) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %run narhasher.py

In [2]: nar_hash("guix-modules")
Out[2]: '75b5d7448bffb10ecf367e424c0941e059bab1aea0036c5c45e6b337db0e097d'

Guix does not encode using hex but nix-base32 but it does not really matter here.

Hope that helps.

To be complete, the integrity checksum that Guix provides (here one example about the package guix-modules),

$ git clone https://gitlab.inria.fr/guix-hpc/guix-modules
$ git -C  guix-modules checkout v0.1.0
$ guix hash -S nar -f nix-base32 -H sha256 -x guix-modules
1ckvrrmkgzz93i35sj1372wxs7ln4gzszpri1pcdf473z0p7nh7w

which is the same as in the link above . The .git directory is excluded in the serialization.

However, note that Guix does not consider NAR serializer as integrity when it is about tarball. For example, consider the tarball of the package hello, then the checksum is 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd

$ guix hash -S none -f nix-base32 -H sha256 hello-2.12.1.tar.gz
086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd

Therefore, if instead nix-base32 encoding, we pick hex, it reads,

$ guix hash -S none -f hex -H sha256 hello-2.12.1.tar.gz
8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20
$ sha256sum hello-2.12.1.tar.gz
8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20  hello-2.12.1.tar.gz

This tree

$ tree foo
foo
├── bar
│   └── exe
└── baz

1 directory, 2 files

serializes as

nix-archive-1
  (
  type
  directory
    entry
    (
    name
    bar
    node
      (
      type
      directory
        entry
        (
        name
        exe
        node
          (
          type
          regular
          executable
          
          contents
          <_io.BufferedReader name='foo/bar/exe'>
          )
        )
      )
    )
    entry
    (
    name
    baz
    node
      (
      type
      regular
      contents
      <_io.BufferedReader name='foo/baz'>
      )
    )
  )

where the files baz and exe are empty and chmod u+x exe. The sha1 in hex format is a81bdeb702409399f83ba91cf95cd6bcab800ad7. Note the indentation is only added for readibility.

Now, let consider a large directory, with many Git repository including Guix or Emacs ones.

$ find ~/src/ -type f -print | wc -l
91419

$ find ~/src/ -type d -print | wc -l
5218

$ du -sh ~/src/
11G	/home/simon/src/

and compare guix hash and some Python implementation,

$ time guix hash -S nar -H sha256 -f base64 ~/src/
DkayOdU4GtQcVjpkkyuFro4uHovIkx7GXVNuNrTZeLg=

real	0m27,476s
user	0m36,002s
sys	0m2,279s

$ time python3.9 narserializer.py ~/src | sha256sum | (xxd -r -p | base64)
DkayOdU4GtQcVjpkkyuFro4uHovIkx7GXVNuNrTZeLg=

real	0m45,190s
user	0m49,130s
sys	0m5,906s

Only the serialization part takes,

$ time python3.9 narserializer.py ~/src > /dev/null

real	0m5,328s
user	0m3,523s
sys	0m1,797s

Since we are interested in the final hash, let compute hash when walking the tree,

$ time python3.9 narserializer.py ~/src sha256 base64
DkayOdU4GtQcVjpkkyuFro4uHovIkx7GXVNuNrTZeLg=

real	0m25,883s
user	0m24,293s
sys	0m1,584s

Profiling this Python implementation, it reads,

      9/1    0.000    0.000   28.874   28.874 {built-in method builtins.exec}
        1    0.000    0.000   28.874   28.874 narserializer.py:1(<module>)
        1    0.000    0.000   28.869   28.869 narserializer.py:92(serialize)
  97997/1    0.463    0.000   28.869   28.869 narserializer.py:56(_serialize)
 97996/10    0.182    0.000   28.869    2.887 narserializer.py:85(_serializeEntry)
1366550/675860    2.008    0.000   25.675    0.000 narserializer.py:13(str_)
  3663342   21.344    0.000   21.344    0.000 {method 'update' of '_hashlib.HASH' objects}
   339543    0.104    0.000    1.542    0.000 narserializer.py:43(<lambda>)
   339551    1.438    0.000    1.438    0.000 {method 'read' of '_io.BufferedReader' objects}
     5218    0.098    0.000    0.789    0.000 {built-in method builtins.sorted}
    97997    0.302    0.000    0.626    0.000 {built-in method posix.lstat}
    91419    0.524    0.000    0.524    0.000 {built-in method io.open}
   688892    0.287    0.000    0.465    0.000 pathlib.py:796(__lt__)

where most of the time is spent for hashing.

import os, stat, hashlib, io
from pathlib import Path

import sys, base64

class Nar:
    def __init__(self, updater, isdebug=False):
        self._update = updater

        self.__isdebug = isdebug
        self.__indent = 0

    def str_(self, thing):
    # named 'str' in Figure 5.2 p.93 (page 101 of pdf)

        if self.__isdebug \
           and (isinstance(thing, str) or isinstance(thing, io.BufferedReader)):
            indent = "".join(["  " for _ in range(self.__indent)])
            print(indent + str(thing))

        # named 'int'
        if isinstance(thing, str):
            byte_sequence = thing.encode('utf-8')
            l = len(byte_sequence)
        elif isinstance(thing, io.BufferedReader):
            l = os.stat(thing.name).st_size

        # ease reading of _serialize
        elif isinstance(thing, list):
            for stuff in thing:
                self.str_(stuff)
            return
        else:
            raise ValueError("not string nor file")

        blen = l.to_bytes(8, byteorder='little') # 64-bit little endian
        self._update(blen)

        # first part of 'pad'
        if isinstance(thing, str):
            self._update(byte_sequence)
        elif isinstance(thing, io.BufferedReader):
            for chunk in iter(lambda: thing.read(2*2*2*2*4096), b""):
                self._update(chunk)

        # second part of 'pad
        m = l % 8
        if m == 0:
            offset = 0
        else:
            offset = 8 - m
        boffset = bytearray(offset)
        self._update(boffset)


    def _serialize(self, fso):
        if self.__isdebug: self.__indent += 1
        self.str_("(")

        mode = os.lstat(fso).st_mode

        if stat.S_ISREG(mode):
            self.str_(["type", "regular"])
            if os.access(fso, os.X_OK):
                self.str_(["executable", ""])
            self.str_("contents")
            with open(str(fso), "rb") as f:
                self.str_(f)

        elif stat.S_ISLNK(mode):
            self.str_(["type", "symlink", "target"])
            self.str_(os.readlink(fso))

        elif stat.S_ISDIR(mode):
            self.str_(["type", "directory"])
            for path in sorted(Path(fso).iterdir()):
                self._serializeEntry(path)

        else:
            raise ValueError("unsupported file type")

        self.str_(")")
        if self.__isdebug: self.__indent += -1

    def _serializeEntry(self, fso):
        if self.__isdebug: self.__indent += 1
        self.str_(["entry", "(", "name", fso.name, "node"])
        self._serialize(fso)
        self.str_(")")
        if self.__isdebug: self.__indent += -1

    def serialize(self, fso):
        self.str_("nix-archive-1")
        self._serialize(fso)
        return


if __name__ == "__main__":
    directory = sys.argv[1]
    try:
        if sys.argv[2] == "sha1":
            h = hashlib.sha1()
        else:
            h = hashlib.sha256()
        updater = h.update
        try:
            if sys.argv[3] == "hex":
                convert = lambda hsh: hsh
            elif sys.argv[3] == "base64": # hex -> base64
                convert = lambda hsh: base64.b64encode(bytes.fromhex(hsh)).decode()
            elif sys.argv[3] == "BASE32":
                convert = lambda hsh: base64.b32encode(bytes.fromhex(hsh)).decode()
            elif sys.argv[3] == "base32":
                convert = lambda hsh: base64.b32encode(bytes.fromhex(hsh)).decode().lower()
            else:
                convert = lambda hsh: hsh
        except:
            convert = lambda hsh: hsh
        isPrintable = True
    except:
        updater = sys.stdout.buffer.write
        isPrintable = False

    d = os.environ.get('DEBUG')
    if d and d != "no":
        debug = True
    else:
        debug = False
    nar = Nar(updater, debug)
    nar.serialize(directory)
    if isPrintable:
        print(convert(h.hexdigest()))

fwiw, I've iterated a bit over @zimoun's code and pushed it into the snippets repository
(see commits above and their commit description message). It's also able to deal with
git, hg and svn trees (ignoring their respective top metadata folder .git, .svn, ...
without impacting the performance).

The algorithm seems simple enough.

We may want to adapt, industrialize and push this within swh-model to expose functions
so loaders could compute and store that in the extid part (as per the description need
above). That could also avoid depending on nix (or guix) binaries for the deployment of
the new {Content|Directory}Loader [1] [2].

The only caveat I see in the current implementation is that the algorithm is doing its
own recursion. So if loaders would want to compute it, they'd walk again the
arborescence tree a second time (one round for hashing our intrinsic identifiers,
another to compute this one).

This may not be too much of a performance hit for the new {Content|Directory}Loader [1]
[2] but that may be a blocker for vcs repository loaders though... (some can take their
time depending on the origin to ingest).

Another way would be to integrate within one of the swh.model modules that deals with
merkle/fs structures. I don't think plugging this directly inside
swh.model.hashutil.MultiHash is possible. But maybe inside one of the modules
swh.model.merkle or swh.model.from_disk? But would that help?

@vlorentz @olasd What do you think?

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/core/loader.py$757-776

[2] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/core/loader.py$897-917

Hi,

Oh awesome! Thanks for pushing forward. :-)

The only caveat I see in the current implementation is that the algorithm is doing its
own recursion. So if loaders would want to compute it, they'd walk again the
arborescence tree a second time (one round for hashing our intrinsic identifiers,
another to compute this one).

Where your intrinsic identifiers are they computed? Because one solution is to walk the arborescence and feed various serializers (swhid, git, nar, etc.). Just to note that maybe in the future you would like support other intrinsic serializers. Well, I do not know, maybe it is a premature optimization. :-)

Oh awesome! Thanks for pushing forward. :-)

;)

The only caveat I see in the current implementation is that the algorithm is doing its
own recursion. So if loaders would want to compute it, they'd walk again the
arborescence tree a second time (one round for hashing our intrinsic identifiers,
another to compute this one).

Where your intrinsic identifiers are they computed?

A loader is really a mapping "function" of reading the origin (vcs, tarball, package,
...) to swh.model.model Content, Directory, Revision, Release and Snapshot model objects
[1]. Those integrates intrinsic id computation method. See for example, git [2], hg [3].
For package loader, we directly walk the filesystem through
swh.model.from_disk.Directory.from_disk method [4], hence my previous question quoted
below.

Another way would be to integrate within one of the swh.model modules that deals with
merkle/fs structures. I don't think plugging this directly inside
swh.model.hashutil.MultiHash is possible. But maybe inside one of the modules
swh.model.merkle or swh.model.from_disk?

[1] https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/model.py

[2] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/loader.py$392-425

[3] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/loader.py$392-425

[4] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/loader.py$829-832

Because one solution is to walk the arborescence and feed various serializers (swhid,
git, nar, etc.).

Indeed, that's already the case in our MultiHash module [5] (which our model objects
in-fine depends upon). But so far, the hashes in question were "flat" hash algorithm
(without any extra mapping as the NAR defines).

[5] https://forge.softwareheritage.org/source/swh-model/browse/master/swh/model/hashutil.py$6-52

Just to note that maybe in the future you would like support other intrinsic
serializers.

While we may be willing to do the extra effort for the NAR format to ease our
integration with guix/nix, I don't think that's the way forward for us though. We'd like
various package managers to start using standardized intrinsic identifiers hence our
effort on standardizing the SWHID format (which everybody can compute without the swh
stack).

Well, I do not know, maybe it is a premature optimization. :-)

possibly ;)