
Replace the Nixguix loader with a lister
Open, Normal, Public

Description

Currently, Nix and Guix are each loaded as a single origin with a huge snapshot in which each branch name is a URL; this is wrong.
We need to replace the Nixguix loader with a lister, which would create as many origins as are referenced by the Nix and Guix public manifests.
This would be closer to what we do with Debian/Ubuntu.

Define the following (see the hedgedoc [1], which details a proposal):

  • a sketch of the target structure of the data in the archive
  • what the origin URLs are
  • what kind of extrinsic metadata and/or extids we are storing
  • what kinds of snapshots we are generating

Plan:

  • D8341: Implement lister
  • D8406, ...: Adapt archive loader (package loader) to accept tarballs from nixguix manifests (cannot work [2])
  • D8581: Implement ContentLoader (possibly as a package [2] core loader) to deal with content files with intrinsic metadata (out of nixguix manifests)
  • D8584: Implement DirectoryLoader (possibly as a package [2] core loader) to deal with tarballs with intrinsic metadata (out of nixguix manifests)
  • Update the implementations above to deal with the unsupported integrity hash (sha512)
  • Run through docker
  • Deploy in staging
  • Call for public review
  • Deploy in production when the above is ok

[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw

[2] It cannot: we may not have any versions at all, and package loaders currently rely on that particular data for their main ingestion algorithm.
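
For context on [2]: the package loader's core ingestion loop is driven by an enumerable list of versions, which nixguix artifacts do not have. A rough paraphrase of that loop (class and method names are made up for this sketch, not the real swh.loader API):

from typing import Dict, Iterator, Tuple

class PackageLoaderSketch:
    # Illustrative paraphrase only; NOT the real swh.loader API.

    def get_versions(self) -> Iterator[str]:
        # Real package loaders (PyPI, Debian, ...) enumerate versions here.
        # A nixguix manifest entry is a bare artifact url with no version
        # list, so there is nothing meaningful to return.
        return iter(())

    def get_package_info(self, version: str) -> Iterator[Tuple[str, Dict]]:
        # Hypothetical branch name / artifact info pair for one version.
        yield f"releases/{version}", {"url": "..."}

    def load(self) -> None:
        # The whole algorithm pivots on versions, which is exactly what
        # nixguix artifacts lack.
        for version in self.get_versions():
            for branch_name, info in self.get_package_info(version):
                print("would ingest", branch_name, info)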

Event Timeline

vlorentz triaged this task as Normal priority. Dec 8 2021, 6:45 PM
vlorentz created this task.

Not saying no.

It also feels weird from the scheduler standpoint: today, it is definitely configured
like a lister (a recurring task, the old way).

It was initially discussed in the bootstrap diff [1] and was dismissed for some
reason I don't remember (one would have to dig into that diff/task discussion).

As that diff predates the scheduler/lister refactoring we started this year, it may
indeed be interesting to revisit this idea.

[1] T1991 D2025

Maybe another data point for the discussion: the nixguix loader currently only
shows 1 origin for Guix and 1 for NixOS [well, nixpkgs really] in the coverage part [1].
Which is somewhat true... but feels weird at the same time.

[1] https://archive.softwareheritage.org/

I'm growing fond of this idea.
It should take less time to refactor now that we have improved the lister scaffolding
and mostly know what the perimeter of the nixguix loader is.

As a note, we may have to write a "content" loader, as the manifests (currently read by
the loader) also reference single files (which are ignored/bypassed by the current
loader implementation).

Another argument: currently, there are always at least some failures when loading the real Nix and Guix repositories, so visits always have status "partial", which prevents them from being listed in https://archive.softwareheritage.org/browse/search/?q=&with_visit=true&with_content=true&visit_type=nixguix (but we do get results when unchecking "only show origins visited at least once")

(not that these origins are particularly useful anyway, which is also the point of this task)

indeed

So, taking a closer look at this possible new lister, we'd end up with the following
possible outputs (a dispatch sketch follows the list):

  • artifact URLs, which are mostly tarballs [1] and sometimes single files [2]
  • DVCS repositories, delegated to dedicated loaders for ingestion: svn [3], hg [4], git [5] (out of the Guix manifest)
  • other stuff, which can be ignored as we don't have anything relevant to ingest [6]
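
To make the dispatch concrete, here is a minimal sketch of how the lister could walk a manifest and route entries. It assumes the sources.json layouts quoted in [1]-[6] below; the visit-type strings are illustrative, not a settled design:

import requests

# Minimal dispatch sketch; assumes the sources.json layouts quoted in
# [1]-[6] below. Visit-type strings are illustrative only.
VCS_URL_KEYS = {"git": "git_url", "hg": "hg_url", "svn": "svn_url"}

def list_origins(manifest_url):
    sources = requests.get(manifest_url).json()["sources"]
    for entry in sources:
        entry_type = entry.get("type")
        if entry_type == "url" and entry.get("urls"):
            # Tarballs and single files [1][2]: one origin per artifact.
            yield entry["urls"][0], "nixguix-artifact", entry
        elif entry_type in VCS_URL_KEYS:
            # Delegate to the dedicated dvcs loaders [3][4][5].
            yield entry[VCS_URL_KEYS[entry_type]], entry_type, entry
        # 'no-origin', false, etc. are skipped: nothing to ingest [6].

# e.g.: for url, visit_type, extra in list_origins("https://guix.gnu.org/sources.json"): ...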

Regarding URLs, what do we do? Do we create as many origins as there are URLs (after an
existence check on each)? I do not see a proper way around it.

Note also that going forward, the most probable outcome regarding the nixguix loader
would be that it disappears altogether.

Thoughts?

[1]

{
  "outputHash": "1rzz7yhqq3lljyqxbg46jfzfd09qgpgx865lijr4sgc94riy1ypn",
  "outputHashAlgo": "sha256",
  "outputHashMode": "recursive",
  "type": "url",
  "urls": [
    "https://downloads.sourceforge.net/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://prdownloads.sourceforge.net/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://netcologne.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://versaweb.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://freefr.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://osdn.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz",
    "https://kent.dl.sourceforge.net/sourceforge/cm-unicode/cm-unicode/0.7.0/cm-unicode-0.7.0-otf.tar.xz"
  ],
  "integrity": "sha256-9vrgYyaJPU2yjLQY1N99OIHmvpOGvNWxl5QOjKE//+c=",
  "inferredFetcher": "unclassified"
},
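
As an aside, the integrity field above is an SRI hash: the algorithm name, a dash, then the base64-encoded digest. For flat (non-recursive) hashes, checking a downloaded artifact reduces to hashing its raw bytes; entries with "outputHashMode": "recursive" are hashed over Nix's NAR serialization instead, so the sketch below only covers the flat case:

import base64
import hashlib

def check_sri(data: bytes, integrity: str) -> bool:
    # SRI format: "<algo>-<base64 digest>", e.g. the sha256-... value above.
    # Only valid for flat hashes; "recursive" entries hash the NAR
    # serialization, not the raw bytes.
    algo, _, expected = integrity.partition("-")
    if algo not in ("sha256", "sha384", "sha512"):
        raise ValueError(f"unsupported integrity hash: {algo}")
    return hashlib.new(algo, data).digest() == base64.b64decode(expected)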

[2]

$ grep '\.py' guix-sources.json | grep -v pythonhosted | grep '.py'
        "https://downloads.python.org/pypy/pypy3.7-v7.3.5-src.tar.bz2"
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
      "git_url": "https://github.com/bastikr/boolean.py",
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
        "https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tar.xz"
      "git_url": "https://github.com/plotly/plotly.py",
      "git_url": "https://github.com/leohemsted/smartypants.py",
        "https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tar.xz"
        "https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tar.xz"
        "http://www.home.unix-ag.org/simon/woof-2012-05-31.py"

(it's also noisy apparently ^)

[3]

{
  "type": "svn",
  "svn_url": "svn://www.tug.org/texlive/tags/texlive-2021.3/Master/texmf-dist/",
  "svn_revision": 59745
},

[4]

{
  "type": "hg",
  "hg_url": "https://hg.sr.ht/~olly/yoyo",
  "hg_changeset": "v7.2.0-release"
},

[5]

{
  "type": "git",
  "git_url": "https://github.com/gdraheim/zziplib",
  "git_ref": "v0.13.72"
},

[6]

# the only possible type in nixpkgs is url
$ grep '"type"' nixpkgs-sources-unstable.json | sort | uniq
      "type": "url",
# all possible types in guix are a bit more involved
$ grep '"type"' guix-sources.json | sort | uniq
      "type": false
      "type": "git",
      "type": "hg",
      "type": "no-origin",
      "type": "svn",
      "type": "url",

# but 'no-origin' and 'false' are just noise
    {
      "type": "no-origin",
      "name": "xfs_repair-static"
    },
    {
      "type": false
    },
ardumont renamed this task from "Replace the Nixguix loader with a lister?" to "Replace the Nixguix loader with a lister". Jun 29 2022, 11:05 AM
ardumont updated the task description.

Some more information regarding extensions supported in nixpkgs and guix manifests:

In [32]: import requests; from pathlib import Path

In [33]: sources = "https://nix-community.github.io/nixpkgs-swh/sources-unstable.json"

In [34]: data = requests.get(sources).json()

In [35]: packages = data["sources"]

In [36]: set([Path(url[0]).suffixes[-1] for url in [package["urls"] for package in packages] if Path(url[0]).suffixes])
Out[36]:
{'.3;sf=tgz',
 '.bz2',
 '.git;a=snapshot;h=5ca4ca92f629d9d83e83544b9239abaaacf0a527;sf=tgz',
 '.git;a=snapshot;h=V7_5_1;sf=tgz',
 '.gz',
 '.tar',
 '.tbz',
 '.tgz',
 '.xz',
 '.zip'}

In [37]: sources = "https://guix.gnu.org/sources.json"

In [38]: data2 = requests.get(sources).json()

In [39]: packages2 = data2["sources"]

In [45]: set([Path(url[0]).suffixes[-1] for url in [package["urls"] for package in packages2 if package["type"] == "url"] if Path(url[0]).suffixes])
Out[45]:
{'.0',
 '.1',
 '.10',
 '.14',
 '.15',
 '.18',
 '.19',
 '.2',
 '.4',
 '.5',
 '.6',
 '.7z',
 '.9',
 '.Z',
 '.bz2',
 '.c',
 '.cfg?revision=59745',
 '.el',
 '.el?id=dcc9ba03252ee5d39e03bba31b420e0708c3ba0c',
 '.gem',
 '.gz',
 '.gz?uuid=tklib-0-6',
 '.jar',
 '.lisp',
 '.love',
 '.lz',
 '.lzma',
 '.map',
 '.py',
 '.sf3',
 '.tar',
 '.tbz',
 '.tgz',
 '.ttf',
 '.txz',
 '.xz',
 '.z',
 '.zip',
 '.zst'}
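
Several of the "suffixes" above ('.cfg?revision=59745', '.gz?uuid=tklib-0-6') are artifacts of query strings rather than real extensions; stripping the query component first gives a cleaner picture. A variant of the snippet above (clean_suffix is a hypothetical helper, stdlib only):

import requests
from pathlib import Path
from urllib.parse import urlsplit

def clean_suffix(url: str) -> str:
    # Drop the query component ('?revision=...', '?uuid=...') before
    # looking at the path's extension.
    path = Path(urlsplit(url).path)
    return path.suffixes[-1] if path.suffixes else ""

data2 = requests.get("https://guix.gnu.org/sources.json").json()
suffixes = {clean_suffix(p["urls"][0])
            for p in data2["sources"]
            if p.get("type") == "url" and p.get("urls")}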

Hi,

  • artifacts url which are mostly tarballs [1] and sometimes files [2]
  • dvcs repositories delegated to dedicated loader to ingestion: svn [3], hg [4], git [5] (out of guix manifest)
  • Other stuff can be ignored as we don't have anything relevant to ingest [6]

Just to be on the same wavelength. :-) Basically, a Guix package looks like [1], where we are mainly interested in the origin field. Guix defines various "fetchers", such as url-fetch, git-fetch, hg-fetch, svn-fetch, cvs-fetch, bzr-fetch, which are self-explanatory, aren't they? ;-)

The creation of sources.json is done by walking through all the packages, extracting the origin field and writing JSON depending on the "fetcher" method; see [2]. In other words, only the metadata of each package is considered.

About mirrors, we maintain a list [3] and it is expanded (resolved) when sources.json is created.

Nothing is filtered out and sources.json contains the raw URIs used by Guix to build all the packages. This means that sometimes the source is a script or a more convoluted thing; see the packages IceCat [4] or Linux-libre [5]. We can discuss these corner cases separately.

Last, some elements about "all possible types in guix are a bit more involved":

$ cat sources.json | jq | grep '"type"' | sort | uniq -c | sort -n
      7       "type": false
     36       "type": "hg",
     88       "type": "no-origin",
    392       "type": "svn",
   7351       "type": "git",
  13602       "type": "url",

Hum, for the 7 false, I have to check. For the 88 packages with no-origin, it is more annoying. Well, some are metapackages, such as gcc-toolchain, so they can be skipped. Is it ok for you if we keep this 'no-origin' type? For the others, I have to check whether they are covered elsewhere.

Thanks for working on this improvement. :-)

1: https://archive.softwareheritage.org/swh:1:cnt:4bdc3e77922fd7133c3e765dd0f62e299e1dfbb5;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/gnu/packages/base.scm;lines=85-103
2: https://archive.softwareheritage.org/swh:1:cnt:b08ba2ea2f5b7823b3dfaf0b429e98d04e2eaebd;origin=https://git.savannah.gnu.org/git/guix/guix-artwork.git;visit=swh:1:snp:d5ca15e25a2fb9df3888dfbb0c89daacf29bcb04;anchor=swh:1:rev:71bfafb276096bcbe8bd84e2db27279e33451161;path=/website/apps/packages/builder.scm;lines=99
3: https://archive.softwareheritage.org/swh:1:cnt:d459ba8cf12dbd1b934f923a03288e4681df596d;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/guix/download.scm;lines=53
4: https://archive.softwareheritage.org/browse/content/sha1_git:4c08ef609548f398f9fc14a622c4ccf0013fa839/?origin_url=https://git.savannah.gnu.org/git/guix.git&path=gnu/packages/gnuzilla.scm#L424-L425
5: https://archive.softwareheritage.org/swh:1:cnt:d8f1f6912e6273dff53e76e75f4e29b555e4dd90;origin=https://git.savannah.gnu.org/git/guix.git;visit=swh:1:snp:045d2f1b56a871f43afce5eedafaae7f910ff5d7;anchor=swh:1:rev:eb52b240eb58627cc76a03244c45502c4ec3c50e;path=/gnu/packages/linux.scm;lines=238-245

Thanks for all that ^! And great pointers!

About mirrors, we maintain a list [3] and it is expanded (resolved) when sources.json
is created.

Great, so @vlorentz, after discussing those mirror URLs ^ a bit more with zimoun,
we should be able to avoid storing all of those tarball URLs as origins. The first one
should be considered the canonical URL, and the others fallbacks to retrieve the same
tarball if the canonical one goes 404.

That's pretty good news: fewer origins to create.
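
Concretely, the loader could then try the urls in order instead of the lister registering each mirror as a separate origin. A rough sketch (the retry policy is made up; only the canonical-first ordering comes from the sources.json "urls" lists quoted in [1] above):

import requests

def fetch_with_fallbacks(urls, timeout=30):
    # urls is a manifest entry's "urls" list: the first is canonical,
    # the rest are mirrors serving the same tarball.
    last_error = None
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.content
        except requests.RequestException as exc:
            last_error = exc  # e.g. 404 on the canonical url: try a mirror
    raise last_error if last_error else ValueError("empty url list")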

Hum, for the 7 false, I have to check. For the 88 packages with no-origin, it is more
annoying. Well, some are metapackages, such as gcc-toolchain, so they can be skipped.
Is it ok for you if we keep this 'no-origin' type? For the others, I have to check
whether they are covered elsewhere.

We'll skip them during the listing, so it's fine if they stay in the json file.
At some point, we'll adapt if we find a more suitable way to deal with them.

ardumont updated the task description.