
ingest Guix (SD) packages
Open, NormalPublic

Description

We should extend archive coverage to GuixSD packages.

Event Timeline

zack triaged this task as Normal priority.Nov 16 2018, 12:09 PM
zack created this task.
anadon added a subscriber: anadon.May 16 2020, 11:38 PM

Has there been much movement with this? It looks like only packages relying on git are archived.

ardumont added subscribers: lewo, ardumont.EditedMay 18 2020, 9:15 AM

Hello,

There has been movement in T1991 (which was not referenced as a subtask, so it
did not show here). I fixed that.

For now, we (@lewo and me) ran some tryouts on our staging infra, for nix
sources.

As @lewo mentioned [1], some slight adaptations will be needed for the
equivalent guix sources. And that loader should be able to run for those
as well.

[1] https://forge.softwareheritage.org/D2025#76592

Cheers,

ardumont added a comment.EditedMay 26 2020, 3:37 PM

As a quick follow-up, here is the current structure of the sources.json that the
nixguix loader is able to ingest. It's not much different from what @lewo
initially proposed in the lister diff.

{
  "sources": [                                                            // List of dictionaries representing source artifacts to ingest
    {
      "type": "url",                                                      // the artifact's type, for now we are dealing mostly only with tarball urls
      "urls": [
        "https://some-repository/owner-1/repository-1/revision-1.tgz",    // mirror artifact urls to retrieve data from, that's at least one entry,
                                                                          // the loader currently uses only one but that will be improved upon later
        ...
      ],
      "integrity": "sha256-3vm2Nt+O4zHf3Ovd/qsv1gKTEUwodX9FLxlrQdry0zs="  // integrity field which qualifies uniquely the artifact to ingest [1]
    },
    {
      "type": "url",
      "urls": [ "https://example.com/another-repo.tar.gz", ... ],
      "integrity": "sha256-Q0copBCnj1b8G1iZw1k0NuYasMcx6QctleltspAgXlM="
    },
    ...
  ],
  "version": 1,                                                           // version of the sources.json
  "revision": "cc4e04c26672dd74e5fd0fecb78b435fb55368f7"                  // (specific) nixpkgs repository git revision the loader is targeting,
                                                                          // so here that could be the guixsd repository git revision
}

That sample is an excerpt of the current nixguix loader test sample [2]. A
sources.json may carry more information; the loader will not use those extra
fields.
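
To make the structure above concrete, here is a minimal sketch of how such a file could be read. It is not the nixguix loader's code: the function name `read_sources` and the bare `"git"` entry in the sample are hypothetical, and the version check simply assumes the value 1 shown above.

```python
import json

# Small sample following the structure described above. The "extra" field
# and the bare "git" entry are hypothetical, added only for illustration.
SAMPLE = """
{
  "sources": [
    {
      "type": "url",
      "urls": ["https://example.com/another-repo.tar.gz"],
      "integrity": "sha256-Q0copBCnj1b8G1iZw1k0NuYasMcx6QctleltspAgXlM=",
      "extra": "not used"
    },
    {"type": "git"}
  ],
  "version": 1,
  "revision": "cc4e04c26672dd74e5fd0fecb78b435fb55368f7"
}
"""


def read_sources(text):
    """Return (revision, [(urls, integrity), ...]) for sources of type "url"."""
    data = json.loads(text)
    if str(data.get("version")) != "1":
        raise ValueError("unsupported sources.json version")
    artifacts = []
    for source in data["sources"]:
        if source.get("type") != "url":
            continue  # other types are skipped in this sketch
        urls = source.get("urls") or []
        if not urls or "integrity" not in source:
            continue  # at least one url and an integrity value are expected
        # Only the documented fields are read; extra fields are left alone.
        artifacts.append((urls, source["integrity"]))
    return data["revision"], artifacts


revision, artifacts = read_sources(SAMPLE)
```

Here `artifacts` ends up with a single entry, since the second source is not of type "url".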

[1] The "integrity" field follows the SRI (Subresource Integrity) specification:

https://www.w3.org/TR/SRI

[2] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/nixguix/tests/data/https_nix-community.github.io/nixpkgs-swh_sources.json
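
For anyone producing the "integrity" values mentioned in [1], here is a small sketch of the SRI string form for sha256: the algorithm name, a dash, then the base64-encoded raw digest. The helper names `sri_sha256` and `sri_to_hex` are made up for this example and are not part of any swh module.

```python
import base64
import hashlib


def sri_sha256(data):
    """Build an SRI-style string ("sha256-<base64 digest>") for some bytes."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def sri_to_hex(integrity):
    """Split an SRI string into (algorithm, hex digest)."""
    algo, _, b64_digest = integrity.partition("-")
    return algo, base64.b64decode(b64_digest).hex()


# Round trip on an arbitrary payload.
payload = b"hello sources.json"
integrity = sri_sha256(payload)
algo, hex_digest = sri_to_hex(integrity)
matches = hex_digest == hashlib.sha256(payload).hexdigest()
```

The hex form can be handy when comparing against checksums that tools print in hexadecimal.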

Cheers,

Hey @civodul @zimoun!

The nixguix loader has been working well for 2 weeks on the nixpkgs sources.json file!
So, we can now consider the sources.json file format as stable and you could make the required changes on your sources.json file. A new SWH origin should then be added.

Once exposed, I think it would be nice to first try to load it on the staging environment!
What do you think @ardumont ?

I'm available on IRC if you need more detail and I would be happy to review your sources.json file!

What do you think @ardumont ?

Sure thing ;)

zack added a comment.Jun 16 2020, 6:34 PM
In T1352#45459, @lewo wrote:

So, we can now consider the sources.json file format as stable and you could make the required changes on your sources.json file. A new SWH origin should then be added.

We need a name for this origin type, one of the hardest problems in CS :-)

Can you suggest something that makes sense for Nix, Guix, and other players in the field? As an outsider I'm a bit at a loss to propose something…

Repology.org went with "Gnu Guix".

Dear @lewo ,

Thank you for the notification. I tried to answer by email, but that may have failed. Anyway.

The nixguix loader has been working well for 2 weeks on the nixpkgs
sources.json file!

Cool!

  • it would be nice if you could use the SRI format for the integrity field (but this is required).
  • a sources.json example is available in the test:

https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/nixguix/tests/data/https_nix-community.github.io/nixpkgs-swh_sources.json

  • it is recommended to filter out all files that are not a tarball to reduce the loader consumption. For each visit, the loader downloads all source urls that are not already archived: if a source url cannot be archived, it is currently downloaded at each visit, for nothing. I'm currently using this stupid regex to remove unsupported urls from my list: ".tar.gz$|.zip$|tar.bz2$|.tbz$|.tar.xz$|.tgz$|.tar$". This will be improved in the next loader versions.

What do you mean?
Do you mean filter the unsupported urls for the field "urls" in the "type": "url"?
Or do you mean only export "type": "url" and remove all the other types from 'sources.json', for instance "git"?

I'm available on IRC if you need more detail and I would be happy to
review your `sources.json` file!

I will reach out there because I have some questions about details. :-)

Thank you for this loader! Really cool!

All the best,
simon

Hey @zack,

In T1352#45536, @zack wrote:
In T1352#45459, @lewo wrote:

So, we can now consider the sources.json file format as stable and you could make the required changes on your sources.json file. A new SWH origin should then be added.

We need a name for this origin type, one of the hardest problems in CS :-)

Can you suggest something that makes sense for Nix, Guix, and other players in the field? As an outsider I'm a bit at a loss to propose something…

As I see it, the loader and sources.json format have little to do with Nix and Guix: they're about listing "sources" in a broad sense.

@lewo, are you suggesting that sources.json itself be an "origin"? Isn't sources.json on a different level than the git or tar origins?

lewo added a comment.Jun 17 2020, 3:13 PM

@zimoun

Do you mean filter the unsupported urls for the field "urls" in the "type": "url"?
Or do you mean only export "type": "url" and remove all the other types from 'sources.json', for instance "git"?

Only the type "url" is currently taken into account by the loader. So, sources of type "git" are just ignored by the loader.
Regarding the type url, the loader is however only able to archive tarballs. If you expose a JAR in your sources.json file (with type url), the loader will download it and will fail to archive it. It will however try to do this for each visit (this would consume time and resources for nothing). IIRC, I had around 5000 sources of type url which were not tarballs: I removed them from my sources.json file.
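
As an illustration of that filtering step, here is a minimal sketch of how a sources.json producer could drop "url" sources that do not look like tarballs before publishing the file. The extension list mirrors the regex quoted earlier in this thread (with the dots escaped); the function name `keep_tarball_sources` is hypothetical, and keeping a source when any of its urls matches is a choice made for this example, not loader behaviour.

```python
import re

# Same extensions as the regex quoted in this thread, with dots escaped.
TARBALL_RE = re.compile(r"\.(tar\.gz|tgz|tar\.bz2|tbz|tar\.xz|tar|zip)$")


def keep_tarball_sources(sources):
    """Drop "url" sources whose urls do not end in a supported extension.

    Sources of other types are passed through unchanged, since the loader
    is described as simply ignoring them.
    """
    kept = []
    for source in sources:
        if source.get("type") != "url":
            kept.append(source)
            continue
        if any(TARBALL_RE.search(url) for url in source.get("urls", [])):
            kept.append(source)
    return kept


example = [
    {"type": "url", "urls": ["https://example.com/another-repo.tar.gz"]},
    {"type": "url", "urls": ["https://example.com/some-library.jar"]},
    {"type": "git"},
]
filtered = keep_tarball_sources(example)
```

In this example the JAR entry is removed, while the tarball and the non-"url" entry are kept.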

I will reach out there because I have some questions about details. :-)

Sure, that could be easier;)

@civodul

are you suggesting that sources.json itself be an "origin"?

The sources.json URL is an "origin". Each snapshot associated with this origin has several branches. Each branch corresponds to a source of the sources.json file.
There is also a special branch named evaluation, which points to the commit specified by the revision attribute of your sources.json file: this is to link a snapshot to a nixpkgs/guix commit.
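
To picture that layout, here is a rough sketch that builds a plain dictionary with one branch per source plus the evaluation branch. It is only a mental model, not the actual SWH data model: naming each branch after the source's first url, and the "target_type" / "target" keys and their values, are assumptions made for this example.

```python
def sketch_snapshot_branches(sources_json):
    """Return a dict of branch name -> target, mirroring the description above."""
    branches = {}
    for source in sources_json["sources"]:
        if source.get("type") != "url" or not source.get("urls"):
            continue
        # Assumption: one branch per source, named after its first url.
        branches[source["urls"][0]] = {
            "target_type": "source",  # hypothetical label
            "target": source["integrity"],
        }
    # Special branch linking the snapshot to the nixpkgs/guix commit.
    branches["evaluation"] = {
        "target_type": "commit",  # hypothetical label
        "target": sources_json["revision"],
    }
    return branches


branches = sketch_snapshot_branches({
    "sources": [
        {
            "type": "url",
            "urls": ["https://example.com/another-repo.tar.gz"],
            "integrity": "sha256-Q0copBCnj1b8G1iZw1k0NuYasMcx6QctleltspAgXlM=",
        }
    ],
    "version": 1,
    "revision": "cc4e04c26672dd74e5fd0fecb78b435fb55368f7",
})
```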

lewo added a comment.Jun 17 2020, 3:15 PM

@zack

We need a name for this origin type, one of the hardest problems in CS :-)

Where is it used? Is it a new attribute?
We actually had to choose a name for the visit type, and with a lot of inspiration, we chose nixguix :-/

zack added a comment.Jun 17 2020, 3:52 PM

@lewo it's used in our DB but also exposed in the swh-web UI in search results (and in the future it is also going to be a field for user searches, so that you can search, e.g., "emacs" only in the list of packages archived from a given origin type).

nixguix could do, it's not very inclusive of other projects in the future that might adopt the same format, but it's not a big deal (and we can change it later if need be).

In T1352#45587, @lewo wrote:

are you suggesting that sources.json itself be an "origin"?

The sources.json URL is an "origin". Each snapshot associated with this origin has several branches. Each branch corresponds to a source of the sources.json file.
There is also a special branch named evaluation, which points to the commit specified by the revision attribute of your sources.json file: this is to link a snapshot to a nixpkgs/guix commit.

Oh I see, I had completely overlooked that.

In T1352#45591, @zack wrote:

nixguix could do, it's not very inclusive of other projects in the future that might adopt the same format, but it's not a big deal (and we can change it later if need be).

Maybe sources, sourcelist, or something similar?