
[wip] swh.lister.functionalPackages: add lister getting sources from a JSON file
AbandonedPublic

Authored by lewo on Sep 22 2019, 8:59 PM.

Details

Summary
swh.lister.functionalPackages: add lister getting sources from a JSON file

This lister downloads a JSON file containing a list of sources
provided by the NixOS and Guix distributions. This file looks like:

    {
      "version": 1,
      "sources": [
        {
          "type": "url",
          "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"
        }
      ]
    }

This is a work in progress lister and we need to work on several points:

  • define a JSON format
  • expose the JSON file from a NixOS community managed server (edit(lewo): i'm working on this)

Diff Detail

Repository
rDLS Listers
Branch
json-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10163
Build 15077: tox-on-jenkinsJenkins
Build 15076: arc lint + arc unit

Event Timeline

lewo added a comment.Oct 17 2019, 9:41 AM

@ardumont @douardda Regarding the lister name, I agree the JSONLister name is not appropriate. @ardumont proposed FunctionalPackageManagerLister or FunctionalPackageLister but I'm not sure they are appropriate either :)

The file ingested by this lister already contains different kinds of elements:

If the goal is to provide reproducibility for some package managers, in the future, this could also contain git revisions, or Docker images, png images...

I don't know what is the/your definition of "package", but if we agree on "a patch is not a package", then this lister is more than a "package" lister!

Also, this lister will be used by Guix and Nix, but it could also be used by others. It could provide a quick way for communities to archive their sources with Software Heritage: it's much simpler to generate and expose a JSON file and ask Software Heritage to set up a lister pointing to this file than to write a dedicated lister. For instance, Terraform could expose a file containing pointers to the sources of all the modules they provide (see https://registry.terraform.io/).

So, what about sourcesLister, sourceListLister?

Of course, I'd be 100% ok if you want to use FunctionalPackageLister (which could be easily renamed later if needed) :)

@ardumont @douardda Regarding the lister name, I agree the JSONLister name is not appropriate. @ardumont proposed FunctionalPackageManagerLister or FunctionalPackageLister but I'm not sure they are appropriate either :)

Well, I was trying to highlight the common (and awesome) nature of both the guix and nix package managers ;)

The file ingested by this lister already contains different kinds of elements:

Yes, and it's fine.
It's source code forms.
It should be listed amongst the artifacts for the origin in question.

If the goal is to provide reproducibility for some package managers, in the future, this could also contain git revisions, or Docker images, png images...

The goal is archiving whatever is source code past, present and future.
What we can do further down the line can be reproducibility among other things.
But that would be the swh clients to determine that (using most probably some swh mirrors yet to come ;)

And sure, the lister can already schedule 'git' type origins.
Or other kinds of loading task types by the way (be that: git, svn, hg, debian, deposit, tar, npm, pypi ...).
Actually, there are at least 2 listers which do that already (bitbucket lists hg and git repositories, packagist should be able to list different dvcs as well).

I don't know what is the/your definition of "package", but if we agree on "a patch is not a package", then this lister is more than a "package" lister!

A patch is not a package indeed.
It's a source code artifact for a package though.
So it must be listed as well (within the scope of the list of artifacts).

The definition of a package was not clearly set (for me at least) but the current work on the package loader tends toward this.
A "source" package is a list of source code artifacts, be that tarballs (.zip, .tar.*, etc...), patches, .dsc, source code (<- insert your techno here) repositories, etc...

Also, this lister will be used by Guix and Nix, but it could also be used by others.

Indeed, well, as a first step, let's try to focus on guix and nix ;)
If adaptations are needed later, we'll do that.

It could provide a quick way for communities to archive their sources with Software Heritage: it's much simpler to generate and expose a JSON file and ask Software Heritage to set up a lister pointing to this file than to write a dedicated lister.

It's now becoming easier to add a new lister that adapts the JSON from the API in question than to ask upstream to adapt its JSON output...
But i may be wrong.

For instance, Terraform could expose a file containing pointers to the sources of all the modules they provide (see https://registry.terraform.io/).

Indeed, thanks for the pointer.

So, what about sourcesLister, sourceListLister?

That's what other listers do already, they list origins (exposed through some form of apis/websites) which are source code to ingest (dvcs: git, mercurial, svn, ..., plain tarballs: gnu, pypi, npm, debian packages, etc...).
In the end everything this repository lists is source code package in various forms.

Of course, I'd be 100% ok if you want to use FunctionalPackageLister (which could be easily renamed later if needed) :)

Indeed.


Although, when I looked at the guix listing again, the current output is different so far.
Will there be work with guix people (@civodul) to align the json output?


By the way, about the version again, are you sure you cannot provide it?
I see it's present in the current guix listing and that would be real swell to have it (instead of trying to parse it).
Experience shows (gnu's...) that it's not easy to parse without making choices.

By the way, i entertained the idea to write a guix lister with the current api, would that help (that could demo the tests to write as well)?

lewo added a comment.Oct 21 2019, 9:38 AM

By the way, i entertained the idea to write a guix lister with the current api, would that help (that could demo the tests to write as well)?

I think my branch is almost working (i still need to rename the lister). Also, I will try to discuss with some guix people about this format this week.

One thing I have to clarify is the notion of package and artefacts in our case! Actually, we cannot know which artefacts belong to a package. To get the list of sources required by a derivation (a kind of Nix package), I recursively walk all attributes of this derivation to find fixed output derivations (the derivations specifying a source url). Of course, these attributes include all build requirements of this derivation. This means, for each derivation, I have a tree of sources. It is then really hard to know what belongs to this derivation and what does not. So, instead of providing this graph, I only provide the list of all sources of this graph.
I think we will only have one artefact per package, otherwise, we will have the gcc artefacts in almost all packages!

lewo updated this revision to Diff 7389.Oct 21 2019, 10:44 AM
lewo edited the summary of this revision. (Show Details)
  • [wip] swh.lister.json: Add lister getting sources from JSON file
  • Rebased

I think my branch is almost working (i still need to rename the lister).

neat ;)

Also, I will try to discuss with some guix people about this format this week.

Cool. Indeed, given the difficulty of this, some feedback loop should be appreciated...

One thing I have to clarify is the notion of package and artefacts in our case!

yes, it's not that easy to grasp in the current nix/guix context ;)

I guess it depends if you decide to do an in-depth walk of the dependency graph or not...
The in-depth sounds like something unreasonable as you will end up having the world of dependencies as artifacts... (say for example, as you mention, down to the compiler's source...)

Actually, we cannot know which artefacts belong to a package. To get the list of sources required by a derivation (a kind of Nix package), I recursively walk all attributes of this derivation to find fixed output derivations (the derivations specifying a source url).
Of course, these attributes include all build requirements of this derivation. This means, for each derivation, I have a tree of sources. It is then really hard to know what belongs to this derivation and what does not. So, instead of providing this graph, I only provide the list of all sources of this graph.
I think we will only have one artefact per package, otherwise, we will have the gcc artefacts in almost all packages!

Indeed.

From my quick look at the guix listing, it seems some choices were already made, as there are not that many dependencies per package there.
(though nix/guix share the same build approach).
And it seems to follow the same line of thought, so that seems reasonable.

Let's just see where your discussion will lead this ;)

Cheers,

Also ci job currently fails for the pep8 violations [1]:

flake8 run-test: commands[0] | /home/jenkins/workspace/DLS/tox/.tox/flake8/bin/python -m flake8
./swh/lister/json/models.py:5:1: F401 'sqlalchemy.Integer' imported but unused
./swh/lister/json/tests/conftest.py:10:1: E302 expected 2 blank lines, found 1

So the tests actually did not even run.
I expect them, in the current state, to fail nonetheless because you need to rename your test dataset file (please, check my latest remark ;).

To avoid being blocked by those pep8 violations, you should be able to reproduce them by either using tox (it runs multiple things including flake8 and the unit tests) or make check at the top level of the repository.

[1] https://jenkins.softwareheritage.org/job/DLS/job/tox/458/console

swh/lister/json/tests/data/sources.nixos.org/sources.json
1 ↗(On Diff #7389)

For tests, using the requests_mock_datadir fixture (as seen below), this needs to be renamed to:

data/https_sources.nixos.org/sources.json

Provided the request query done within the lister is:

https://sources.nixos.org/sources.json
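
The naming rule described in this inline comment can be sketched as a small helper. This only illustrates the convention shown here (scheme and host joined by an underscore, followed by the file name); `datadir_path_for` is a hypothetical name, not a function of swh.core, and nested paths or query strings are deliberately not handled since the thread does not describe them.

```python
import posixpath
from urllib.parse import urlparse


def datadir_path_for(url, datadir="data"):
    """Return the path where a mocked response for `url` would be stored.

    Hypothetical helper: it mirrors the single example given in the
    review and is not the fixture's actual implementation.
    """
    parsed = urlparse(url)
    # Directory name is "<scheme>_<host>", e.g. "https_sources.nixos.org".
    host_dir = f"{parsed.scheme}_{parsed.netloc}"
    # The URL path, without its leading slash, is used as the file name.
    return posixpath.join(datadir, host_dir, parsed.path.lstrip("/"))


print(datadir_path_for("https://sources.nixos.org/sources.json"))
```

With the query URL above, the helper yields the renamed location suggested in the comment.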
zimoun added a subscriber: zimoun.Oct 25 2019, 4:15 PM

Dear,

Since Ludo (@civodul) posted this WIP feature on guix-devel mailing list [1] I am trying to follow this thread and I would like to help.

Currently, what is not clear to me are:

  • the fields of the JSON file. Can we agree on which ones are required?
  • the patches. Do they go in SWH?

Thanks for your work.

All the best,
simon


Basically, the Guix definition of a package containing patches is:

(define-public 4store
  (package
    (name "4store")
    (version "1.1.6")
    (source (origin
      (method git-fetch)
      (uri (git-reference
             (url "https://github.com/4store/4store.git")
             (commit (string-append "v" version))))
      (sha256
       (base32 "1kzdfmwpzy64cgqlkcz5v4klwx99w0jk7afckyf7yqbqb4rydmpk"))
      (patches (search-patches "4store-unset-preprocessor-directive.patch"
                               "4store-fix-buildsystem.patch"))))
    (build-system gnu-build-system)
    (blabla blabla)))

and currently https://guix.gnu.org/packages.json exposes:

{
  "name": "4store",
  "version": "1.1.6",
  "source": {
    "type": "git",
    "git_url": "https://github.com/4store/4store.git",
    "git_ref": "v1.1.6"
  },
  "synopsis": "Clustered RDF storage and query engine",
  "homepage": "https://github.com/4store/4store",
  "location": "gnu/packages/databases.scm:134"
}

And this does not exactly match the format of data/https_sources.nixos.org/sources.json:

{
  "name": "keyutils-1.6.tar.bz2",
  "source": {
    "hash": "05bi5ja6f3h3kdi7p9dihlqlfrsmi1wh1r2bdgxc0180xh6g5bnk",
    "hashAlgo": "sha256",
    "type": "url",
    "url": "https://people.redhat.com/dhowells/keyutils/keyutils-1.6.tar.bz2",
    "integrity": "sha256-067yDOwABcD6a0vkAHmIVWdHMYWxpXtimwMOZ5QscRU="
  }
}
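
The integrity value in the Nix entry appears to follow the Subresource Integrity layout: the algorithm name, a dash, then the base64-encoded digest. A minimal sketch of producing and checking such a string, assuming that layout; `sri_sha256` and `sri_matches` are illustrative names and the payload is placeholder bytes, not a real tarball.

```python
import base64
import hashlib


def sri_sha256(data: bytes) -> str:
    """Return an SRI-style string: "sha256-" + base64(sha256 digest)."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def sri_matches(data: bytes, integrity: str) -> bool:
    """Check `data` against an SRI-style sha256 integrity string."""
    return sri_sha256(data) == integrity


payload = b"example tarball bytes"  # placeholder content (assumption)
integrity = sri_sha256(payload)
print(integrity.startswith("sha256-"), sri_matches(payload, integrity))
```

A consumer could use a check of this kind to detect a tarball that was modified in place upstream, provided the integrity field is present.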

[1] https://lists.gnu.org/archive/html/guix-devel/2019-09/msg00227.html

lewo added a comment.Oct 26 2019, 10:12 PM

Since Ludo (@civodul) posted this WIP feature on guix-devel mailing list [1] I am trying to follow this thread and I would like to help.

Cool! Welcome aboard ;)

Currently, what is not clear to me are:

  • the fields of the JSON file. Can we agree on which ones are required?

This is something that still needs to be defined and I will publish a detailed comment on that.

  • the patches. Do they go in SWH?

Yes, if we want to be able to rebuild a package from SWH.

lewo added a subscriber: zack.Oct 26 2019, 11:55 PM

I discussed with Ludo and we agreed on the fact the current packages.json file is not really suitable for the SWH usecase.
tl;dr the idea is to expose a list of sources instead of a list of packages.

The objective is to be able to rebuild all packages of nixpkgs (the Nix package set) and the Guix package set even if some sources required by a package disappeared. To achieve that, we need to build the list of these sources, and ingest these sources in SWH.

We were initially exposing a list of packages, but it would be better to expose a list of sources instead. I explain below why the package-oriented approach is hard.
In nixpkgs, we expose top level packages (gcc, git, emacs,...). These packages can be used as build dependencies by other packages. For instance, we expose gcc which is used by the hello-world package during its build phase. So, if we expose the source of the gcc package and the source of hello-world package, we could rebuild hello-world by falling back on SWH. However, nixpkgs is much more complex (and powerful).

Suppose now the latest version of hello-world needs a specific patch on gcc to be compiled. In nixpkgs, we can override the gcc package by providing a patch. To achieve that, we don't create a new version of gcc, and we don't expose this gcc as a package. We just override gcc in place, in the hello-world build recipe. If we want to archive all sources required by the hello-world package, we would have to archive this patch, which is part of the overridden gcc package :/ In this kind of scenario, it is really hard to add the patch to the list of required sources of the hello-world package without adding all gcc sources.

Let me explain how I generate the list of all sources required by the package set. In Nix, we have two kinds of derivations: normal derivations and fixed output derivations. Only the fixed output derivations can access the network. These derivations contain the url of sources.
To get sources required by nixpkgs, I traverse the nixpkgs graph to extract all fixed output derivations.
If we consider our previous hello-world example, this means when I traverse the hello-world attribute, i get sources of gcc and the url of the patch. I flatten this graph to build the list of sources.
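
The traversal described above can be sketched with a toy model. This is not how Nix evaluates nixpkgs: the `Derivation` class, its `url` and `inputs` fields, and the example.org URLs are all assumptions made for illustration. A derivation carrying a `url` stands for a fixed output derivation; walking the graph and collecting those URLs yields the flat list of sources.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Derivation:
    """Toy stand-in for a Nix derivation (hypothetical model)."""
    name: str
    url: Optional[str] = None  # set only for fixed output derivations
    inputs: List["Derivation"] = field(default_factory=list)


def collect_sources(root: Derivation) -> List[str]:
    """Walk the graph from `root` and return each source URL once, sorted."""
    visited, urls = set(), set()
    stack = [root]
    while stack:
        drv = stack.pop()
        if id(drv) in visited:
            continue  # shared dependencies are only visited once
        visited.add(id(drv))
        if drv.url is not None:
            urls.add(drv.url)
        stack.extend(drv.inputs)
    return sorted(urls)


gcc_src = Derivation("gcc-src", url="https://example.org/gcc.tar.xz")
gcc_patch = Derivation("gcc-patch", url="https://example.org/fix.patch")
gcc = Derivation("gcc", inputs=[gcc_src, gcc_patch])
hello_src = Derivation("hello-src", url="https://example.org/hello.tar.gz")
hello = Derivation("hello-world", inputs=[hello_src, gcc])

print(collect_sources(hello))
```

Note that the result for hello-world contains the gcc tarball and the patch, which is exactly why the flat list cannot say which source "belongs" to which package.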

We could expose a sources.json file containing a list of sources. At the beginning, a source would be an object containing only two fields:

  • type: specifies the type of the source. Currently, only the type url is supported. This field is required.
  • url: specifies the url of the source. This field is required when the type is url.

Here is an example of such a list:

[
  {
    "type": "url",
    "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"
  },
  {
    "type": "url",
    "url": "https://github.com/curl/curl/commit/5fc28510a4664f4.patch"
  }
]
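
The two rules stated above (type is always required; url is required when the type is url) can be checked mechanically. A minimal sketch, where `validate_source` is a hypothetical helper and not part of the lister, and the second and third entries are made-up invalid inputs:

```python
def validate_source(source: dict) -> list:
    """Return the list of problems found in one source entry (empty if valid)."""
    problems = []
    if "type" not in source:
        problems.append("missing required field: type")
    elif source["type"] == "url" and not source.get("url"):
        problems.append("missing field: url (required when type is url)")
    return problems


entries = [
    {"type": "url", "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"},
    {"type": "url"},                          # invalid: no url
    {"url": "https://example.org/x.tar.gz"},  # invalid: no type
]
for entry in entries:
    print(validate_source(entry))
```

Entries of another type pass this check untouched, which leaves room for the types discussed next.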

We could later add other types (git especially), and some other fields such as

  • an integrity field: optional because we don't always have a usable checksum
  • a name: optional because a patch is not named
  • a version: optional because a patch is generally not versioned
  • ...

@zack You were talking about versioning this file. What about adding a version field attribute to the file such as:

{
  version: 1
  sources: [
    # The list of actual sources 
  ]
}

Note also that having top-level attributes in the file could also be useful for the future to add the nixpkgs commit sha for instance or some other values shared across sources.

An example of this file (only contains all sources of hello-world) is available at http://tmp.abesis.fr/sources.json

What do you think?

civodul added a subscriber: civodul.Nov 5 2019, 3:40 PM

Hi @lewo and all!

I like the spec you've come up with! Like you write, having a JSON file that is "source-oriented" rather than "package-oriented" sounds more appropriate for archiving purposes.

We could expose a sources.json file containing a list of sources. At the beginning, a source would be an object containing only two fields:

  • type: specifies the type of the source. Currently, only the type url is supported. This field is required.
  • url: specifies the url of the source. This field is required when the type is url.

LGTM, though I think we should define the git type right away. For that, we can probably reuse a format similar to that found at https://guix.gnu.org/packages.json, which looks like:

{
    "type": "git",
    "git_url": "https://github.com/pali/0xffff.git",
    "git_ref": "0.8"
}

... where git_ref can be a tag name or a commit ID.

WDYT?

We could later add other types (git especially), and some other fields such as

  • an integrity field: optional because we don't always have a usable checksum

Sure, that can always come later IMO.

  • a name: optional because a patch is not named
  • a version: optional because a patch is generally not versioned

These would be the name and version of what? Given that this format is "source-oriented", there's no notion of a package, and thus no name and version.

Anyway, if we do find a use for such extensions, I'd say that can come later. :-)

@zack You were talking about versioning this file. What about adding a version field attribute to the file such as:

{
  version: 1
  sources: [
    # The list of actual sources 
  ]
}

That LGTM.

I haven't looked at the implementation, I'm sure fellow SWH hackers will have feedback more useful than I could provide :-). That said, if this format is OK in principle for you @lewo and for SWH, I'm happy to implement it and publish it at guix.gnu.org so we can see what it's like.

Thanks for all the work, @lewo!

lewo added a comment.EditedNov 6 2019, 9:52 PM

LGTM, though I think we should define the git type right away. For that, we can probably reuse a format similar to that found at https://guix.gnu.org/packages.json, which looks like:

{
    "type": "git",
    "git_url": "https://github.com/pali/0xffff.git",
    "git_ref": "0.8"
}

... where git_ref can be a tag name or a commit ID.

WDYT?

The first version of the lister will not support git sources, only tarballs. So, i would prefer to postpone this discussion for a future version of the lister. Note the lister currently removes all sources with a type which is not equal to url (so you could add a git type if you want :/).
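
The filtering behaviour mentioned here can be sketched as follows. This is an assumption about the shape of the logic, not the lister's actual code: `url_sources` is a hypothetical function, and rejecting unknown versions is a design choice made only for this example.

```python
import json

SUPPORTED_VERSION = 1  # the only version discussed in this thread


def url_sources(document: str) -> list:
    """Parse a sources.json document and keep only entries of type "url"."""
    data = json.loads(document)
    if data.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {data.get('version')!r}")
    return [s for s in data.get("sources", []) if s.get("type") == "url"]


example = json.dumps({
    "version": 1,
    "sources": [
        {"type": "url",
         "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"},
        {"type": "git",
         "git_url": "https://github.com/pali/0xffff.git",
         "git_ref": "0.8"},
    ],
})
print(url_sources(example))
```

The git entry is silently dropped, so a producer can already publish git sources without breaking a consumer written this way.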

I haven't looked at the implementation, I'm sure fellow SWH hackers will have feedback more useful than I could provide :-). That said, if this format is OK in principle for you @lewo and for SWH, I'm happy to implement it and publish it at guix.gnu.org so we can see what it's like.

The implementation is not ready yet but I will work on it in the coming days.
@civodul I could ping you once the lister is working. We could then expose this file on guix.gnu.org and on nixos.org ;) Thx for your comments.

lewo updated this revision to Diff 7683.Nov 6 2019, 9:54 PM
  • wip: switch to the new format
lewo updated this revision to Diff 7800.Nov 13 2019, 9:43 AM
  • Rename JSONLister to FunctionalPackageLister
  • Fix test
  • Cleaning
ardumont added inline comments.Nov 13 2019, 5:06 PM
swh/lister/functional_package/tests/data/sources.nixos.org/sources.json
1 ↗(On Diff #7800)

rename this swh/lister/functional_package/tests/data/https_nixos.org/sources.json.n

for the requests_mock_datadir fixture to actually find the file (and be able to do its job ;).
Also, maybe remove some entries; there is no need for so many URLs.

For better feedback loop, you can use:

pytest -x --log-level=DEBUG ./swh/lister/functional_package/tests/test_lister.py

or

tox -- -x --log-level=DEBUG -k test_lister_no_page_check_results
lewo updated this revision to Diff 7849.Nov 14 2019, 8:19 PM
  • Move sources.json mock to correct location
lewo marked an inline comment as done.Nov 14 2019, 8:22 PM
lewo added inline comments.
swh/lister/functional_package/tests/data/sources.nixos.org/sources.json
1 ↗(On Diff #7800)

The test was passing locally because I had an older swh-core version. It should now be fixed.

Thanks!

Please, see my latest comments.

swh/lister/functional_package/lister.py
29

Please, drop this duplicated comment and explain a bit what the generated tasks are.

why is there only one artifact per package for example (<- i don't remember the detail, so that has double purposes here ;)

33

So now, this needs to be changed.

The actual loader to use now is swh.loader.core.package.archive.ArchiveLoader.

The task referring to this is:

@shared_task(name=__name__ + '.LoadArchive')
def load_archive(url=None, artifacts=None, identity_artifact_keys=None):
    return ArchiveLoader(url, artifacts,
                         identity_artifact_keys=identity_artifact_keys).load()
...

So your code can change to something like (untested):

return utils.create_task_dict(
    'load-tar', kwargs.get('policy', 'oneshot'),
    url=origin_url,                       # <- prefer to use kwargs instead of args
    artifacts=[{'archive': origin_url}],  # <- only provide what you can
    identity_artifact_keys=['archive'],   # <- unicity key
    retries_left=3)                       # <- that will fail otherwise when actually running

^ then you'd need to adapt the test below.

swh/lister/functional_package/tests/test_lister.py
10

You can remove this.

17

Please remove the print when you are done debugging ;)

25

If you adapt according to my remarks, this needs to change as well.

lewo updated this revision to Diff 7968.Nov 21 2019, 12:12 AM
lewo marked an inline comment as done.

Fix ardumont comments

lewo marked 3 inline comments as done.Nov 21 2019, 12:16 AM
lewo added inline comments.
swh/lister/functional_package/lister.py
29

I added a comment.
If you need more details, I explained why the initial package file is not suitable in this context and why we want to expose a file containing a list of sources (instead of packages) in https://forge.softwareheritage.org/D2025#51269.

33

Thanks ;)

ardumont added a comment.EditedNov 21 2019, 2:26 PM

Yes, thanks for the update.

Build has failed
See console output for more information:

https://jenkins.softwareheritage.org/job/DLS/job/tox/494/console

I fixed the ci on latest master.
Something changed in the scheduler (it no longer sets up the loader's task-types the lister generates, thus the current failure here).

Can you please just do the last adaptations?

  • Rebase to latest master
  • Rename 'load-tar' reference to 'load-archive-files'.
  • Then update the diff.

The ci should go back to green

Note: i did the necessary changes to decrease the amount of changes here ;)

Cheers,

swh/lister/functional_package/lister.py
29

Thanks!

gentle ping ;)

lewo updated this revision to Diff 8999.Jan 14 2020, 6:56 PM

Rebase and change load-tar to load-archive-files

lewo added a comment.Jan 14 2020, 6:58 PM

Sorry for the delay... I will be more responsive now.

ardumont added inline comments.Jan 16 2020, 1:09 PM
swh/lister/functional_package/tests/test_lister.py
12

test must be failing because of the old load-tar reference here, if you change that to load-archive-files, you should find what you listed ;)

Sorry for the delay... I will be more responsive now.

no problem ;)

lewo updated this revision to Diff 9099.Jan 16 2020, 7:33 PM

Fix the loader name in the test

ardumont added inline comments.Jan 16 2020, 8:20 PM
swh/lister/functional_package/tests/test_lister.py
16

erf, i missed that one as well :/

lewo updated this revision to Diff 9100.Jan 16 2020, 10:00 PM

And fix another one :/

lewo added inline comments.Jan 16 2020, 10:12 PM
swh/lister/functional_package/tests/test_lister.py
16

héhé, you are not the only one!
I'm actually no longer able to run tests locally. I need to take some time to reset my local setup.

ardumont added inline comments.Jan 17 2020, 8:41 AM
swh/lister/functional_package/tests/test_lister.py
16

Have you tried running tox --recreate (or -r for short)?

Could you please update the title and the description according to the current state?
(if you don't have time, please tell me so I will ;)

lewo retitled this revision from [wip] swh.lister.json: Add lister getting sources from JSON file to [wip] swh.lister.functionalPackages: add lister getting sources from a JSON file.Jan 21 2020, 7:12 PM
lewo edited the summary of this revision. (Show Details)
lewo edited the summary of this revision. (Show Details)Jan 29 2020, 11:33 PM
lewo added a comment.Jan 29 2020, 11:39 PM

A CI job is building a sources.json every day! The file is available at https://nix-community.github.io/nixpkgs-swh/sources.json ;)
This is a community CI (not hosted on the main NixOS infrastructure) which will allow me to iterate quickly on this file.

If you are going to the FOSDEM, would be nice to meet you there to talk about next steps!

In D2025#61931, @lewo wrote:

A CI job is building a sources.json every day! The file is available at https://nix-community.github.io/nixpkgs-swh/sources.json ;)

Awesome!

If you are going to the FOSDEM, would be nice to meet you there to talk about next steps!

I'm already in Brussels and would be happy to meet!

I can try and get a sources.json generated soon as well.

I guess support for type = "git" will come later, right?

Thank you,
Ludo'.

A CI job is building a sources.json every day! The file is available at https://nix-community.github.io/nixpkgs-swh/sources.json ;)

This is a community CI (not hosted on the main NixOS infrastructure) which will allow me to iterate quickly on this file.

nice.

If you are going to the FOSDEM, would be nice to meet you there to talk about next steps!

It would but i'm not going.

Most of the team are going though @zack, @douardda and @olasd (they can talk about the next steps as well).

Cheers,

lewo added a comment.Jan 30 2020, 8:53 PM
> If you are going to the FOSDEM, would be nice to meet you there to talk about next steps!

I'm already in Brussels and would be happy to meet!

Cool!

I can try and get a `sources.json` generated soon as well.

I guess support for `type = "git"` will come later, right?

Yes, it will come later. Note your source file can contain git sources,
but these source urls are ignored by the current lister implementation.

See you this WE;)
Antoine.

lewo added a comment.Jan 30 2020, 8:56 PM
> If you are going to the FOSDEM, would be nice to meet you there to talk about next steps!

It would but i'm not going.

Arf!

Most of the team are going though @zack, @douardda and @olasd (they can talk about the next steps as well).

Ok. I will ask on IRC tomorrow.

In D2025#61931, @lewo wrote:

A CI job is building a sources.json every day! The file is available at https://nix-community.github.io/nixpkgs-swh/sources.json ;)

Awesome!

I can try and get a sources.json generated soon as well.

While looking into this with @zimoun, we realized it would be nicer if url were an array of URLs (as is the case at https://guix.gnu.org/packages.json) rather than a single URL.

The reason is that in many cases, both Guix and Nixpkgs provide a list of URLs rather than a single URL, which is useful when one of them breaks.

WDYT? @lewo?

Ludo'.

While looking into this with @zimoun, we realized it would be nicer if url were an array of URLs (as is the case at https://guix.gnu.org/packages.json) rather than a single URL.

The reason is that in many cases, both Guix and Nixpkgs provide a list of URLs rather than a single URL, which is useful when one of them breaks.

For a concrete example of a first mirror failing: in Guix, the package perl-test-deep is mirrored and the first mirror points to http://www.cpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz, which answers "404 Not Found". Guix then falls back to the second one. Therefore, instead of the format:

{
  "type": "url",
  "url": "http://www.cpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz"
}

we are proposing to go toward:

{
  "type": "url",
  "url": [
    "http://www.cpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz",
    "http://cpan.metacpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz"
  ]
}

without changing now the crawler, i.e., the crawler can ingest only the first elem of the array and it will be modified later.
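
One way a consumer could accept both shapes during the transition is to normalise the field before use. A minimal sketch, assuming `url` is either a string or a list of strings; `candidate_urls` and `first_url` are hypothetical helpers, not part of the lister.

```python
def candidate_urls(source: dict) -> list:
    """Return the URLs of a source as a list, whatever shape `url` has."""
    value = source.get("url")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # old format: a single URL
    return list(value)  # new format: an array of mirrors


def first_url(source: dict):
    """Return the first URL (the only one ingested for now), or None."""
    urls = candidate_urls(source)
    return urls[0] if urls else None


mirrored = {
    "type": "url",
    "url": [
        "http://www.cpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz",
        "http://cpan.metacpan.org/authors/id/R/RJ/RJBS/Test-Deep-1.120.tar.gz",
    ],
}
print(first_url(mirrored))
```

Keeping the full list in `candidate_urls` means a later version could try the remaining mirrors when the first one fails, without another format change.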

WDYT @lewo?

lewo added a comment.Feb 27 2020, 10:24 PM

While looking into this with @zimoun, we realized it would be nicer if url were an array of URLs (as is the case at https://guix.gnu.org/packages.json) rather than a single URL.

without changing now the crawler, i.e., the crawler can ingest only the first elem of the array and it will be modified later.

Yeah, I was thinking to introduce this later. But as you said, we could still modify the format without supporting it in the lister.
So, that's fine for me if we generate a list of urls instead of a single url. I could easily update the file NixOS is generating.

zimoun added a comment.Mar 2 2020, 5:26 PM

@lewo: Should the version of the format be bumped to 2 with this string-to-array modification?

Then, in Guix, we recently had an issue because the upstream modified the tarball in-place; breaking guix time-machine [1]. Well, this raises 2 questions:

  1. How do we crawl back in time, i.e., reach all the sources of all the versions of the packages that have been included in Guix (at least after v0.15)?

In other words, which of these applies:

  • is sources.json a snapshot of all the sources for one specific state of Guix? If so, we would generate one big sources.json once, including all the previous states, crawl it and then remove it.
  • are all the sources for each state appended to sources.json?

  2. Some tarballs of sources are kept on the Guix build farm; should they be added to the array of URLs?

[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=39575

lewo added a comment.Mar 2 2020, 6:52 PM

> @lewo: Should the version of the format be bumped to 2 with this string-to-array modification?

No, I don't think so since it is not used yet.

Then, in Guix, we recently had an issue because the upstream modified the tarball in-place; breaking guix time-machine [1]. Well, this raises 2 questions:

  1. How do we crawl back in time, i.e., reach all the sources of all the versions of the packages that have been included in Guix (at least after v0.15)?

In other words, which of these applies:

  • is sources.json a snapshot of all the sources for one specific state of Guix? If so, we would generate one big sources.json once, including all the previous states, crawl it and then remove it.
  • are all the sources for each state appended to sources.json?

My plan is to generate one file per Hydra (the NixOS CI) evaluation: six times per day, Hydra (our CI) evaluates nixpkgs (our package repository), builds all packages and publishes them in our binary caches.
I'm planning to generate a sources.json file for each of these evaluations. At this moment, the lister only considers one sources.json file. But in the future, I think it will have to consider several sources.json to be sure it didn't miss any of these evaluations. This could be for instance a kind of RSS stream of sources.json (or something completely different).
If you want to go back in time, I think you could just generate a sources.json file for all Guix evaluations you want to consider. If the source has been modified in place... I don't think we will be able to retrieve it.

So, for the version 1, I think we'd have to generate a sources.json file, containing all sources for a specific evaluation.
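
A minimal sketch of how several per-evaluation sources.json files could be combined so that a source seen in more than one evaluation is only handled once. It assumes the version 1 layout shown in the summary (a top-level "sources" array) and that the files have already been downloaded and parsed; `merge_evaluations` and the example.org URLs are hypothetical, not part of swh.lister.

```python
import json


def merge_evaluations(documents):
    """Union the sources of several parsed sources.json documents.

    Entries are compared by their full JSON content, so a source that
    appears in several evaluations is kept once, in first-seen order.
    """
    seen = set()
    merged = []
    for document in documents:
        for source in document["sources"]:
            # A canonical dump works whether "url" is a string or a list.
            key = json.dumps(source, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(source)
    return merged


# Two made-up evaluations that share one source.
evaluation_a = {"version": 1, "sources": [
    {"type": "url", "url": "https://example.org/a-1.0.tar.gz"},
    {"type": "url", "url": "https://example.org/b-2.0.tar.gz"},
]}
evaluation_b = {"version": 1, "sources": [
    {"type": "url", "url": "https://example.org/b-2.0.tar.gz"},
    {"type": "url", "url": "https://example.org/c-3.0.tar.gz"},
]}

merged = merge_evaluations([evaluation_a, evaluation_b])
print(len(merged))
```

How the list of per-evaluation files would be discovered (RSS stream or otherwise) is left open here, as in the discussion.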

>   2. Some tarballs of sources are kept on the Guix build farm, should they be added to the array of URLs?

NixOS stores (almost) all source tarballs in an S3 content-addressable store (https://tarballs.nixos.org). I think I will add it to the list of mirrors.
But in the end, this cache could be superseded by the SWH archive.
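
A rough sketch of what "adding a mirror to the list" could mean for a consumer of the file: the mirror URL is appended after the upstream URLs, and the URLs are tried in order until one answers. The helper names and the example hosts are hypothetical, and the actual path layout under https://tarballs.nixos.org is deliberately not assumed here.

```python
import urllib.request


def with_mirror(urls, mirror_url):
    """Return the upstream URLs followed by the mirror URL, without duplicates."""
    result = list(urls)
    if mirror_url not in result:
        result.append(mirror_url)
    return result


def http_fetch(url):
    """Download one URL and return its bytes (raises OSError on failure)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_first_available(urls, fetch=http_fetch):
    """Try each URL in order and return (url, content) for the first success."""
    failures = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            failures.append((url, error))
    raise LookupError(f"none of the URLs could be fetched: {failures}")
```

Passing `fetch` as a parameter keeps the fallback logic testable without network access; a caller would normally rely on the default `http_fetch`.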

zimoun added a comment.Mar 2 2020, 7:57 PM

>> @lewo: Should the version of the format be bumped to 2 with this string-to-array modification?
>
> No, I don't think so since it is not used yet.

Ok.

> My plan is to generate one file per Hydra evaluation: six times per day, Hydra (the NixOS CI) evaluates nixpkgs (our package repository), builds all packages, and publishes them in our binary caches.
> I'm planning to generate a sources.json file for each of these evaluations. At the moment, the lister only considers one sources.json file. But in the future, I think it will have to consider several sources.json files to be sure it doesn't miss any of these evaluations. This could be, for instance, a kind of RSS stream of sources.json files (or something completely different).
> If you want to go back in time, I think you could just generate a sources.json file for each Guix evaluation you want to consider. If a source has been modified in place... I don't think we will be able to retrieve it.

That makes sense. We could do the same on our side; we just need to see with @civodul how sources.json will be built, maybe by Cuirass (our CI; roughly a rewrite of Hydra AFAIU) or by the Data Service (roughly a collector of Guix data to ease QA).

> So, for version 1, I think we'd have to generate a sources.json file containing all sources for a specific evaluation.

I'll keep you posted on where our sources.json will be available.

>> Some tarballs of sources are kept on the Guix build farm, should they be added to the array of URLs?
>
> NixOS stores (almost) all source tarballs in an S3 content-addressable store (https://tarballs.nixos.org). I think I will add it to the list of mirrors.

Ok.

> But in the end, this cache could be superseded by the SWH archive.

I agree. The point is to ease the SWH archive feeding. :-)

lewo added a comment.Mar 5 2020, 4:36 PM

After some discussions with the SWH team, it turns out this is no longer the right way to fill the archive with our sources. Instead, I'm starting to write a loader which will be in charge of reading our sources.json and filling the archive. So, I'm closing this diff and will create a new diff with a loader in the next few days ;)
There are also some advantages to implementing a loader: for instance, we could query the SWH API to know which sources of a specific sources.json file have been archived!

Note: it doesn't change anything regarding the Nix and Guix sources.json files: we still have to expose them and the format remains the same.

> After some discussions with the SWH team, it turns out this is no longer the right way to fill the archive with our sources. Instead, I'm starting to write a loader which will be in charge of reading our sources.json and filling the archive. So, I'm closing this diff and will create a new diff with a loader in the next few days ;)

@lewo: Let me know the new diff number. :-)

> There are also some advantages to implementing a loader: for instance, we could query the SWH API to know which sources of a specific sources.json file have been archived!

Yes, I imagine.

> Note: it doesn't change anything regarding the Nix and Guix sources.json files: we still have to expose them and the format remains the same.

I reported [1] the discussion to Guix, about generating a sources.json file for each evaluation of the CI. Do you have a scheme in mind to ease the job of the future loader? One sources.json per evaluation, but how will the loader know the URL where to find this sources.json file? Something else?

[1] https://lists.gnu.org/archive/html/guix-devel/2020-03/msg00030.html

lewo abandoned this revision.Mar 6 2020, 5:40 PM

> @lewo: Let me know the new diff number. :-)

Sure.

> I reported [1] the discussion to Guix, about generating a sources.json file for each evaluation of the CI. Do you have a scheme in mind to ease the job of the future loader? One sources.json per evaluation, but how will the loader know the URL where to find this sources.json file? Something else?

No, I don't really have an idea on that topic, except an RSS stream pointing to the per-evaluation sources files. So, any other ideas are welcome ;)
Also, I don't think we need this for a first version.

In D2025#65063, @lewo wrote:

>> While looking into this with @zimoun, we realized it would be nicer if url were an array of URLs (as is the case at https://guix.gnu.org/packages.json) rather than a single URL.
>>
>> without changing the crawler for now, i.e., the crawler can ingest only the first element of the array, and it will be modified later.
>
> Yeah, I was thinking of introducing this later. But as you said, we could still modify the format without supporting it in the lister.
> So, that's fine with me if we generate a list of URLs instead of a single URL. I could easily update the file NixOS is generating.

Thanks to @zimoun, https://guix.gnu.org/sources.json is now generated periodically (every hour). Each url is now a list.
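
The transition discussed above (url becoming a list while the crawler still ingests only the first element) can be sketched as follows. This is an illustration under assumptions, not the lister's code: it supposes the field is still named "url" and may hold either a string or a list, and the helper names and the mirror.example URL are made up.

```python
import json


def candidate_urls(source):
    """Return the URLs of one source entry as a list, whatever its shape."""
    url = source["url"]
    # Older files hold a single string; newer ones hold a list of URLs.
    return [url] if isinstance(url, str) else list(url)


def first_urls(document):
    """Yield the first URL of every "url"-typed source, ignoring the others."""
    for source in document["sources"]:
        if source.get("type") != "url":
            continue
        urls = candidate_urls(source)
        if urls:
            yield urls[0]


example = json.loads("""
{
  "version": 1,
  "sources": [
    {"type": "url", "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"},
    {"type": "url", "url": ["https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz",
                            "https://mirror.example/hello-2.10.tar.gz"]}
  ]
}
""")

for url in first_urls(example):
    print(url)
```

Normalizing to a list in one place means the remaining URLs can later be used as fallbacks without touching the file format again.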

lewo added a comment.Mar 9 2020, 2:35 PM

> Thanks to @zimoun, https://guix.gnu.org/sources.json is now generated periodically (every hour). Each url is now a list.

Cool! Thanks @zimoun ;)

lewo added a comment.Mar 13 2020, 10:36 AM

@civodul @zimoun I'm wondering whether you generate a sources.json file for every commit of your guix repository, or only for those that have been evaluated and pushed to your binary cache by your CI?

Hello!

In D2025#67709, @lewo wrote:

> @civodul @zimoun I'm wondering whether you generate a sources.json file for every commit of your guix repository, or only for those that have been evaluated and pushed to your binary cache by your CI?

https://guix.gnu.org/sources.json is built periodically from the tip of the master branch, independently of the CI status.

What's the status of this patch series? Would be great to deploy it. :-)

lewo added a comment.Thu, May 14, 10:43 AM

> What's the status of this patch series? Would be great to deploy it. :-)

This patch set has been running on staging for ~2 weeks. I hope it will be deployed to prod soon, maybe next week? ;)

> https://guix.gnu.org/sources.json is built periodically from the tip of the master branch, independently of the CI status.

Some minor changes will be required to your current sources.json. I'm waiting until the loader is in prod before showing you what to change, to avoid any useless changes on your side!