
[wip] swh.lister.json: Add lister getting sources from JSON file
Needs Revision, Public

Authored by lewo on Sun, Sep 22, 8:59 PM.

Details

Summary

This JSON lister downloads a JSON file containing a list of
sources to create loader tasks. This input file looks like:

[
  {
    "name": "hello-2.10.tar.gz",
    "source": {
      "type": "url",
      "integrity": "sha256-MeBmE3qWJnbon2nRtlOC3pWn732RS4y5VvQepy4PUWs=",
      "url": "https://ftpmirror.gnu.org//hello/hello-2.10.tar.gz"
    }
  }
]

The integrity attribute is a checksum of the content specified with
the SRI format (see https://www.w3.org/TR/SRI). It is currently used
as an index in the lister JSON table but can be omitted. In this case,
the url is used instead.
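The fallback described above could be sketched roughly as follows. This is only an illustration, assuming entries shaped like the example; `index_key` is a hypothetical helper name, not a function taken from the diff.

```python
from typing import Any, Dict


def index_key(entry: Dict[str, Any]) -> str:
    """Return the value used to index one entry (hypothetical helper).

    The SRI `integrity` string is preferred; when it is missing or empty,
    the source url is used instead.
    """
    source = entry["source"]
    return source.get("integrity") or source["url"]
```

With this shape, two entries pointing at the same content under different urls share one index value as long as both carry an integrity attribute.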

This is a work in progress lister and we need to work on several points:

  • define a JSON format
  • expose the JSON file from a NixOS community managed server
  • make a working loader! (wip in D2145) (edited by ardumont: not mandatory to land)

Diff Detail

Repository
rDLS Listers
Branch
json-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 7921
Build 11406: tox-on-jenkins (Jenkins)
Build 11405: arc lint + arc unit

Event Timeline

lewo created this revision. Sun, Sep 22, 8:59 PM
lewo added a comment. Sun, Sep 22, 9:27 PM

This is a work in progress lister and we need to work on several points:

  • define a JSON format
  • make a working loader!
  • expose the JSON file from a NixOS community managed server

JSON format

The JSON format used by the lister is really close to the one proposed by the Guix community. The only difference is the presence of the hash attribute which contains the hash of the source. This hash is currently used to index sources in the lister DB.

Loader tasks

This lister is able to create loader tasks (load-tar) but these tasks do not execute properly.

I tried to create LoadTarRepository tasks but i got too many failures. I then switched to load-tar tasks. These loader tasks successfully load some url content but they do not work well either.
It would be nice to get some help on this part!

I'm testing this lister by running the command:

docker-compose exec swh-lister python3 -c 'import logging; from swh.lister.json.tasks import json_lister; logging.basicConfig(level=logging.DEBUG); json_lister(url="http://tmp.abesis.fr/swh-021d733ea3f87b8c9232020b4e606d08eaca160b.json")'

Expose the JSON file from a NixOS community managed server

I will submit a pull request once we agree on a JSON format.

ardumont added subscribers: douardda, ardumont. Edited Mon, Sep 23, 10:48 AM

Hello, nice to see some code (i did not check yet though ;)

make a working loader!

jsyk, @douardda and i are working on the package loader (T1389)
We are currently reimplementing the pypi loader and the gnu one to see if what we did fits ;)
So like i said early on, don't worry too much about that \m/

The only difference is the presence of the hash attribute which contains the hash of the source. This hash is currently used to index sources in the lister DB.

Yes, that'd be great to have some hash(es?) on those artifacts.
That way, we could make some more checks along the way (aside from the length ;)

The JSON format used by the lister is really close to the one proposed by the Guix community.

In that regard, what's the hash in question?
Maybe this could be mentioned either in comments or by changing the key to be explicit?

Cheers,

lewo updated this revision to Diff 6807. Mon, Sep 23, 9:16 PM

Using integrity attribute instead of hash. The format of this attribute follows the SRI specification.

lewo added a comment. Mon, Sep 23, 9:25 PM

jsyk, @douardda and i are working on the package loader (T1389)

Cool!

We are currently reimplementing the pypi loader and the gnu one to see if what we did fits ;)

Let me know when/if i can do some tests with it :)

Yes, that'd be great to have some hash(es?) on those artifacts.

We currently only have one hash and i don't think this will change soon.

That way, we could make some more checks along the way (aside from the length ;)

Yep, it would be nice to be sure what is ingested by SWH is what we are expecting :) But, this is out of the scope of this PR I think.

In that regard, what's the hash in question?
Maybe this could be mentioned either in comments or by changing the key to be explicit?

I just changed the implementation:

  • the name of the attribute is integrity
  • it takes an SRI (https://www.w3.org/TR/SRI)
  • it is actually optional: if not provided, the url is used for the table index.
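For reference, a minimal sketch of how such an integrity value can be produced and checked with the Python standard library. It assumes a single `<algorithm>-<base64 digest>` pair with no SRI options; `make_integrity` and `matches_integrity` are hypothetical names, not part of the lister.

```python
import base64
import hashlib
import hmac


def make_integrity(content: bytes, algorithm: str = "sha256") -> str:
    """Build an SRI-style string: '<algorithm>-<base64 of the raw digest>'."""
    digest = hashlib.new(algorithm, content).digest()
    return "{}-{}".format(algorithm, base64.b64encode(digest).decode("ascii"))


def matches_integrity(content: bytes, integrity: str) -> bool:
    """Check content against a single integrity value (no options, one hash)."""
    algorithm, _, _ = integrity.partition("-")
    # Recompute with the same algorithm and compare in constant time.
    return hmac.compare_digest(make_integrity(content, algorithm), integrity)
```

The SRI specification also allows several hashes separated by whitespace; this sketch deliberately ignores that case.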
lewo edited the summary of this revision. Mon, Sep 23, 9:27 PM
lewo edited the summary of this revision. Mon, Sep 23, 9:36 PM
lewo edited the summary of this revision.

I just changed the implementation:
the name of the attribute is integrity
it takes an SRI (https://www.w3.org/TR/SRI)
it is actually optional: if not provided, the url is used for the table index.

Awesome, thanks for the heads up.

make a working loader! (wip in https://forge.softwareheritage.org/T1389)

To be clear and repeat what i said in other media, this is not in the diff's scope.

define a JSON format

Who needs to do that?
What's missing right now?

expose the JSON file from a NixOS community managed server

Indeed.


Just so you know, the typical blockers i foresee are the following:

  • you need to sign the L3 document [1]
  • add some more tests (typically with some api sample files, once that's settled upon). Check the cgit lister for an example.

[1] https://forge.softwareheritage.org/L3

Cheers,

swh/lister/json/lister.py
14

Something more specific will be needed.
JSON is too generic, and misleading since other apis can also return json.

FunctionalPackageManagerLister?
FunctionalPackageLister?

In my head, only guix and nix qualify for it so far (and they will have something sufficiently close IIUC).

70

From the mailing list discussion, i recall it's not only tar origins we could list.

ardumont requested changes to this revision. Fri, Oct 11, 6:27 PM

Needs a rebase.

Also now, you can implement at least one real integration test to ensure everything is as you want.

Check for example the gnu lister's test [1]

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gnu/tests/test_lister.py$0-12

swh/lister/cli.py
131

This no longer exists.
After the rebase, add an entry in the setup.py [1]
And everything should be ok

Something like:

# assuming you change from `json` to something more generic, `functionalpackage` (or something)
lister.functionalpackage=swh.lister.functionalpackage:register

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/setup.py$56-67

swh/lister/json/lister.py
73

After rebasing, this can go away (the base implementation already works that way).

This revision now requires changes to proceed. Fri, Oct 11, 6:27 PM
ardumont edited the summary of this revision. Fri, Oct 11, 6:28 PM
ardumont edited the summary of this revision.
ardumont edited the summary of this revision.
ardumont edited the summary of this revision.
ardumont added a project: Lister.
ardumont edited the summary of this revision.
ardumont added inline comments. Tue, Oct 15, 5:21 PM
swh/lister/json/lister.py
32

You could drop the package's name i guess.

33

I guess we could:

  • drop the package's name (it's mostly unused in other listers and i'm dropping when i can)
  • use the named parameter instead (it's clearer, also when introspecting the scheduler db)
  • rename tarballs to packages.

About the integrity field, i guess we can split it so that the key explicitly names the hash algorithm...
So something like this would do:

packages = build_packages(...)  # < to clear things up a bit
# where packages is of the form:
# [{'uri': origin_url, 'date': <some-date-isoformat>, "sha256": "MeBmE3qWJnbon2nRtlOC3pWn732RS4y5VvQepy4PUWs="}]

return utils.create_task_dict(
    'load-tar', kwargs.get('policy', 'oneshot'),
    origin=origin_url,
    packages=packages)

@douardda ^ what do you think?

ardumont added inline comments. Tue, Oct 15, 5:26 PM
swh/lister/json/lister.py
33

If we could have the version of the package also (within the packages's entries), that'd be awesome.

ardumont added inline comments. Tue, Oct 15, 5:44 PM
swh/lister/json/lister.py
33

Also for the hash, i mean the base64-decoded value, hex-encoded as an ascii string...

so packages really becomes:

[{'uri': origin_url, 'date': <some-date-isoformat>, "sha256": "31e066137a962676e89f69d1b65382de95a7ef7d914b8cb956f41ea72e0f516b"}]

That unifies with other existing lister output and loader expectations.

The following will help:

from base64 import b64decode
from binascii import hexlify
from typing import Tuple


def integrity_to_hash(integrity_value: str) -> Tuple[str, str]:
    """Parse an integrity field into a tuple (hash_name, hash_hex) [1]

    [1] https://www.w3.org/TR/SRI

    """
    # Split on the first '-' only, so the unpacking always yields two parts
    hash_name, base64_value = integrity_value.split('-', 1)

    hash_hex = hexlify(b64decode(base64_value)).decode('utf-8')
    return hash_name, hash_hex


def test_integrity_to_hash():
    """Parsing an integrity field should return a (hash_name, hash_hex) tuple

    """
    actual_hash_name, actual_hash_hex = integrity_to_hash(
        'sha256-MeBmE3qWJnbon2nRtlOC3pWn732RS4y5VvQepy4PUWs=')

    assert actual_hash_name == 'sha256'
    assert actual_hash_hex == '31e066137a962676e89f69d1b65382de95a7ef7d914b8cb956f41ea72e0f516b'  # noqa
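
Putting the two suggestions together, the `packages` list proposed earlier in this thread could be built along these lines. This is a sketch under assumptions: `build_packages` is only the placeholder name used above, its signature is a guess, and the `date` key is left out because the JSON entries shown so far do not carry one.

```python
from base64 import b64decode
from typing import Any, Dict, List


def build_packages(entries: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn JSON entries into [{'uri': ..., '<hash name>': '<hex digest>'}]."""
    packages = []
    for entry in entries:
        source = entry["source"]
        package = {"uri": source["url"]}
        integrity = source.get("integrity")
        if integrity:
            # 'sha256-<base64>' becomes {'sha256': '<hex>'}
            hash_name, base64_value = integrity.split("-", 1)
            package[hash_name] = b64decode(base64_value).hex()
        packages.append(package)
    return packages
```

Entries without an integrity attribute simply yield a package with only the `uri` key.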
lewo marked an inline comment as done. Tue, Oct 15, 6:26 PM
lewo added inline comments.
swh/lister/json/lister.py
33

Unfortunately, it's difficult to get the version.
In nixpkgs, we basically have this kind of structure:

packages = {
  hello = {
    name = "hello";
    version = "1.0";
    buildRecipe = "make";
    src = {
      url = "http://gnu/hello-1.0.tgz";
      sha = "bla";
    };
  };
};

So, the src attribute is not versioned. The version is on the package level.
And this can be much more complex: one package can use several patches and several sources.

Just one note: the name is IMHO inappropriate. This is NOT a JSON lister. JSON is nothing but the serialization format used to retrieve some (more or less) well defined structured data.

What defines this lister is its ability to comprehend the very data structure mentioned in this diff's description.

The question is: what data model does this implement? Is it Guix (only)? is it somewhat standardized?

ardumont edited the summary of this revision. Wed, Oct 16, 10:11 AM
ardumont added inline comments. Wed, Oct 16, 10:20 AM
swh/lister/json/lister.py
33

Right.

Nonetheless, we could have that lister parse the version from what's provided (here the url then).

Developing the loader tar (D2145) raises interesting questions about the gnu loader [1] and the new one.
Aside from the version parsing logic, their implementations are nearly identical.
This raises the question of whether we should move that parsing step here into the listers and pass that information along to the loader (D2145's description explains the rationale).

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/package-loader/swh/loader/package/gnu.py$111-187

To develop further, i guess we need some more dataset sample though ;)
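
As a rough idea of what parsing the version from the url could look like, here is a heuristic sketch. It is not the gnu loader's logic: the suffix list, the regular expressions and the `guess_version` name are all assumptions made for illustration.

```python
import posixpath
import re
from typing import Optional
from urllib.parse import urlparse

# Archive extensions this sketch knows about (an arbitrary, incomplete list).
ARCHIVE_SUFFIX = re.compile(r"\.(tar\.(gz|bz2|xz|lz)|tgz|zip)$")
# '<name>-<version>' where the version starts with a digit.
NAME_VERSION = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)$")


def guess_version(url: str) -> Optional[str]:
    """Guess a version from a tarball url, or return None when unsure."""
    filename = posixpath.basename(urlparse(url).path)
    stem, stripped = ARCHIVE_SUFFIX.subn("", filename)
    if not stripped:
        return None  # not a recognised archive name
    match = NAME_VERSION.match(stem)
    return match.group("version") if match else None
```

Names such as `foo-1-2.tar.gz` are ambiguous for a pattern like this, which is the kind of choice a real implementation would have to make explicitly.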

ardumont added a comment. Edited Wed, Oct 16, 10:27 AM

Just one note: the name is IMHO inappropriate. This is NOT a JSON lister. JSON is nothing but the serialization format used to retrieve some (more or less) well defined structured data.
What defines this lister is its ability to comprehend the very data structure mentioned in this diff's description.

Indeed, i proposed a name in D2025#inline-13301 ;)

The question is: what data model does this implement? Is it Guix (only)? is it somewhat standardized?

My understanding of the swh-devel mailing-list discussion [1] is that standardization between nix and guix is the way forward.

And to repeat an interesting thing i forgot, guix already exposes its listing [2]

[1] https://sympa.inria.fr/sympa/arc/swh-devel/2019-10/msg00000.html

[2] https://guix.gnu.org/packages.json

lewo added a comment. Thu, Oct 17, 9:41 AM

@ardumont @douardda Regarding the lister name, I agree the JSONLister name is not appropriate. @ardumont proposed FunctionalPackageManagerLister or FunctionalPackageLister but I'm not sure they are appropriate either :)

The file ingested by this lister already contains different kinds of elements:

If the goal is to provide reproducibility for some package managers, in the future, this could also contain git revisions, or Docker images, png images...

I don't know what is the/your definition of "package", but if we agree on "a patch is not a package", then this lister is more than a "package" lister!

Also, this lister will be used by Guix and Nix, but it could also be used by others. It could provide a quick way for communities to archive their sources with softwareheritage: it's much simpler to generate and expose a JSON file and ask softwareheritage to set up a lister pointing to this file than to write a dedicated lister. For instance, Terraform could expose a file containing pointers to the sources of all the modules they provide (see https://registry.terraform.io/).

So, what about sourcesLister, sourceListLister?

Of course, I'd be 100% ok if you want to use FunctionalPackageLister (which could be easily renamed later if needed) :)

@ardumont @douardda Regarding the lister name, I agree the JSONLister name is not appropriate. @ardumont proposed FunctionalPackageManagerLister or FunctionalPackageLister but I'm not sure they are appropriate either :)

Well, i was trying to make the common (and awesome) nature of both guix and nix package managers stand out ;)

The file ingested by this lister already contains different kinds of elements:

Yes, and it's fine.
They're all forms of source code.
It should be listed amongst the artifacts for the origin in question.

If the goal is to provide reproducibility for some package managers, in the future, this could also contain git revisions, or Docker images, png images...

The goal is archiving whatever is source code past, present and future.
What we can do further down the line can be reproducibility among other things.
But that would be for the swh clients to determine (most probably using some swh mirrors yet to come ;)

And sure, the lister can schedule 'git' type origins already.
Or other kinds of loading task types by the way (be that: git, svn, hg, debian, deposit, tar, npm, pypi ...).
Actually, there are at least 2 listers which do that already (bitbucket lists hg and git repositories, packagist should be able to list different dvcs as well).
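
To illustrate the idea, a lister could pick the loading task type from each source's `type` attribute. This is a hypothetical sketch: only the `url` type and the `load-tar` task appear in this diff, while the `git` entry, the `load-git` name and `task_type_for` are assumptions.

```python
from typing import Any, Dict

# Source "type" -> loading task type. Only 'url' -> 'load-tar' comes from
# this diff; the 'git' entry is an assumed example of a second type.
TASK_TYPES = {
    "url": "load-tar",
    "git": "load-git",
}


def task_type_for(source: Dict[str, Any]) -> str:
    """Return the task type to schedule for one source entry."""
    source_type = source.get("type")
    if source_type not in TASK_TYPES:
        raise ValueError("unsupported source type: {!r}".format(source_type))
    return TASK_TYPES[source_type]
```

Unknown types raise an error here so that new kinds of sources are noticed rather than silently skipped.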

I don't know what is the/your definition of "package", but if we agree on "a patch is not a package", then this lister is more than a "package" lister!

A patch is not a package indeed.
It's a source code artifact for a package though.
So it must be listed as well (within the scope of the list of artifacts).

The definition of a package was not clearly set (for me at least) but the current work on the package loader tends towards this.
A "source" package is a list of source code artifacts, be that tarballs (.zip, .tar.*, etc...), patches, .dsc, source code (<- insert your techno here) repositories, etc...

Also, this lister will be used by Guix and Nix, but it could also be used by others.

Indeed, well, as a first step, let's try to focus on guix and nix ;)
If adaptations are needed later, we'll do that.

It could provide a quick way for communities to archive their sources with softwareheritage: it's much simpler to generate and expose a JSON file and ask softwareheritage to set up a lister pointing to this file than to write a dedicated lister.

It's now becoming easier to add a new lister in charge of adapting the json from the api in question than to ask for the json output to be adapted...
But i may be wrong.

For instance, Terraform could expose a file containing pointers to the sources of all the modules they provide (see https://registry.terraform.io/).

Indeed, thanks for the pointer.

So, what about sourcesLister, sourceListLister?

That's what other listers do already, they list origins (exposed through some form of apis/websites) which are source code to ingest (dvcs: git, mercurial, svn, ..., plain tarballs: gnu, pypi, npm, debian packages, etc...).
In the end, everything this repository lists is source code packages in various forms.

Of course, I'd be 100% ok if you want to use FunctionalPackageLister (which could be easily renamed later if needed) :)

Indeed.


Although, when i looked at the guix listing again, the current output is different so far.
Will there be work with guix people (@civodul) to align the json output?


By the way, about the version again, are you sure you cannot provide it?
I see it's present in the current guix listing and that would be real swell to have it (instead of trying to parse it).
Experience shows (gnu's...) that it's not easy to parse without making choices.

By the way, i entertained the idea to write a guix lister with the current api, would that help (that could demo the tests to write as well)?