Page MenuHomeSoftware Heritage

swh.loader.gnu: Implement gnu loader
AbandonedPublic

Authored by ardumont on Jul 11 2019, 6:13 PM.

Details

Reviewers
nahimilega
Group Reviewers
Reviewers
Summary

Implemented gnu loader using the base loader
Depends on D1744

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
gnu
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 7016
Build 9874: tox-on-jenkinsJenkins
Build 9873: arc lint + arc unit

Event Timeline

nahimilega edited the summary of this revision. (Show Details)
nahimilega edited the summary of this revision. (Show Details)
  • swh.loader.gnu: Add tests
swh/loader/gnu/tests/test_loader.py
44–82

These hashes are not updated for gnu loader. I found the swh identify method. But I didn't understand what exactly these hashes represent. I mean that swh identify tool return only one hash whereas we have a list of hashes here.
Can you please explain what the hashes in _expected_new_contents_first_visit, _expected_new_directories_first_visit, _expected_new_revisions_first_visit mean in a bit detaild way.
Is there any documentation for the reference?
(if no then I suppose we would need it)

swh/loader/gnu/tests/test_loader.py
44–82

These hashes are not updated for gnu loader.

You need to have the correct hashes according to the archive's tree structure (raw file as content, directories for tree, etc...) you ingest.

I found the swh identify method. But I didn't understand what exactly these hashes represent.

That's a tad problematic since that's what the loader is all about ;)

I mean that swh identify tool return only one hash whereas we have a list of hashes here.

Yes, because each hash represents the hash (sha1 or sha1_git, don't remember which) of the raw content of the files found in the tarballs.

Can you please explain what the hashes in _expected_new_contents_first_visit, _expected_new_directories_first_visit, _expected_new_revisions_first_visit mean in a bit detaild way.

Aren't you the author of this? (I don't understand the question ;')
If no, where does this come from then?

The names seems to be explicit enough.
The first visit is the first time you try to load the origin (the gnu tarball tried to be ingested in the archive is fetched from the origin url as input)...

Indeed, reading back the documentation (persistent identifier and data model) would be a good hint (see below).

Is there any documentation for the reference?
(if no then I suppose we would need it)

There is:

  • identifier: [1]
  • data model: [2]
  • the code in swh-model (starting from swh-identify following it through down to the identifiers/hashutil modules).

[1] https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers

[2] https://docs.softwareheritage.org/devel/swh-model/data-model.html#data-model

anlambert added inline comments.
swh/loader/gnu/loader.py
15

Here you should add these:

CONFIG_BASE_FILENAME = 'loader/gnu'
ADDITIONAL_CONFIG = {
  'temp_directory': ('str', '/tmp/swh.loader.gnu/'),
  'debug': ('bool', False)
}

in order to declare the config filename and some default config entries for the loader.

20
swh/loader/gnu/tasks.py
12

You need to use keyword arguments here:

return GNULoader().load(name=name, origin_url=origin_url, tarballs=tarballs)
swh/loader/gnu/loader.py
15

This was present in base loader class itself, it would assign config filename and some default config entries on the bases of loader name. but after a discussion with @ardumont in IRC and via these comments(https://forge.softwareheritage.org/D1694?id=5884#inline-11570 ) it was removed.
There are three things what we can do

  1. add it again in the base loader
  2. add them separately for each loader
  3. Discard these config filenames and some default config entries completely and use environment variable stanza(as discussed in comment)

What should I do?

Please either abandon the diff or mark this as planned changes like the others ;)

ardumont abandoned this revision.
ardumont added a reviewer: nahimilega.