
Add Directory Loader to allow tarball ingestion as Directory
Closed, Public

Authored by ardumont on Sep 30 2022, 11:56 AM.

Details

Summary

In some marginal listing cases (Nix or Guix for now), we can receive a raw tarball to
ingest. This commit adds a loader to ingest those. The output of the ingestion is a
snapshot with a single branch, HEAD, targeting the ingested directory (contained
within the tarball).

This expects to receive a mandatory 'integrity' field. It is used to check the tarball
retrieved from the origin.

This can also optionally receive a list of mirror URLs in case the main origin URL is no
longer available. Those mirror URLs are solely used as fallbacks to retrieve the tarball.
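
The fallback behaviour described above can be illustrated with a minimal sketch. It is not the code from this diff: the function name `fetch_first_available`, the injected `fetch` callable, and the use of `OSError` to signal a failed download are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def fetch_first_available(
    origin_url: str,
    fallback_urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try the origin URL first, then each mirror, and return the first payload.

    `fetch` is a hypothetical download function; it is assumed to raise
    OSError when a URL cannot be retrieved.
    """
    last_error: Optional[Exception] = None
    for url in [origin_url, *fallback_urls]:
        try:
            return fetch(url)
        except OSError as exc:
            # Remember the failure and move on to the next candidate URL.
            last_error = exc
    raise ValueError(f"No URL could be fetched for {origin_url}") from last_error
```

Injecting `fetch` keeps the sketch independent of any HTTP library and lets it be exercised without network access. The integrity check would then run on the returned bytes, whichever URL they came from.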

Related to T3781
Depends on D8581

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont retitled this revision from Add Directory Loader to ingest raw tarball to Add Directory Loader to ingest tarball as directory.
ardumont retitled this revision from Add Directory Loader to ingest tarball as directory to Add Directory Loader to allow tarball ingestion as Directory. Sep 30 2022, 11:59 AM

Amend commit message and fix typo

Build has FAILED

Patch application report for D8584 (id=30988)

Could not rebase; Attempt merge onto 6299c091ec...

Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                            |   2 +-
 swh/loader/core/loader.py                          | 241 ++++++++++++++++++++-
 .../project_asdf_archives_asdf-3.3.5.lisp          |   1 +
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 204 ++++++++++++++++-
 5 files changed, 440 insertions(+), 8 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_common-lisp.net/project_asdf_archives_asdf-3.3.5.lisp
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit 10076b690145ee3c306e923eea29b5ede907da57
Merge: 6299c09 12da8df
Author: Jenkins user <jenkins@localhost>
Date:   Fri Sep 30 09:56:36 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 12da8df8ee7277b9c208fdd282be92c87cb70a2e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to ingest raw tarball
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

commit c5fcf4025bb878df9541bee1e8c55006ba1874df
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/922/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/922/console

Build has FAILED

Patch application report for D8584 (id=30989)

Could not rebase; Attempt merge onto 6299c091ec...

Merge made by the 'recursive' strategy.
 .pre-commit-config.yaml                            |   2 +-
 swh/loader/core/loader.py                          | 240 ++++++++++++++++++++-
 .../project_asdf_archives_asdf-3.3.5.lisp          |   1 +
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 204 +++++++++++++++++-
 5 files changed, 439 insertions(+), 8 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_common-lisp.net/project_asdf_archives_asdf-3.3.5.lisp
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit 3de188cb01ed3e21492491bb207da019b20b5742
Merge: 6299c09 628efbf
Author: Jenkins user <jenkins@localhost>
Date:   Fri Sep 30 09:59:58 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 628efbf0d9502a45acd55c49a69f1251ac093c06
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

commit c5fcf4025bb878df9541bee1e8c55006ba1874df
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/923/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/923/console

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 30 2022, 12:04 PM
Harbormaster failed remote builds in B31959: Diff 30989!

ah, ok, pytest is happy here 'cause i still have the other diff i abandoned on swh.model...
But the current version gives:

DEBUG    swh.loader.core.loader.DirectoryLoader:loader.py:818 Error: Unexpected hashing algorithm sha512, expected one of blake2b512, blake2s256, md5, sha1, sha1_git, sha256

Rebase and make tests ok

I had a local abandoned patch (around swh.model) which made my local tests pass...

Build is green

Patch application report for D8584 (id=30991)

Could not rebase; Attempt merge onto 6299c091ec...

Updating 6299c09..4eaa99e
Fast-forward
 .pre-commit-config.yaml                            |   2 +-
 swh/loader/core/loader.py                          | 236 ++++++++++++++++++++-
 .../project_asdf_archives_asdf-3.3.5.lisp          |   1 +
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 204 +++++++++++++++++-
 5 files changed, 435 insertions(+), 8 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_common-lisp.net/project_asdf_archives_asdf-3.3.5.lisp
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit 4eaa99ea751f49d5453dbb51e2361f9d070d3dd8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

commit f774aba59e65bd3e5dd0ba9364840d8903d5706c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/924/ for more details.

Build is green

Patch application report for D8584 (id=30998)

Rebasing onto f774aba59e...

Current branch diff-target is up to date.
Changes applied before test
commit 497f74f3225e4ccf11adce0d6a2bb50b2a471fab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/926/ for more details.

anlambert added inline comments.
swh/loader/core/loader.py
658–693

Why not simply pass a dict with hashing algorithms as keys and checksums as values?
This would make the loader more generic and allow checking multiple checksums for
a single content/directory.
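
A minimal sketch of what such a dict-based check could look like, using only the standard library. The name `verify_checksums` and the set of accepted algorithms are assumptions for illustration, not the loader's actual API.

```python
import hashlib
from typing import Dict

# Assumption: a small allow-list of algorithms, all available in hashlib.
SUPPORTED_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}


def verify_checksums(data: bytes, checksums: Dict[str, str]) -> None:
    """Check `data` against every {algorithm: hex checksum} pair.

    Raises ValueError on an unsupported algorithm or on the first mismatch.
    """
    for algo, expected in checksums.items():
        if algo not in SUPPORTED_ALGORITHMS:
            raise ValueError(f"Unexpected hashing algorithm {algo}")
        actual = hashlib.new(algo, data).hexdigest()
        if actual != expected:
            raise ValueError(
                f"Checksum mismatch for {algo}: expected {expected}, got {actual}"
            )
```

With this shape, a caller can pass one or several checksums for the same artifact, and each one is verified in turn.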

swh/loader/core/loader.py
658–693

That's what we have for now in the nixguix manifests, the "integrity" field.

See [1] and the linked draft [2]

[1] T3781

[2] https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw

swh/loader/core/loader.py
658–693

Yes but I mean you could do the integrity string decoding in the nixguix lister instead.

If we want to reuse the directory loader for other use cases, we will have to generate an integrity
string ourselves instead of simply passing a dict of hashes, which is quite cumbersome imho.

swh/loader/core/loader.py
658–693

yes, thinking more about your first question...

We can indeed transform the output of the new nixguix lister [1] to provide a dict of checksums to check.

I'm not a big fan of the lister doing the check though (it does not retrieve the tarballs when it lists).

I'll make the adaptations you suggested after this lands though (the content loader already landed and has a similar implementation to this one... plus the stack of diffs...).

[1] D8341

swh/loader/core/loader.py
658–693

I'm not a big fan of the lister doing the check though (it does not retrieve the tarballs when it lists).

Of course, it is the loader that has to perform that check. I was just talking about extracting the
hash algo and the checksum in the lister to pass them as parameters to the loader.
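
As a rough sketch of that extraction step, assuming the 'integrity' value follows the Subresource Integrity layout `<algo>-<base64 digest>` (an assumption here, as the manifest format is not shown in this thread). The helper name `integrity_to_checksums` is hypothetical.

```python
import base64
import hashlib
from typing import Dict


def integrity_to_checksums(integrity: str) -> Dict[str, str]:
    """Split an '<algo>-<base64 digest>' string into {algo: hex digest}."""
    algo, sep, b64_digest = integrity.partition("-")
    if not sep or not algo or not b64_digest:
        raise ValueError(f"Malformed integrity string: {integrity!r}")
    # validate=True rejects characters outside the base64 alphabet.
    digest = base64.b64decode(b64_digest, validate=True)
    return {algo: digest.hex()}


# Usage: build an integrity string for some bytes, then decode it again.
payload = b"dummy tarball bytes"
integrity = "sha256-" + base64.b64encode(hashlib.sha256(payload).digest()).decode()
checksums = integrity_to_checksums(integrity)
```

The lister would only do this decoding; the loader would still be the one comparing the resulting checksums against the downloaded tarball.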

swh/loader/core/loader.py
658–693

oh yeah, i misunderstood! Neat idea!

vlorentz added inline comments.
swh/loader/core/loader.py
658–693

you could reuse the original-artefacts format, while you're at it! ;)

658–693

hmm actually no, it would only be its checksums field

swh/loader/core/loader.py
658–693

yes, ok, that makes sense.

But i'd rather simplify this in a future diff which touches both the directory and content loaders.

swh/loader/core/loader.py
658–693

fwiw, the current nixguix lister diff [1] is now sending 'checksums' instead of the integrity field.

I'm going to adapt the new loaders to read that early next week.

[1] D8341

swh/loader/core/loader.py
658–693

The Directory and Content loaders now deal with "checksums" instead of "integrity" in the subsequent (amended) diff [2], built on the current loader implementations (including this one).

[2] D8587

This revision is now accepted and ready to land. Oct 3 2022, 11:49 AM

Build is green

Patch application report for D8584 (id=31056)

Rebasing onto 5482a48ea1...

Current branch diff-target is up to date.
Changes applied before test
commit dbf7f3dca0c8c2b9c364bdcdf19481ecf8421b77
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/941/ for more details.