Add Content Loader to ingest raw content file
Closed, Public

Authored by ardumont on Sep 29 2022, 4:21 PM.

Details

Summary

In some marginal listing cases (Nix or Guix for now), we can receive raw files to ingest.
This commit adds a loader to ingest those. The output of the ingestion is a snapshot
with a single HEAD branch targeting the ingested file content.

This expects to receive a mandatory 'integrity' field, which is used to check that the
content matches the declaration.
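
For readers following the review, here is a minimal sketch of how such an integrity check could look. It assumes the field follows the SRI hash-expression form (`<hash-algo>-<base64 digest>`) discussed in the inline comments, and it ignores any trailing options. The names `parse_integrity` and `check_integrity` are hypothetical and are not taken from the diff.

```python
import base64
import hashlib
from typing import Tuple

# Algorithms allowed by the grammar quoted in the review (lowercase only).
SUPPORTED_ALGOS = ("sha256", "sha384", "sha512")


def parse_integrity(integrity: str) -> Tuple[str, bytes]:
    """Split '<algo>-<base64 digest>' into (algo, raw digest bytes).

    Hypothetical helper: the real loader may parse this differently.
    """
    algo, sep, b64_digest = integrity.partition("-")
    if not sep or algo not in SUPPORTED_ALGOS:
        raise ValueError(f"unsupported integrity value: {integrity!r}")
    return algo, base64.b64decode(b64_digest, validate=True)


def check_integrity(data: bytes, integrity: str) -> bool:
    """Return True when the digest of `data` matches the declared one."""
    algo, expected = parse_integrity(integrity)
    return hashlib.new(algo, data).digest() == expected
```

Parsing the declared value once and comparing raw digests keeps the check independent of how the content was retrieved.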

This can also optionally receive a list of mirror URLs in case the main origin URL is no
longer available. Those mirror URLs are used solely as a fallback to retrieve the content.
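
The fallback behaviour can be sketched as follows. This is an illustration under stated assumptions, not the loader's code: `fetch_with_fallback` and its `fetch` callable are hypothetical, and the sketch assumes a failed download surfaces as an `OSError` (which is the case for `urllib.error.URLError`).

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    origin_url: str,
    mirror_urls: Iterable[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try the origin URL first, then each mirror in order.

    `fetch` is a caller-supplied download function (hypothetical here);
    it must raise OSError when a URL cannot be retrieved.
    """
    last_error: Optional[OSError] = None
    for url in [origin_url, *mirror_urls]:
        try:
            return fetch(url)
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    raise RuntimeError(f"no reachable URL for {origin_url}") from last_error
```

The origin URL stays the identity of what is ingested; the mirrors only change where the bytes come from.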

Note: For the integrity field, some future adaptations will be needed in this code.
They are kept out of the scope of this diff to avoid depending on a new release
of the model [1].

Related to T3781
Supersedes D8406

[1] D8582

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8581 (id=30956)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add Content Loader to ingest raw content file
Changes applied before test
commit 75e8a22f220083d9d4a3c1341ed5d882849f7b86
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive files to ingest.
    This creates a loader to ingest those. The output of the ingestion is a snapshot with 2
    branches, one targetting the file ingested whose branch name is the filename. The other
    is an alias branch (matching what's done in other package loader).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also receive a list of mirror urls in case the main origin url is no longer
    available.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/914/ for more details.

please document which terminal of the grammar you are aiming for. (the current implementation is hash-expression, but I don't know if that's intentional)

Could you use a smaller test file? That one is really big...

swh/loader/core/loader.py
655–661

why not like this?

671

might work

688

And why .lower()? The grammar in the spec is:

hash-algo         = "sha256" / "sha384" / "sha512"

700

I don't understand what this means

708

Doesn't work with trailing slashes:

>>> os.path.basename("https://sh.rustup.rs/")
''
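
To make the trailing-slash issue concrete, here is a small sketch of a URL-aware basename. `url_basename` is a hypothetical helper, not something from the diff. It handles a trailing slash after a filename, but a bare root URL such as the one above still has no filename to return.

```python
import posixpath
from urllib.parse import urlparse


def url_basename(url: str) -> str:
    """Return the last path segment of a URL, or '' when there is none."""
    # Drop trailing slashes so 'dir/file.txt/' still yields 'file.txt'.
    path = urlparse(url).path.rstrip("/")
    return posixpath.basename(path)
```

Since some origins have no usable filename at all, a branch name derived from the URL cannot be relied on in every case.
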
712–715
717

that's the right one

723

> please document which terminal of the grammar you are aiming for. (the current implementation is hash-expression, but I don't know if that's intentional)

It is intentional, yes.
I'll add a link to the grammar.

> Could you use a smaller test file? That one is really big...

Well, that's the sole one i found.

swh/loader/core/loader.py
688

sure, i recalled having seen some in upper case. simpler if no need for it.

>> Could you use a smaller test file? That one is really big...
>
> Well, that's the sole one i found.

any file would do, you don't need to get one from Guix/Nix

olasd added inline comments.
swh/loader/core/loader.py
708

This makes me think that we should just keep the full URI as origin, and just have the snapshot HEAD point directly to the content object.

>>> Could you use a smaller test file? That one is really big...
>>
>> Well, that's the sole one i found.
>
> any file would do, you don't need to get one from Guix/Nix

heh, right.

ardumont edited the summary of this revision. (Show Details)
ardumont marked 5 inline comments as done.

Address review

swh/loader/core/loader.py
655–661

+1

708

done as olasd suggested here and dropped it.

Build has FAILED

Patch application report for D8581 (id=30978)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add Content Loader to ingest raw content file
Changes applied before test
commit 26d3ad52aa8c6e1223d4d0b0e3609c198bf46c7b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/916/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/916/console

Fix docstring which failed the build!

Build is green

Patch application report for D8581 (id=30979)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add Content Loader to ingest raw content file
Changes applied before test
commit 2aca780a73de24ecf7ff9227e43513acb0fb0357
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/917/ for more details.

This revision is now accepted and ready to land. Sep 29 2022, 7:10 PM

@vlorentz I should have started with this... from the nixguix manifest, the integrity is for now only sha256... [1]
So not sure we need to touch the model after all [2], especially since that diff got a tad bigger since you reviewed it...

[2] D8582

$ jq . /var/tmp/sources.json | grep -c sha256
13629
$ jq . /var/tmp/sources.json | grep -c sha384
0
$ jq . /var/tmp/sources.json | grep -c sha512
0

Although sha512 is used in the nixpkgs manifest...

$ jq . /var/tmp/sources-unstable.json | grep -c sha256
58036
$ jq . /var/tmp/sources-unstable.json | grep -c sha384
0
$ jq . /var/tmp/sources-unstable.json | grep -c sha512
8162

Compute expected checksum to check integrity outside the loop

Build is green

Patch application report for D8581 (id=30983)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add Content Loader to ingest raw content file
Changes applied before test
commit 6436e2304d37812839870562f447895768d4c4a5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/919/ for more details.

Build is green

Patch application report for D8581 (id=30984)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add Content Loader to ingest raw content file
Changes applied before test
commit 32524ef0c03e677dbd60ec9d7aec7626c4a5322d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/920/ for more details.

swh/loader/core/loader.py
671

it works!

Build is green

Patch application report for D8581 (id=30992)

Rebasing onto 6299c091ec...

Current branch diff-target is up to date.
Changes applied before test
commit f774aba59e65bd3e5dd0ba9364840d8903d5706c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 16:14:43 2022 +0200

    Add Content Loader to ingest raw content file
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw file to ingest.
    This commit adds a loader to ingest those. The output of the ingestion is a snapshot
    with 1 branch, one HEAD branch targetting the file content ingested.
    
    This expects to receive a mandatory 'integrity' field. It is used to check the content
    match the declaration.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the content.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/925/ for more details.