Details

Commits

rDLDBASE0ff6cdedf0fc: nixguix: add the integrity attribute in release metadata

Summary

Snapshot branch names were the urls of the sources. The url is used for
incremental loading: if a url already exists in the last
snapshot, it is not downloaded again.

Using the url as key for this cache has several drawbacks:

- if the upstream source is upgraded in place, the snapshot will reuse the previous version of the upstream source.
- when the loader supports url mirrors, using the url as branch name will be confusing (we won't know which url to use)

Moreover, this integrity attribute will be useful to get artifacts
from SWH once a metadata API is available.
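
To make the first drawback listed in this summary concrete, here is a minimal Python sketch. The function `sources_to_download`, the dictionary layout, and the example url and hashes are all hypothetical and only serve as illustration; they are not taken from the loader's code.

```python
# Hypothetical sketch: names and data shapes are illustrative only,
# not the actual nixguix loader API.

def sources_to_download(upstream, previous, key):
    """Return the upstream sources whose cache key is absent from the previous snapshot."""
    known = {src[key] for src in previous}
    return [src for src in upstream if src[key] not in known]


# State recorded during the previous visit.
previous = [{"url": "https://example.org/foo.tar.gz", "integrity": "sha256-AAAA"}]

# Upstream replaced the tarball in place: same url, different content hash.
upstream = [{"url": "https://example.org/foo.tar.gz", "integrity": "sha256-BBBB"}]

by_url = sources_to_download(upstream, previous, key="url")
by_integrity = sources_to_download(upstream, previous, key="integrity")
```

Under these assumptions, `by_url` is empty (the changed tarball is silently skipped), while `by_integrity` contains the updated source, so it would be fetched again.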

Related to T1991

Diff Detail

Repository

rDLDBASE Generic VCS/Package Loader

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

lewo created this revision. · Mar 27 2020, 3:52 PM

Herald added a reviewer: Reviewers. · Mar 27 2020, 3:52 PM

lewo added a reviewer: ardumont. · Mar 27 2020, 3:53 PM

Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/392/ for more details.

Harbormaster completed remote builds in B11420: Diff 10341. · Mar 27 2020, 3:54 PM

nixguix: rename the test file

Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/393/ for more details.

Harbormaster completed remote builds in B11432: Diff 10353. · Mar 27 2020, 4:40 PM

ardumont edited the summary of this revision. · Mar 27 2020, 5:56 PM

Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/394/ for more details.

I'm not too sure about making the branch name an opaque sha256. It's not great from an "end user browsing the archive" point of view.

The approach we've taken with other loaders that have upstream-provided integrity information (e.g. PyPI) is a bit different:

- we use human-readable version numbers as branch names
- we store the hashes of the downloaded artifacts in the revision metadata

When we download an origin again:

- we retrieve the list of hashes from upstream
- we retrieve all revisions from the previous snapshot
- we compare the hashes between upstream metadata and already loaded revisions
- we populate the (new) snapshot with objects from the previous snapshot which match the current upstream metadata
- we only download artifacts for which the upstream hash doesn't match anything we have loaded so far

This keeps the snapshot "human-readable", while avoiding multiple downloads of stuff we know is (should be) the same between runs.
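
The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `incremental_visit`, the plain dictionaries standing in for upstream metadata and for the hashes stored in revision metadata, and the `load` callback are hypothetical names, not the actual swh-loader-core interfaces.

```python
from typing import Callable, Dict

# Hypothetical sketch of the hash-based incremental visit described above.
# None of these names come from swh-loader-core.


def incremental_visit(
    upstream: Dict[str, str],         # branch name (version) -> upstream artifact hash
    known_revisions: Dict[str, str],  # artifact hash -> revision id from the previous snapshot
    load: Callable[[str], str],       # downloads and loads one version, returns its revision id
) -> Dict[str, str]:
    """Build the new snapshot, downloading only artifacts with unknown hashes."""
    snapshot = {}
    for version, artifact_hash in upstream.items():
        revision_id = known_revisions.get(artifact_hash)
        if revision_id is None:
            # No previously loaded revision carries this hash: download it.
            revision_id = load(version)
        snapshot[version] = revision_id
    return snapshot


downloaded = []


def load(version: str) -> str:
    downloaded.append(version)
    return f"rev-for-{version}"


known = {"h1": "rev-1"}                # collected from the previous snapshot's revisions
upstream = {"1.0": "h1", "1.1": "h2"}  # current upstream metadata
snapshot = incremental_visit(upstream, known, load)
```

In this sketch the branch names stay human-readable (`1.0`, `1.1`), and only `1.1` is downloaded because its hash matches nothing loaded so far.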

The main drawback of that approach is that you can't easily query the revision by hash. For now, this is something that you'd need to do in a specific cache. Querying objects by arbitrary metadata is something that we want to support eventually, but we're not there yet.

I think this also solves the "mirrors" problem: once you've downloaded an artifact with the expected hash, you can just have as many branches as you want pointing to that same object. This is what happens in the Debian loader: if several suites have the same version of a package (identified by the set of hashes of the metadata files), we load that version once, then have several pointers to it in our snapshot.
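
As an illustration of the mirror case, the sketch below tries each mirror until the downloaded content matches the expected hash, then points every branch at the same object. Everything here is an assumption made for the example: the helper names `fetch_from_mirrors` and `branches_for`, the use of a hex SHA-256 digest as the object identifier, and the in-memory dictionary that stands in for real HTTP downloads.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional

# Hypothetical sketch: helper names and data shapes are illustrative only.


def fetch_from_mirrors(
    urls: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Return the first payload whose SHA-256 matches the expected hash, else None."""
    for url in urls:
        payload = fetch(url)
        if payload is not None and hashlib.sha256(payload).hexdigest() == expected_sha256:
            return payload
    return None


def branches_for(names: Iterable[str], object_id: str) -> Dict[str, str]:
    """Point every branch name at the same, already loaded object."""
    return {name: object_id for name in names}


content = b"release tarball"
expected = hashlib.sha256(content).hexdigest()

# In-memory stand-in for two mirrors; the first one serves a bad file.
mirrors = {
    "https://mirror-a.example/foo.tar.gz": b"corrupted",
    "https://mirror-b.example/foo.tar.gz": content,
}

payload = fetch_from_mirrors(list(mirrors), expected, mirrors.get)
snapshot = branches_for(list(mirrors), expected)
```

Here the artifact is accepted from whichever mirror serves the expected content, and both branches end up referring to one object.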

Build is green

I re-triggered a build to check whether the great new improvements that @olasd installed would work on the diff.
Unfortunately, they did not ;)

Just to mention that, technically, we usually use something like [1] to define the identity of an artifact.
The same could be done here.

I think the mechanism used in the archive loader works only because it deals with artifacts whose metadata is stored in the revision, though.
Since the metadata is not stored in the revision for all files this loader will ingest (patch, gem, etc.),
no comparison is possible on subsequent visits for those files, so the approach will be limited.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/archive/loader.py$95-96

In D2907#70414, @ardumont wrote:

> Build is green
>
> I re-triggered a build to check whether the great new improvements that @olasd installed would work on the diff.
> Unfortunately, they did not ;)

The changes I introduced to CI need patch submitters to run arc diff --update so their changes are pushed to the staging area. Triggering the old job again won't do that.

> Just to mention that, technically, we usually use something like [1] to define the identity of an artifact.
> The same could be done here.
>
> I think the mechanism used in the archive loader works only because it deals with artifacts whose metadata is stored in the revision, though.
> Since the metadata is not stored in the revision for all files this loader will ingest (patch, gem, etc.),
> no comparison is possible on subsequent visits for those files, so the approach will be limited.
>
> [1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/archive/loader.py$95-96

I'm not sure that we should be running the "load arbitrary files" part of this loader in production until we have a way to store arbitrary metadata on release objects. Once we have that then all objects can be handled the same, I suppose.

Thanks for your review. I'm applying your suggestion, which looks nice.

> Querying objects by arbitrary metadata is something that we want to support eventually, but we're not there yet.

Hm, this feature would be really nice to have in my context ;)

> I think this also solves the "mirrors" problem: once you've downloaded an artifact with the expected hash, you can just have as many branches as you want pointing to that same object. This is what happens in the Debian loader: if several suites have the same version of a package (identified by the set of hashes of the metadata files), we load that version once, then have several pointers to it in our snapshot.

Yes, it solves it.

No longer use the integrity as the branch name; use the url instead.

Build is green

Patch application report for D2907 (id=10376)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.

Changes applied before test

commit e73873b9b4953af916c375ad61fdc9b87bdfd588
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    
    The integrity attribute is also used by the incremental loading
    feature.
    
    Before, the url was used for incremental loading: if an url is already
    existing in the last snapshot, this url is not downloaded again.
    
    Using the url as key for this cache has several drawbacks:
    - if the upstream source is upgraded in place, the snapshot will reuse
      the previous version of the upstream source.
    - when the loader supports url mirrors, using the url as branch name will
      be confusing (we won't know which url to use)
    
    The integrity attribute is now used as key (instead of the url).
    
    Moreover, this integrity attribute will be useful to get artifacts
    from SWH once a metadata API will be available.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/3/ for more details.

Harbormaster completed remote builds in B11451: Diff 10376. · Mar 30 2020, 12:38 PM

Reduce commit diff :/

Build is green

Patch application report for D2907 (id=10377)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.

Changes applied before test

commit ae41eaa51da65031a0e8ab7cea213f03c5d42879
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/4/ for more details.

Harbormaster completed remote builds in B11452: Diff 10377. · Mar 30 2020, 12:41 PM

@olasd Another drawback is that I would need to make one API call per release in order to list all integrity attributes of a snapshot. Even with a cache on the client side, the first run would require ~15000 calls to the SWH API. Do you have rate limiting on these API endpoints?

The implementation looks fine to me.

Just to be sure, could you check the following (please ;)

I can't shake the feeling that something is missing in the tests,
all the more so because the tooling agrees.

Particularly around the incremental visit approach.
The most appropriate test to check seems to be test_loader_incremental.
Can you check that the 2 loader instantiations done there are correct?
The test seems to pass for the wrong reason (hence the code coverage miss).

I think removing this double instantiation will make the test pass with the right coverage this time.

Cheers,

swh/loader/package/nixguix/loader.py
line 82: I would expect the coverage to currently be green here...
swh/loader/package/nixguix/tests/test_functional.py
lines 90–91: Why do we use 2 loaders again? Can you please remove the second one and see what happens?

This revision now requires changes to proceed. · Mar 31 2020, 11:38 AM

> The changes I introduced to CI need patch submitters to run arc diff --update so their changes are pushed to the staging area. Triggering the old job again won't do that.

Thanks, by the way ;)

Remove the second loader instantiation in test_loader_incremental.

lewo marked 2 inline comments as done. · Mar 31 2020, 12:01 PM

Build is green

Patch application report for D2907 (id=10409)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.

Changes applied before test

commit 0ff6cdedf0fcf89e4e2e8dc331fdb46aefb41be0
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/5/ for more details.

Harbormaster completed remote builds in B11483: Diff 10409. · Mar 31 2020, 12:01 PM

lewo added inline comments. · Mar 31 2020, 12:12 PM

swh/loader/package/nixguix/loader.py
line 82: I put a breakpoint on line 82 locally and it was reached, so this code is executed by the test `test_loader_incremental`. It seems to be a false positive of the coverage tooling; I don't know what else I could do.

ok then.

This revision is now accepted and ready to land. · Mar 31 2020, 1:09 PM

Closed by commit rDLDBASE0ff6cdedf0fc: nixguix: add the integrity attribute in release metadata (authored by lewo). · Apr 2 2020, 10:41 AM

This revision was automatically updated to reflect the committed changes.

lewo added a commit: rDLDBASE0ff6cdedf0fc: nixguix: add the integrity attribute in release metadata.

nixguix: use the integrity attribute as snapshot branch name
Closed, Public


Revision Contents
Changeset List

Diff 10470

swh/loader/package/nixguix/loader.py

swh/loader/package/nixguix/tests/data/https_nix-community.github.io/nixpkgs-swh_sources-EOFError.json

swh/loader/package/nixguix/tests/data/https_nix-community.github.io/nixpkgs-swh_sources.json

swh/loader/package/nixguix/tests/data/https_nix-community.github.io/nixpkgs-swh_sources.json_visit1

swh/loader/package/nixguix/tests/test_functional.py
