Page MenuHomeSoftware Heritage

nixguix: use the integrity attribute as snapshot branch name
ClosedPublic

Authored by lewo on Mar 27 2020, 3:52 PM.

Details

Summary

Snapshot branch names were the url of sources. This url is used for
incremental loading: if an url is already existing in the last
snapshot, this url is not downloaded again.

Using the url as key for this cache has several drawbacks:

  • if the upstream source is upgraded in place, the snapshot will reuse the previous version of the upstream source.
  • when the loader supports url mirrors, using the url as branch name will be confusing (we won't know which url to use)

Moreover, this integrity attribute will be useful to get artifacts
from SWH once a metadata API will be available.

Related to T1991

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
nixguix-integrity
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11483
Build 17406: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 17405: arc lint + arc unit

Event Timeline

  • nixguix: rename the test file

I'm not too sure about making the branch name be an opaque sha256. It's not great from a "end user browsing the archive" point of view.

The approach we've taken with other loaders that have upstream-provided integrity information (e.g. PyPI) is a bit different:

  • we use human-readable version numbers as branch names
  • we store the hashes of the downloaded artifacts in the revision metadata

When we download an origin again

  • we retrieve the list of hashes from upstream
  • we retrieve all revisions from the previous snapshot
  • we compare the hashes between upstream metadata and already loaded revisions
  • we populate the (new) snapshot with objects from the previous snapshot which match the current upstream metadata
  • we only download artifacts for which the upstream hash doesn't match anything we have loaded so far

This keeps the snapshot "human-readable", while avoiding multiple downloads of stuff we know is (should be) the same between runs.

The main drawback of that approach is that you can't easily query the revision by hash; For now, this is something that you'd need to do in a specific cache. Querying objects by arbitrary metadata is something that we want to support eventually, but we're not there yet.

I think this also solves the "mirrors" problem: once you've downloaded an artifact with the expected hash, you can just have as many branches as you want pointing to that same object. This is what happens in the Debian loader: if several suites have the same version of a package (identified by the set of hashes of the metadata files), we load that version once, then have several pointers to it in our snapshot.

Build is green

I triggered a build back to check whether the great new improvments that @olasd installed would have worked on the diff.
That did not unfortunately ;)


Just to mention that technically, we usually use something like [1] to define the identity of an artifact.
Which could be the same here.

I think the mechanism used in the archive loader works because it only deals with artifact whose metadata are stored in the revision though.
As the metadata is not stored in the revision for all files this loader will ingest (patch, gem, etc...).
There is no comparison possible on next visits. So that will be limited though.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/archive/loader.py$95-96

Build is green

I triggered a build back to check whether the great new improvments that @olasd installed would have worked on the diff.
That did not unfortunately ;)

The changes I introduced to CI need patch submitters to run arc diff --update so their changes are pushed to the staging area. Triggering the old job again won't do that.


Just to mention that technically, we usually use something like [1] to define the identity of an artifact.
Which could be the same here.

I think the mechanism used in the archive loader works because it only deals with artifact whose metadata are stored in the revision though.
As the metadata is not stored in the revision for all files this loader will ingest (patch, gem, etc...).
There is no comparison possible on next visits. So that will be limited though.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/archive/loader.py$95-96

I'm not sure that we should be running the "load arbitrary files" part of this loader in production until we have a way to store arbitrary metadata on release objects. Once we have that then all objects can be handled the same, I suppose.

Thanks for your review. I'm applying your suggestion which looks nice.

Querying objects by arbitrary metadata is something that we want to support eventually, but we're not there yet.

Hm, this feature would be really nice to have in my context ;)

I think this also solves the "mirrors" problem: once you've downloaded an artifact with the expected hash, you can just have as many branches as you want pointing to that same object. This is what happens in the Debian loader: if several suites have the same version of a package (identified by the set of hashes of the metadata files), we load that version once, then have several pointers to it in our snapshot.

Yes, it solves it.

No longer use the integrity as branch name but the url instead.

Build is green

Patch application report for D2907 (id=10376)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.
Changes applied before test
commit e73873b9b4953af916c375ad61fdc9b87bdfd588
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    
    The integrity attribute is also used by the incremental loading
    feature.
    
    Before, the url was used for incremental loading: if an url is already
    existing in the last snapshot, this url is not downloaded again.
    
    Using the url as key for this cache has several drawbacks:
    - if the upstream source is upgraded in place, the snapshot will reuse
      the previous version of the upstream source.
    - when the loader supports url mirrors, using the url as branch name will
      be confusing (we won't know which url to use)
    
    The integrity attribute is now used as key (instead of the url).
    
    Moreover, this integrity attribute will be useful to get artifacts
    from SWH once a metadata API will be available.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/3/ for more details.

Build is green

Patch application report for D2907 (id=10377)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.
Changes applied before test
commit ae41eaa51da65031a0e8ab7cea213f03c5d42879
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    
    The integrity attribute is also used by the incremental loading
    feature.
    
    Before, the url was used for incremental loading: if an url is already
    existing in the last snapshot, this url is not downloaded again.
    
    Using the url as key for this cache has several drawbacks:
    - if the upstream source is upgraded in place, the snapshot will reuse
      the previous version of the upstream source.
    - when the loader supports url mirrors, using the url as branch name will
      be confusing (we won't know which url to use)
    
    The integrity attribute is now used as key (instead of the url).
    
    Moreover, this integrity attribute will be useful to get artifacts
    from SWH once a metadata API will be available.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/4/ for more details.

@olasd Another drawback is I would need to make one API call per release in order to list all integrity attributes of a snapshot. Even with a cache on the client side, the first run would require ~15000 calls to the SWH API. Do you have a rate limiting on these API endpoints?

The implementation looks fine to me.

For the sake of checking the following (please ;)

I can't shake the feeling something is missing in tests.
All the more reasons because the tooling agrees.

Particularly around the incremental visit approach.
The most appropriate test to check seems to be the test_loader_incremental.
Can you check the 2 loader instanciations done there is correct?
The test seems to make the test pass for the wrong reason (thus the code coverage miss).

I think removing this double instanciation will make the test pass with the right coverage this time.

Cheers,

swh/loader/package/nixguix/loader.py
82

I would expect the coverage to currently be green here...

swh/loader/package/nixguix/tests/test_functional.py
90–91

Why do we use 2 loaders again?
Can you please remove it and see what happens?

This revision now requires changes to proceed.Mar 31 2020, 11:38 AM

The changes I introduced to CI need patch submitters to run arc diff --update so their changes are pushed to the staging area. Triggering the old job again won't do that.

thanks by the way ;)

Remove the second loader instanciation in test_loader_incremental.

Build is green

Patch application report for D2907 (id=10409)

Rebasing onto 856cf702be...

Current branch diff-target is up to date.
Changes applied before test
commit 0ff6cdedf0fcf89e4e2e8dc331fdb46aefb41be0
Author: Antoine Eiche <lewo@abesis.fr>
Date:   Fri Mar 27 15:06:18 2020 +0100

    nixguix: add the integrity attribute in release metadata
    
    The integrity attribute is also used by the incremental loading
    feature.
    
    Before, the url was used for incremental loading: if an url is already
    existing in the last snapshot, this url is not downloaded again.
    
    Using the url as key for this cache has several drawbacks:
    - if the upstream source is upgraded in place, the snapshot will reuse
      the previous version of the upstream source.
    - when the loader supports url mirrors, using the url as branch name will
      be confusing (we won't know which url to use)
    
    The integrity attribute is now used as key (instead of the url).
    
    Moreover, this integrity attribute will be useful to get artifacts
    from SWH once a metadata API will be available.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/5/ for more details.

swh/loader/package/nixguix/loader.py
82

I locally put a breakpoint line 82 and the breakpoint has been reached. So, this code is executed by the test test_loader_incremental.
It seems to be a false positive of the coverage tooling. I don't know what I could do then.

This revision is now accepted and ready to land.Mar 31 2020, 1:09 PM