
Allow more checksum computations in Content model
Abandoned · Public

Authored by ardumont on Sep 29 2022, 5:57 PM.

Details

Summary

This does not change the default behavior.

  • Content: Allow retrieving more hashes than the default ones, but only if they are supported by the model.
  • hashutil: Open up more checksum algorithms, namely the ones supported by the integrity field [1] (see the sketch after the reference below). This will be needed for the future new content/directory loaders.

[1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo
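
For context, an SRI integrity value is the algorithm name followed by the base64-encoded digest. A minimal sketch using only the standard library (the helper name is mine, not part of this diff):

```python
import base64
import hashlib

def integrity(data: bytes, algo: str = "sha512") -> str:
    # The SRI grammar [1] allows sha256, sha384 and sha512.
    digest = hashlib.new(algo, data).digest()
    return f"{algo}-{base64.b64encode(digest).decode('ascii')}"

# integrity(b"hello") starts with "sha512-m3HSJL1i"
```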

Related to T3781

Test Plan

I'll check the coverage and maybe add some missing tests in due time.

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31946
Build 49998: Phabricator diff pipeline on jenkins
Build 49997: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8582 (id=30975)

Rebasing onto 2d65a24a5f...

Current branch diff-target is up to date.
Changes applied before test
commit 50d859587a7b33d3a78795881d6480ef38749118
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:54:35 2022 +0200

    hashutil: Open more checksum algorithms
    
    Those are the ones supported by the integrity field [1]. This will be needed for the
    future new content/directory loaders.
    
    [1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo
    
    Related to T3781

commit 39ddb6947b0fb8d590a72f0046b6b4f3e2627230
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:48:38 2022 +0200

    Content: Allow to get more hash than the default ones
    
    Only if it's supported by the model.
    
    Related to D8581

commit b4900f3d327a685a53cb31ed0cbc64ebce2a2e0f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:44:21 2022 +0200

    Allow more checksum computations in Content model
    
    This does not change the default behavior.
    
    Related to D8581

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/532/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/model/model.py
1324

Please, that error message means nothing out of context.

This revision is now accepted and ready to land. · Sep 29 2022, 6:21 PM

Adding tests revealed that it was missing some things...

Build is green

Patch application report for D8582 (id=30980)

Rebasing onto 2d65a24a5f...

Current branch diff-target is up to date.
Changes applied before test
commit f92f6e8fdd191e9fd99e07e78ade90877bad425b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:54:35 2022 +0200

    hashutil: Open more checksum algorithms
    
    This adds the ones supported by the integrity field [1]. This will be needed for the
    future new content/directory loaders.
    
    [1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo
    
    Related to T3781

commit f816635e94272cb7aa13b5e204f97acb8f99eac4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:48:38 2022 +0200

    Content: Allow to get more hash than the default ones
    
    Only if it's supported by the model.
    
    Related to D8581

commit 4c139cfaafcf7ce8b3c6527fdf2928ea226dfaed
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:44:21 2022 +0200

    Allow more checksum computations in Content model
    
    This does not change the default behavior. It just allows to have a content with a bit
    more checksums than the default one.
    
    Related to D8581

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/534/ for more details.

anlambert added inline comments.
swh/model/hashutil.py
68

Awesome, I also need sha512 for some package loader improvements I am working on :-)

ardumont added inline comments.
swh/model/hashutil.py
68

Cool. I'm wondering whether we want to open the old hashes too, though.
I see a need here for sha384 and sha512 (your need and mine, for the "integrity" field).
I'm not sure about the others; maybe for old compatibility things...

Any opinion on your side?

swh/model/hashutil.py
68

What does "open the old hashes" mean?

> I see a need here for sha384 and sha512 (your need and mine, for the "integrity" field).

Do we actually have sources of SRIs with only sha384 or only sha512?

olasd requested changes to this revision. · Sep 30 2022, 10:42 AM
olasd added a subscriber: olasd.

I don't agree with the idea of adding additional hashes, which will not be stored, to the Content model object. Model objects should map 1:1 with what is stored in the archive.

Why are the changes to the model object needed, instead of just hashing the file directly?

This revision now requires changes to proceed. · Sep 30 2022, 10:42 AM
swh/model/hashutil.py
68

> What does "open the old hashes" mean?

The ones already present in ALGORITHMS prior to this diff.
md5 is also an old checksum we never used, AFAIR.
We had the blake ones because at some point they were in the storage db (and we had not yet chosen which one to keep in the db).

> Do we actually have sources of SRIs with only sha384 or only sha512?

D8581#223108 shows sha512. @anlambert also mentions the need for sha512.
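
To make the discussion concrete, here is a rough sketch of what "opening" more algorithms in hashutil could look like. The pre-existing contents of ALGORITHMS are an assumption inferred from this thread (md5 and the blake2 variants are mentioned as already present), not a verbatim copy of the code:

```python
# Hypothetical pre-existing set, inferred from the discussion above.
ALGORITHMS = set(
    ["sha1", "sha1_git", "sha256", "blake2s256", "blake2b512", "md5"]
)

# "Opening" the algorithms supported by the SRI integrity field:
# sha256 is already present, so only sha384 and sha512 are new.
ALGORITHMS |= {"sha384", "sha512"}
```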


The thing is, in the current implementation I'm forced to open all attribute fields in the (Skipped)Content model now.

> Why are the changes to the model object needed, instead of just hashing the file directly?

You mean using MultiHash directly? Yes, that feels more and more like the right way to do it.

It was initially meant to compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

In D8582#223177, @olasd wrote:

> Why are the changes to the model object needed, instead of just hashing the file directly?

I agree the change to Content is not needed; we just need MultiHash to support it.

(Actually, that's not even needed, because the content loader does not stream content, so it could call hashlib directly.)
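
As a rough illustration of that last point (a sketch only; the function and parameter names here are hypothetical, not the loader's actual code):

```python
import hashlib

from swh.model.model import Content

def load_one_content(data: bytes, expected_sha512: bytes) -> Content:
    # The content loader has the whole payload in memory (no streaming),
    # so the extra hash can be computed directly with hashlib, outside
    # the model object.
    if hashlib.sha512(data).digest() != expected_sha512:
        raise ValueError("sha512 mismatch")
    # The model object keeps only the hashes that are actually stored.
    return Content.from_data(data)
```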

> Model objects should map 1:1 with what is stored in the archive.

You are right, that slipped my mind.

> It was initially meant to compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

Then use MultiHash directly in the Content loader, instead of Content.from_data.

> > Why are the changes to the model object needed, instead of just hashing the file directly?
>
> You mean using MultiHash directly? Yes, that feels more and more like the right way to do it.
>
> It was initially meant to compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

Ack, I understand. I think this is a bit of an edge case, so it's fine to read the file again to compute more hashes (it'll probably be on a tmpfs anyway).

> > It was initially meant to compute that new hash at the same time as the standard ones we store, but that's getting out of hand.
>
> Then use MultiHash directly in the Content loader, instead of Content.from_data.

Yeah, I guess that would work too (building a Content object out of a MultiHash, manually).
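
A minimal sketch of that manual construction, assuming MultiHash.from_data computes the default algorithms, whose names match the Content model fields:

```python
from swh.model.hashutil import MultiHash
from swh.model.model import Content

def content_from_multihash(data: bytes) -> Content:
    # digest() returns a dict mapping each algorithm name to its raw
    # digest; the default names (sha1, sha1_git, sha256, blake2s256)
    # match the Content model fields, so they can be splatted in.
    digests = MultiHash.from_data(data).digest()
    return Content(length=len(data), data=data, status="visible", **digests)
```

If memory serves, this is essentially what Content.from_data does internally; doing it by hand just leaves room to compute extra hashes alongside, without putting them on the model object.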

Thanks! I'll close this.

I'm going to tackle the gist of the DirectoryLoader first (the same way the content loader does it); that should cover the 80% of the 80%/20% split.

Then I'll do an extra diff about computing the right integrity field for the potentially unsupported hashes (the remaining 20%).