
Allow more checksum computations in Content model
Abandoned · Public

Authored by ardumont on Sep 29 2022, 5:57 PM.

Details

Summary

This does not change the default behavior.

  • Content: Allow computing more hashes than the default ones, but only those supported by the model.
  • hashutil: Open up more checksum algorithms, namely the ones supported by the integrity field [1]. This will be needed for the upcoming content/directory loaders; a sketch of the integrity computation follows below.

[1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo

Related to T3781
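
For context, per the grammar in [1], an integrity value has the form <algo>-<base64 digest>, with sha256, sha384 and sha512 as the admissible algorithms. A minimal sketch of computing one with the standard library only (the compute_integrity helper is hypothetical, not part of swh.model):

    import base64
    import hashlib

    # Hash algorithms admitted by the SRI grammar [1].
    SRI_ALGORITHMS = {"sha256", "sha384", "sha512"}

    def compute_integrity(data: bytes, algo: str = "sha256") -> str:
        """Return an SRI value such as 'sha256-<base64-encoded digest>'."""
        if algo not in SRI_ALGORITHMS:
            raise ValueError(f"{algo}: not a valid SRI hash algorithm")
        digest = hashlib.new(algo, data).digest()
        return f"{algo}-{base64.b64encode(digest).decode('ascii')}"

For example, compute_integrity(b"hello", "sha512") yields a "sha512-..." string suitable for comparison against an upstream integrity attribute.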

Test Plan

I'll check the coverage and add any missing tests in due time.

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31951
Build 50008: Phabricator diff pipeline on Jenkins (Jenkins console)
Build 50007: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8582 (id=30975)

Rebasing onto 2d65a24a5f...

Current branch diff-target is up to date.
Changes applied before test
commit 50d859587a7b33d3a78795881d6480ef38749118
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:54:35 2022 +0200

    hashutil: Open more checksum algorithms
    
    Those are the ones supported by the integrity field [1]. This will be needed for the
    future new content/directory loaders.
    
    [1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo
    
    Related to T3781

commit 39ddb6947b0fb8d590a72f0046b6b4f3e2627230
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:48:38 2022 +0200

    Content: Allow to get more hash than the default ones
    
    Only if it's supported by the model.
    
    Related to D8581

commit b4900f3d327a685a53cb31ed0cbc64ebce2a2e0f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:44:21 2022 +0200

    Allow more checksum computations in Content model
    
    This does not change the default behavior.
    
    Related to D8581

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/532/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/model/model.py:1323

Please, that error message means nothing out of context.
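
(For illustration only, a hypothetical contrast; this is not the actual code at model.py:1323. Naming the offending value and the accepted ones makes the message meaningful on its own:)

    SUPPORTED = {"blake2s256", "sha1", "sha1_git", "sha256"}

    def check_algo(algo: str) -> None:
        # A bare "unsupported hash algorithm" says nothing; include the
        # offending value and the valid choices instead.
        if algo not in SUPPORTED:
            raise ValueError(
                f"{algo}: unsupported hash algorithm (expected one of {sorted(SUPPORTED)})"
            )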

This revision is now accepted and ready to land. Sep 29 2022, 6:21 PM

Adding tests revealed it missed stuff...

Build is green

Patch application report for D8582 (id=30980)

Rebasing onto 2d65a24a5f...

Current branch diff-target is up to date.
Changes applied before test
commit f92f6e8fdd191e9fd99e07e78ade90877bad425b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:54:35 2022 +0200

    hashutil: Open more checksum algorithms
    
    This adds the ones supported by the integrity field [1]. This will be needed for the
    future new content/directory loaders.
    
    [1] https://w3c.github.io/webappsec-subresource-integrity/#grammardef-hash-algo
    
    Related to T3781

commit f816635e94272cb7aa13b5e204f97acb8f99eac4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:48:38 2022 +0200

    Content: Allow to get more hash than the default ones
    
    Only if it's supported by the model.
    
    Related to D8581

commit 4c139cfaafcf7ce8b3c6527fdf2928ea226dfaed
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 29 17:44:21 2022 +0200

    Allow more checksum computations in Content model
    
    This does not change the default behavior. It just allows to have a content with a bit
    more checksums than the default one.
    
    Related to D8581

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/534/ for more details.

anlambert added inline comments.
swh/model/hashutil.py:69

Awesome, I also need sha512 for some package loader improvements I am working on :-)

ardumont added inline comments.
swh/model/hashutil.py:69

Cool, I'm wondering whether we want to open up the old hashes, though.
I see a need here for sha384 and sha512 (your need and mine, for the "integrity" field).
The others I'm not sure about; maybe for old compatibility things...

Any opinion on your side?

swh/model/hashutil.py:69

what does "open the old hashes" mean?

I see here a need for sha384 and sha512 (your need and mine about the "integrity" field).

Do we actually have sources of SRIs with only sha384 or only sha512?

olasd requested changes to this revision. Sep 30 2022, 10:42 AM
olasd added a subscriber: olasd.

I don't agree with the idea of adding additional hashes, which will not be stored, to the Content model object. Model objects should map 1:1 with what is stored in the archive.

Why are the changes to the model object needed, instead of just hashing the file directly?

This revision now requires changes to proceed. Sep 30 2022, 10:42 AM
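
(For illustration, "hashing the file directly" needs nothing beyond hashlib; a sketch, with chunked reads to keep memory bounded:)

    import hashlib

    def hash_file(path: str, algo: str = "sha512", chunk_size: int = 65536) -> str:
        """Hash a file on disk directly, without going through the model."""
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()
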
swh/model/hashutil.py:69

what does "open the old hashes" mean?

The ones already present in ALGORITHMS prior to this diff.
md5 is an old checksum we never used, AFAIR.
The blake ones we did use because at some point we had them in the storage DB (and we had not yet chosen which one to keep).

Do we actually have sources of SRIs with only sha384 or only sha512?

D8581#223108 shows sha512. @anlambert also mentions the need for sha512.


The thing is, in the current implementation I'm forced to open up all the attribute fields in the (Skipped)Content model.

Why are the changes to the model object needed, instead of just hashing the file directly?

You mean using MultiHash directly. Yes, that feels more and more like the right way to do it.

It was initially to just compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

In D8582#223177, @olasd wrote:

Why are the changes to the model object needed, instead of just hashing the file directly?

I agree the change to Content is not needed; we just need MultiHash to support it.

(actually, that's not even needed because the content loader does not stream content, so it could call hashlib directly)

Model objects should map 1:1 with what is stored in the archive.

You are right, that slipped my mind.

It was initially to just compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

then use MultiHash directly in the Content loader, instead of Content.from_data

Why are the changes to the model object needed, instead of just hashing the file directly?

You mean using MultiHash directly. Yes, that feels more and more like the right way to do it.

It was initially to just compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

Ack, I understand. I think this is a relatively rare edge case, so it's fine to read the file again to compute more hashes (it'll probably be on a tmpfs anyway).

It was initially to just compute that new hash at the same time as the standard ones we store, but that's getting out of hand.

then use MultiHash directly in the Content loader, instead of Content.from_data

Yeah, I guess that would work too (building a Content object out of a MultiHash, manually).
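
A minimal sketch of that approach, assuming sha512 is registered in hashutil.ALGORITHMS as this diff proposes; the extra digest stays outside the model object, which keeps only the default hashes:

    from swh.model.hashutil import DEFAULT_ALGORITHMS, MultiHash
    from swh.model.model import Content

    data = b"some raw file contents"  # hypothetical payload

    # One pass over the data computes the default hashes plus sha512.
    digests = MultiHash.from_data(data, hash_names=DEFAULT_ALGORITHMS | {"sha512"}).digest()
    sha512 = digests.pop("sha512")  # kept out of the Content object

    # Build the model object manually from the default digests only.
    content = Content(length=len(data), status="visible", data=data, **digests)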

Thanks! I'll close this.

I'm going to tackle the gist of the DirectoryLoader first (the same way the content loader does it); that should cover the 80% of the 80/20 split.

Then I'll do an extra diff about computing the right integrity field for the potentially unsupported hashes (the remaining 20%).