
hashutil: Migrate towards MultiHash api
ClosedPublic

Authored by ardumont on Sep 21 2018, 3:51 PM.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

olasd requested changes to this revision. Sep 27 2018, 1:26 PM

A few comments inline, most notably regarding the behavior of file hashing.

swh/model/from_disk.py
80–83

I don't like the verbosity of passing the length argument to from_data just to avoid one call to len(), which is a pretty minor cost compared to actually hashing the data :) However, I won't argue that this needs to change.

127–131

This should do multiple reads with a fixed chunk size rather than relying on the iterator behavior of file objects: the iterator splits the input line by line, which would make us read binary files (with no line breaks) all at once and could blow up our memory usage.

I now notice that MultiHash.from_file should do the same :)
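A minimal sketch of the chunked-read approach olasd suggests (the helper name and chunk size are illustrative, not the actual swh.model code): reading with a fixed chunk size bounds memory usage regardless of where newlines fall in the input.

```python
import io


def iter_chunks(fobj, chunk_size=1024 * 1024):
    # Read a binary file object in fixed-size chunks. Unlike iterating
    # over the file object itself (which splits on newlines), this keeps
    # memory bounded even for binary files containing no line breaks.
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Usage: a 5-byte input with a 2-byte chunk size yields 3 chunks.
chunks = list(iter_chunks(io.BytesIO(b"aaaaa"), chunk_size=2))
```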

This revision now requires changes to proceed. Sep 27 2018, 1:26 PM
zack added inline comments.
swh/model/from_disk.py
80–83

A minor +1 on not passing length here too, mostly because it opens the door to inconsistency risks: one might pass a length that is not the actual length of the data, which makes this API more dangerous than it needs to be.

swh/model/from_disk.py
80–83

I don't like the verbosity of passing the length argument to from_data to avoid one call to len()

I don't either...

A minor +1 on not passing length here too, ...

ok.

But...

...API more dangerous than it could be.

To be fair, the internal hash_* functions were already this way (cf. e.g. hash_file).
They were initially like that because you need the length for the sha1_git computation.

I was trying to find a middleground here.
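For context on why sha1_git needs the length up front: Git's object identifier hashes a header that embeds the content length before the content itself. A self-contained sketch (not the swh.model implementation):

```python
import hashlib


def sha1_git(data: bytes) -> bytes:
    # Git's object id for a blob: sha1 over the header
    # b"blob <length>\0" followed by the raw content. The length must
    # therefore be known before (or while) hashing.
    h = hashlib.sha1()
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.digest()


# The empty blob hashes to Git's well-known empty-blob id.
empty_id = sha1_git(b"").hex()
```

This is why hashing a file object is awkward: unlike bytes in memory, a stream does not generally know its own size, so the caller has to supply it.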


What I would actually like to see from the MultiHash from_* methods is a way to ask for the length (or even the raw data) computed from the content/path/file. That would be consistent with our usage: we want it more often than not in the model clients (mostly loaders).
That would be optional, though.

For example, a parameter to the *digest method?

What do you think?
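One way the floated idea could look; this is a hypothetical sketch, not the real swh.model.hashutil.MultiHash API, and the with_length parameter is exactly the kind of optional digest parameter being proposed:

```python
import hashlib


class MultiHash:
    # Hypothetical sketch: track the total length alongside the hash
    # states so callers can get it back without a second pass over the
    # data.
    def __init__(self, algos=("sha1", "sha256")):
        self.state = {a: hashlib.new(a) for a in algos}
        self.length = 0

    def update(self, chunk: bytes):
        self.length += len(chunk)
        for h in self.state.values():
            h.update(chunk)

    def digest(self, with_length=False):
        # Hypothetical flag, as floated in the discussion above: return
        # the accumulated length next to the digests on request.
        res = {a: h.digest() for a, h in self.state.items()}
        if with_length:
            res["length"] = self.length
        return res


m = MultiHash(("sha1",))
m.update(b"abc")
d = m.digest(with_length=True)
```

Keeping the length a digest-time option (rather than a constructor argument) means existing callers that only want digests are unaffected.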

127–131

Right!

  • swh.model: Do multiple reads with a fixed chunk size
swh/model/from_disk.py
80–83

Only hash_file really needs the length to be passed by the caller (Python file objects don't generally know their size); all other API calls can infer it from the source. That's also how the old API was designed, and I missed that when reviewing the new API.

I can propose a diff to hashutil to clean that up?

127–131

👌

  • swh.model.hashutil: Remove extra length parameter
This revision is now accepted and ready to land. Sep 28 2018, 11:09 AM