
hashutil: Migrate towards MultiHash api
ClosedPublic

Authored by ardumont on Sep 21 2018, 3:51 PM.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

olasd requested changes to this revision. Sep 27 2018, 1:26 PM

A few comments inline, most notably regarding the behavior of file hashing.

swh/model/from_disk.py
80–83

I don't like the verbosity of passing the length argument to from_data just to avoid one call to len(), which is a pretty minor cost compared to actually hashing the data :) However, I won't argue that this needs to change.

127–131

This should do multiple reads with a fixed chunk size rather than relying on the iterator behavior of file objects: the iterator splits the input line by line, which would make us read binary files (with no line breaks) all at once and could blow up our memory usage.

I now notice that MultiHash.from_file should do the same :)
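A minimal sketch of the chunked-read approach olasd suggests (the helper name and chunk size are illustrative, not the actual swh.model code): reading with a fixed chunk size bounds memory usage regardless of where newlines fall in the input.

```python
import io


def iter_chunks(fobj, chunk_size=1024 * 1024):
    # Read a binary file object in fixed-size chunks. Unlike iterating
    # over the file object itself (which splits on newlines), this keeps
    # memory bounded even for binary files containing no line breaks.
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Usage: a 5-byte input with a 2-byte chunk size yields 3 chunks.
chunks = list(iter_chunks(io.BytesIO(b"aaaaa"), chunk_size=2))
```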

This revision now requires changes to proceed. Sep 27 2018, 1:26 PM
zack added inline comments.
swh/model/from_disk.py
80–83

A minor +1 on not passing length here too, mostly because it opens the door to inconsistency risks: one might pass a length that is not the actual length of the data, which makes this API more dangerous than it needs to be.

swh/model/from_disk.py
80–83

I don't like the verbosity of passing the length argument to from_data to avoid one call to len()

I don't either...

A minor +1 on not passing length here too, ...

ok.

But...

...API more dangerous than it could be.

To be fair, the internal hash_* functions were already this way (cf. e.g. hash_file).
They were initially like that because you need the length for the sha1_git computation.

I was trying to find a middleground here.
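For context on why sha1_git needs the length up front: Git's object identifier hashes a header that embeds the content length before the content itself. A self-contained sketch (not the swh.model implementation):

```python
import hashlib


def sha1_git(data: bytes) -> bytes:
    # Git's object id for a blob: sha1 over the header
    # b"blob <length>\0" followed by the raw content. The length must
    # therefore be known before (or while) hashing.
    h = hashlib.sha1()
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.digest()


# The empty blob hashes to Git's well-known empty-blob id.
empty_id = sha1_git(b"").hex()
```

This is why hashing a file object is awkward: unlike bytes in memory, a stream does not generally know its own size, so the caller has to supply it.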


What I would actually like to see from the MultiHash from_* methods is a way to ask for the length (or even the raw data) computed from the content/path/file. That would be consistent with our usage: we want it more often than not in the model clients (mostly loaders).
That would be optional, though.

For example, a parameter to the *digest method?

What do you think?
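One way the floated idea could look; this is a hypothetical sketch, not the real swh.model.hashutil.MultiHash API, and the with_length parameter is exactly the kind of optional digest parameter being proposed:

```python
import hashlib


class MultiHash:
    # Hypothetical sketch: track the total length alongside the hash
    # states so callers can get it back without a second pass over the
    # data.
    def __init__(self, algos=("sha1", "sha256")):
        self.state = {a: hashlib.new(a) for a in algos}
        self.length = 0

    def update(self, chunk: bytes):
        self.length += len(chunk)
        for h in self.state.values():
            h.update(chunk)

    def digest(self, with_length=False):
        # Hypothetical flag, as floated in the discussion above: return
        # the accumulated length next to the digests on request.
        res = {a: h.digest() for a, h in self.state.items()}
        if with_length:
            res["length"] = self.length
        return res


m = MultiHash(("sha1",))
m.update(b"abc")
d = m.digest(with_length=True)
```

Keeping the length a digest-time option (rather than a constructor argument) means existing callers that only want digests are unaffected.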

127–131

Right!

  • swh.model: Do multiple reads with a fixed chunk size
swh/model/from_disk.py
80–83

Only hash_file really needs the length to be passed by the caller (Python file objects don't generally know their size); all other API calls can infer it from the source. That's also how the old API was designed, and I missed that when reviewing the new API.

I can propose a diff to hashutil to clean that up?

127–131

👌

  • swh.model.hashutil: Remove extra length parameter
This revision is now accepted and ready to land. Sep 28 2018, 11:09 AM