
model.hashutil: Open new endpoint to allow to hash stream
Closed · Public

Authored by ardumont on Sep 14 2018, 1:43 AM.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

olasd requested changes to this revision.Sep 14 2018, 3:29 PM

NACK.

This is not generic at all as the iter_content() method is entirely specific to requests.

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

class MultiHash:
    def __init__(self, hash_names, length=None):
        self.state = {}
        self.track_length = False
        for name in hash_names:
            if name == 'length':
                self.state['length'] = 0
                self.track_length = True
            else:
                self.state[name] = _new_hash(name, length)

    @classmethod
    def from_state(cls, state, track_length):
        ret = cls([])
        ret.state = state
        ret.track_length = track_length
        return ret

    def update(self, chunk):
        for name, h in self.state.items():
            if name == 'length':
                continue
            h.update(chunk)
        if self.track_length:
            self.state['length'] += len(chunk)
    
    def digest(self):
        return {
            name: h.digest() if name != 'length' else h
            for name, h in self.state.items()
        }
    
    def hexdigest(self):
        return {
            name: h.hexdigest() if name != 'length' else h
            for name, h in self.state.items()
        }

    def copy(self):
        copied_state = {
            name: h.copy() if name != 'length' else h
            for name, h in self.state.items()
        }
        return self.from_state(copied_state, self.track_length)

(untested code, I've not even run it)

and going from there.
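As a concrete illustration of driving such a hashlib-style API chunk by chunk, here is a self-contained sketch using plain hashlib (a stand-in for the proposed class; any chunk source works, which is what makes the interface generic):

```python
import hashlib

# Minimal stand-in for the proposed MultiHash API, driven chunk by chunk.
hashes = {name: hashlib.new(name) for name in ('sha1', 'sha256')}
length = 0

# Any chunk source works here: requests' iter_content(), a file read loop, etc.
stream = [b'foo', b'bar', b'baz']
for chunk in stream:
    length += len(chunk)
    for h in hashes.values():
        h.update(chunk)

digests = {name: h.hexdigest() for name, h in hashes.items()}
digests['length'] = length
```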

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

This revision now requires changes to proceed.Sep 14 2018, 3:29 PM

NACK
This is not generic at all as the iter_content() method is entirely specific to requests.

Ah yeah. I did not check that iter_content was specific.

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

I did not want to rewrite everything though.

I hesitated to transform hash_file into something more generic without changing much of the existing code.
(that's why hash_file got touched by the way).

Extracting the read as a parameter function (defaulting to reading from a file).
That way, the client provides the way to read, which hides that implementation detail.

It's not as elegant as what you suggest, but it would touch less code (and it is tested :).
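A rough sketch of that read-as-a-parameter idea (the `hash_stream` name and signature are hypothetical, not part of the module):

```python
import hashlib
import io

def hash_stream(read, algorithms=('sha1', 'sha256'), chunk_size=4096):
    """Hash whatever `read` yields, hiding where the bytes come from.

    `read` is any callable taking a byte count and returning b'' at EOF.
    """
    hashes = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break
        for h in hashes.values():
            h.update(chunk)
    return {name: h.digest() for name, h in hashes.items()}

# Default client: read from a (here in-memory) file object.
digests = hash_stream(io.BytesIO(b'hello world').read)
```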

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

If we go that road, indeed.

Note:
"at all" plus "entirely" in the same sentence are both correct, but I would have understood the message without them.

...
(untested code, I've not even run it)

Well, don't worry, I'm not going to copy/paste it blindly.
I don't grok all of it yet, but I see some interesting stuff.
And I concur that hexdigest, and even with_length, become irrelevant.

For the sha1_git computation though, we need the length up front.
I don't see where we plug that yet.

I don't see where we plug that yet.

heh, it's plugged already ;)
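For context, a sketch of why the length is needed up front for sha1_git: git's blob identifier hashes a header containing the length before any content (plain-hashlib illustration, not the module's actual `_new_hash` plumbing, where the `length` argument serves this purpose):

```python
import hashlib

def sha1_git(data):
    # git's blob object id: sha1 over a header naming the length, then the bytes
    h = hashlib.sha1(b'blob %d\x00' % len(data))
    h.update(data)
    return h.hexdigest()

blob_id = sha1_git(b'hello')
```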

In D410#7949, @ardumont wrote:

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

I did not want to rewrite everything though.

I understand.

I hesitated to transform hash_file into something more generic without changing much of the existing code.
(that's why hash_file got touched by the way).

Extracting a read as a parameter function (defaulting to read from a file).
That way, the client provides the way to read, which hides that detail.

It's not as beautiful as suggested but it would touch less stuff (even though it's tested :).

Sure.

However, the test coverage of this module is already pretty good, and rewriting hash_file in terms of the class would exercise most of it (with the exception of copy and one of the two digest functions).

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

If we go that road, indeed.

I'm not too excited by the combinatorial explosion of adding a new argument to all functions in this module (which we should do if we want to keep it consistent).

If we do want to add this argument, we should keep the naming consistent and provide the three output formats (hash_to_*) we support:

  • 'bytes'
  • 'hex'
  • 'bytehex'
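For reference, a small sketch of how the three formats relate for a single digest (plain hashlib/binascii; the format names are the ones listed above):

```python
import binascii
import hashlib

digest = hashlib.sha1(b'foo').digest()

as_bytes = digest                            # 'bytes': the raw 20-byte digest
as_hex = binascii.hexlify(digest).decode()   # 'hex': an ascii str
as_bytehex = binascii.hexlify(digest)        # 'bytehex': hex digits as bytes (dulwich)
```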

Thinking forward, I could see a second, later refactoring step that would reimplement the hash_* functions as classmethods of the new MultiHash class, so you'd be able to call

MultiHash.from_file(file_object).digest()
MultiHash.from_path(b'foo', track_length=True).hexdigest()
MultiHash.from_data(b'foo', track_length=True).bytehexdigest()  # new method outputting the bytehex format dulwich needs

if you want a non-default hash format.
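A hedged sketch of what such classmethods could look like (only `from_data` shown; the class here is a minimal stand-in, not the diff's actual MultiHash):

```python
import hashlib

class MultiHash:
    def __init__(self, hash_names=('sha1', 'sha256')):
        self.state = {name: hashlib.new(name) for name in hash_names}

    @classmethod
    def from_data(cls, data, hash_names=('sha1', 'sha256')):
        # build the hashers, then feed them the whole buffer at once
        ret = cls(hash_names)
        for h in ret.state.values():
            h.update(data)
        return ret

    def hexdigest(self):
        return {name: h.hexdigest() for name, h in self.state.items()}

    def bytehexdigest(self):
        # hex digest as bytes, the format dulwich expects
        return {name: h.hexdigest().encode() for name, h in self.state.items()}

digests = MultiHash.from_data(b'foo').hexdigest()
```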

To be honest, my secret goal is to work towards replacing the callback pattern, as it makes us write code that's twisty and (IMO) unpythonic, e.g. when we want to collect the chunks to use them later.

Note:
"at all" plus "entirely" in the same sentence are both correct, but I would have understood the message without them.

Fair enough, sorry for the stronger than needed language.

  • hashutil: Add MultiHash to compute hashes of content in 1 roundtrip
  • hashutil: Set the hash_names defaulting to swh default hash algo
  • swh.model.hashutil: Remove unnecessary endpoints

Amend commit to remove the unnecessary associated docstring as well

  • swh.model.hashutil: Remove unnecessary endpoints

2nd step implementation (without migrating existing modules use)

  • hashutil: Improve MultiHash class from_* to compute hashes
  • swh.model.hashutil: Mark hash_* function as deprecated

Thinking forward, I could see a second, later refactoring step that would reimplement the hash_* functions as classmethods of the new MultiHash class, so you'd be able to call

It's done.

The historic functions hash_{data,file,path} are still present and marked as deprecated.
Their behavior has been reverted to the initial one (returning only the bytes digest).

Hopefully, this will discourage their use in favor of the MultiHash one.

It remains to migrate our code base to MultiHash, but that can be delayed and is independent of this diff.

Cheers,

swh/model/hashutil.py
99

I don't think this is used much. IIRC, it's only used once.

So I'm for removing that part later (once we have migrated the code base).

Thanks for the new version!

You've kept some docstrings for new arguments to the hash_* functions that are not implemented anymore (hash_format), and you've added a track_length argument to hash_path that's not being used. Once that's cleaned up, those changes should be ready to land.

swh/model/hashutil.py
99

In any case, we should just use the length that we've retrieved from os.path.getsize rather than computing it again.
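In code, that could look like this sketch (hypothetical helper, not the diff's actual implementation):

```python
import hashlib
import os
import tempfile

def multihash_from_path(path, algorithms=('sha1', 'sha256')):
    # reuse the size the filesystem already knows instead of recounting bytes
    length = os.path.getsize(path)
    hashes = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            for h in hashes.values():
                h.update(chunk)
    digests = {name: h.hexdigest() for name, h in hashes.items()}
    digests['length'] = length
    return digests

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    tmp = f.name
digests = multihash_from_path(tmp)
os.unlink(tmp)
```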

This revision is now accepted and ready to land.Sep 17 2018, 9:59 AM
ardumont added inline comments.
swh/model/hashutil.py
99

Right, fixing that as well.

ardumont marked an inline comment as done.
  • swh.model.hashutil: Reuse hash_path's old definition
  • MultiHash.from_path: Use the length coming from os.path
  • hashutil: Remove unused variables
  • hashutil: Update module and functions docstrings
  • hashutil: Clarify further the module docstring