
model.hashutil: Open new endpoint to allow to hash stream
Closed · Public

Authored by ardumont on Sep 14 2018, 1:43 AM.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

olasd requested changes to this revision.Sep 14 2018, 3:29 PM

NACK.

This is not generic at all as the iter_content() method is entirely specific to requests.

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

class MultiHash:
    def __init__(self, hash_names, length=None):
        self.state = {}
        self.track_length = False
        for name in hash_names:
            if name == 'length':
                self.state['length'] = 0
                self.track_length = True
            else:
                self.state[name] = _new_hash(name, length)

    @classmethod
    def from_state(cls, state, track_length):
        ret = cls([])
        ret.state = state
        ret.track_length = track_length
        return ret

    def update(self, chunk):
        for name, h in self.state.items():
            if name == 'length':
                continue
            h.update(chunk)
        if self.track_length:
            self.state['length'] += len(chunk)
    
    def digest(self):
        return {
            name: h.digest() if name != 'length' else h
            for name, h in self.state.items()
        }
    
    def hexdigest(self):
        return {
            name: h.hexdigest() if name != 'length' else h
            for name, h in self.state.items()
        }

    def copy(self):
        copied_state = {
            name: h.copy() if name != 'length' else h
            for name, h in self.state.items()
        }
        return self.from_state(copied_state, self.track_length)

(untested code, I've not even run it)

and going from there.
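As a concrete illustration of driving such a hashlib-style API chunk by chunk, here is a self-contained sketch using plain hashlib (a stand-in for the proposed class; any chunk source works, which is what makes the interface generic):

```python
import hashlib

# Minimal stand-in for the proposed MultiHash API, driven chunk by chunk.
hashes = {name: hashlib.new(name) for name in ('sha1', 'sha256')}
length = 0

# Any chunk source works here: requests' iter_content(), a file read loop, etc.
stream = [b'foo', b'bar', b'baz']
for chunk in stream:
    length += len(chunk)
    for h in hashes.values():
        h.update(chunk)

digests = {name: h.hexdigest() for name, h in hashes.items()}
digests['length'] = length
```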

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

This revision now requires changes to proceed.Sep 14 2018, 3:29 PM

NACK
This is not generic at all as the iter_content() method is entirely specific to requests.

Ah yeah. I did not check that iter_content was specific.

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

I did not want to rewrite everything though.

I hesitated to transform hash_file into something more generic without changing much of the existing code.
(that's why hash_file got touched by the way).

Extracting the read as a parameter function (defaulting to reading from a file).
That way, the client provides the way to read, which hides that implementation detail.

It's not as elegant as what you suggest, but it would touch less code (and it is tested :).
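A rough sketch of that read-as-a-parameter idea (the `hash_stream` name and signature are hypothetical, not part of the module):

```python
import hashlib
import io

def hash_stream(read, algorithms=('sha1', 'sha256'), chunk_size=4096):
    """Hash whatever `read` yields, hiding where the bytes come from.

    `read` is any callable taking a byte count and returning b'' at EOF.
    """
    hashes = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break
        for h in hashes.values():
            h.update(chunk)
    return {name: h.digest() for name, h in hashes.items()}

# Default client: read from a (here in-memory) file object.
digests = hash_stream(io.BytesIO(b'hello world').read)
```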

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

If we go that road, indeed.

Note:
"at all" plus "entirely" in the same sentence are both correct, but I would have understood the message without them.

...
(untested code, I've not even run it)

Well, don't worry, I'm not going to copy/paste it blindly.
I don't grok all of it yet, but I see some interesting stuff.
And I concur that hexdigest, and even with_length, become irrelevant.

For the sha1_git computation though, we need the length up front.
I don't see where we plug that yet.

I don't see where we plug that yet.

heh, it's plugged already ;)
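For context, a sketch of why the length is needed up front for sha1_git: git's blob identifier hashes a header containing the length before any content (plain-hashlib illustration, not the module's actual `_new_hash` plumbing, where the `length` argument serves this purpose):

```python
import hashlib

def sha1_git(data):
    # git's blob object id: sha1 over a header naming the length, then the bytes
    h = hashlib.sha1(b'blob %d\x00' % len(data))
    h.update(data)
    return h.hexdigest()

blob_id = sha1_git(b'hello')
```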

In D410#7949, @ardumont wrote:

To be most generic, I think providing a replica of the built-in Python hashlib API but supporting multiple hashes would be preferable:

I did not want to rewrite everything though.

I understand.

I hesitated to transform hash_file into something more generic without changing much of the existing code.
(that's why hash_file got touched by the way).

Extracting a read as a parameter function (defaulting to read from a file).
That way, the client provides the way to read, which hides that detail.

It's not as beautiful as suggested but it would touch less stuff (even though it's tested :).

Sure.

However, the test coverage of this module is already pretty good, and rewriting hash_file in terms of the class would exercise most of it (with the exception of copy and one of the two digest functions).

I'm not convinced that the added hexdigest argument is really necessary if we provide this interface.

If we go that road, indeed.

I'm not too excited by the combinatorial explosion of adding a new argument to all functions in this module (which we should do if we want to keep it consistent).

If we do want to add this argument, we should keep the naming consistent and provide the three output formats (hash_to_*) we support:

  • 'bytes'
  • 'hex'
  • 'bytehex'
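For reference, a small sketch of how the three formats relate for a single digest (plain hashlib/binascii; the format names are the ones listed above):

```python
import binascii
import hashlib

digest = hashlib.sha1(b'foo').digest()

as_bytes = digest                            # 'bytes': the raw 20-byte digest
as_hex = binascii.hexlify(digest).decode()   # 'hex': an ascii str
as_bytehex = binascii.hexlify(digest)        # 'bytehex': hex digits as bytes (dulwich)
```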

Thinking forward, I could see a second, later refactoring step that would reimplement the hash_* functions as classmethods of the new MultiHash class, so you'd be able to call

MultiHash.from_file(file_object).digest()
MultiHash.from_path(b'foo', track_length=True).hexdigest()
MultiHash.from_data(b'foo', track_length=True).bytehexdigest()  # new method outputting the bytehex format dulwich needs

if you want a non-default hash format.
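A hedged sketch of what such classmethods could look like (only `from_data` shown; the class here is a minimal stand-in, not the diff's actual MultiHash):

```python
import hashlib

class MultiHash:
    def __init__(self, hash_names=('sha1', 'sha256')):
        self.state = {name: hashlib.new(name) for name in hash_names}

    @classmethod
    def from_data(cls, data, hash_names=('sha1', 'sha256')):
        # build the hashers, then feed them the whole buffer at once
        ret = cls(hash_names)
        for h in ret.state.values():
            h.update(data)
        return ret

    def hexdigest(self):
        return {name: h.hexdigest() for name, h in self.state.items()}

    def bytehexdigest(self):
        # hex digest as bytes, the format dulwich expects
        return {name: h.hexdigest().encode() for name, h in self.state.items()}

digests = MultiHash.from_data(b'foo').hexdigest()
```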

To be honest, my secret goal is to work towards replacing the callback pattern, as it makes us write code that's twisty and (IMO) unpythonic, e.g. when we want to collect the chunks to use them later.

Note:
"at all" plus "entirely" in the same sentence are both correct, but I would have understood the message without them.

Fair enough, sorry for the stronger than needed language.

  • hashutil: Add MultiHash to compute hashes of content in 1 roundtrip
  • hashutil: Set the hash_names defaulting to swh default hash algo
  • swh.model.hashutil: Remove unnecessary endpoints

Amend commit to remove the unnecessary associated docstring as well

  • swh.model.hashutil: Remove unnecessary endpoints

2nd step implementation (without migrating existing modules use)

  • hashutil: Improve MultiHash class from_* to compute hashes
  • swh.model.hashutil: Mark hash_* function as deprecated

Thinking forward, I could see a second, later refactoring step that would reimplement the hash_* functions as classmethods of the new MultiHash class, so you'd be able to call

It's done.

The historic functions hash_{data,file,path} are still present and marked as deprecated.
Their behavior has been reverted to the initial one (returning only the bytes digest).

Hopefully, this will discourage their use in favor of the MultiHash one.

It remains to migrate our code base to MultiHash, but that can be delayed and is independent of this diff.

Cheers,

swh/model/hashutil.py
99

I don't think this is used much. IIRC, it's only used once.

So I'm for removing that part later (once we have migrated the code base).

Thanks for the new version!

You've kept some docstrings for new arguments to the hash_* functions that are not implemented anymore (hash_format), and you've added a track_length argument to hash_path that's not being used. Once that's cleaned up, those changes should be ready to land.

swh/model/hashutil.py
99

In any case, we should just use the length that we've retrieved from os.path.getsize rather than computing it again.
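In code, that could look like this sketch (hypothetical helper, not the diff's actual implementation):

```python
import hashlib
import os
import tempfile

def multihash_from_path(path, algorithms=('sha1', 'sha256')):
    # reuse the size the filesystem already knows instead of recounting bytes
    length = os.path.getsize(path)
    hashes = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            for h in hashes.values():
                h.update(chunk)
    digests = {name: h.hexdigest() for name, h in hashes.items()}
    digests['length'] = length
    return digests

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    tmp = f.name
digests = multihash_from_path(tmp)
os.unlink(tmp)
```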

This revision is now accepted and ready to land.Sep 17 2018, 9:59 AM
ardumont added inline comments.
swh/model/hashutil.py
99

Right, fixing that as well.

ardumont marked an inline comment as done.
  • swh.model.hashutil: Reuse hash_path's old definition
  • MultiHash.from_path: Use the length coming from os.path
  • hashutil: Remove unused variables
  • hashutil: Update module and functions docstrings
  • hashutil: Clarify further the module docstring