Page MenuHomeSoftware Heritage

Open variable length hash algorithm support in swh.model.hashutil._new_hash
ClosedPublic

Authored by ardumont on Mar 14 2017, 4:05 PM.

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 783
Build 1052: arc lint + arc unit

Event Timeline

Make an inexistant variable length hash algo instantiation break

zack requested changes to this revision.Mar 16 2017, 10:10 AM
zack added inline comments.
swh/model/hashutil.py
30–31

Given the syntax for specifying hashing algorithms has now become more complex, this module now deserves a module-level docstring describing the syntax itself. All other module functions that take a algorithm[s] parameter can remain as they are, they will just implicitly refer to the module-level doc on how to specify it.

31–37

Here we could add at least blake2 (which is supposed in py 3.5), right?

92

This test looks wrong.
Also, if what is passed after the : is not an int, this function will fail later on with a random exception, rather than with a proper "wrong hashing algo" ValueError.

I propose that at the beginning of this function we properly split "algo" into a pair algo/length, with length possibly None, doing all the validity checks on those and fail early if need be. The rest of the code of the function relies on the parsing having already been done.

100–102

As per my suggestion above, this would have already been done when reaching this point.

This revision now requires changes to proceed.Mar 16 2017, 10:10 AM
swh/model/hashutil.py
31–37

Here we could add at least blake2 (which is supposed in py 3.5), right?

On my machine, which is testing, i have blake2b512, blake2s256 indeed:

$ python3
>>> import hashlib
>>> hashlib.algorithms_available
{'BLAKE2s256', 'sha256', 'SHA1', 'SHA256', 'MD5-SHA1', 'sha224', 'blake2s256', 'sha1', 'BLAKE2b512', 'md5', 'sha512', 'SHA512', 'RIPEMD160', 'ripemd160', 'MD5', 'SHA384', 'whirlpool', 'sha384', 'md5-sha1', 'SHA224', 'blake2b512', 'MD4', 'md4'}

If we had it inside that variable, all our loaders will compute that hash though.
I don't think we want that (yet).

Instead, i'll spawn a new variable with DEFAULT_ALGORITHMS with the actual values (without blake2 that is).
This variable will be the default values for the actual utilities function (hash_data, hash_file, hash_path).

Then i update ALGORITHMS with that new blake2 (to ease the check on correct hash, see below your point my answer)

92

This test looks wrong.

Yes, i don't like it either. Thus the diff to shake brains to come up with a better implem.

Also, if what is passed after the : is not an int, this function will fail later on with a random exception, rather than with a proper "wrong hashing algo" ValueError.

Indeed.

I propose that at the beginning of this function we properly split "algo" into a pair algo/length, with length possibly None, doing all the validity checks on those and fail early if need be. The rest of the code of the function relies on the parsing having already been done.

Transforming ALGORITHMS from a set to a map, like:

ALGORITHMS = {
    'sha1': 'sha1',
    'sha256': 'sha256',
    'sha1_git': 'sha1',
    'blake2:256': 'blake2s256',
    'blake2:512': 'blake2b512',
}

would greatly simplify.

This way no need to split or do some substring computation (as per sha1_git).
We keep the _git check for sha1_git but reuse the base_algo already retrieved.

As a note, i need to check for the ALGORITHMS use in the higher module though (it's used already).

ardumont edited edge metadata.

swh.model.hashutil: Adaptations according to latest discussion

ardumont added inline comments.
swh/model/hashutil.py
92

Finally, after some discussion, simply adding blake2s256 and blake2b512 in ALGORITHMS should be enough.

ardumont marked an inline comment as done.

Fix typo in docstring

Use pyblake2 dependency on python3 <= 3.4

python3-blake2 is packaged within swh at https://forge.softwareheritage.org/source/python3-blake2/

Fix variable visibility + clarify comments on pyblake2

swh/model/hashutil.py
50

pyblake2 is the implementation used in version from 3.5 onwards (according to latest documentation).
As those functions already respects the hashlib api, we only need them to be registered in hashlib's internal model representation (for version prior to 3.4).

Tested in worker01 (jessie) with python3-blake2 packaged for the occasion.

ardumont@worker01:~% sudo apt-get install python3-pyblake2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  ruby-hiera
Use 'apt-get autoremove' to remove it.
The following NEW packages will be installed:
  python3-pyblake2
0 upgraded, 1 newly installed, 0 to remove and 2 not upgraded.
Need to get 0 B/16.8 kB of archives.
After this operation, 85.0 kB of additional disk space will be used.
Selecting previously unselected package python3-pyblake2.
(Reading database ... 60676 files and directories currently installed.)
Preparing to unpack .../python3-pyblake2_0.9.3-1_amd64.deb ...
Unpacking python3-pyblake2 (0.9.3-1) ...
Setting up python3-pyblake2 (0.9.3-1) ...
ardumont@worker01:~% # adapt installed swh.model.hashutil with this snippet
ardumont@worker01:~% python3
Python 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from swh.model import hashutil           # registers new blake2 hash algo in hashlib
>>> h = hashutil._new_hash('blake2b512')  # uses hashlib.new as implem. detail
>>> h.update(b'hello')
>>> h.digest()
b'\xe4\xcf\xa3\x9a=7\xbe1\xc5\x96\t\xe8\x07\x97\x07\x99\xca\xa6\x8a\x19\xbf\xaa\x15\x13_\x16P\x85\xe0\x1dA\xa6[\xa1\xe1\xb1F\xae\xb6\xbd\x00\x92\xb4\x9e\xac!L\x10<\xcf\xa3\xa3e\x95K\xbb\xe5/t\xa2\xb3b\x0c\x94'
>>> import hashlib    # using directly hashlib (registering already done)
>>> h = hashlib.new('blake2b512')
>>> h.update(b'hello')
>>> h.digest()
b'\xe4\xcf\xa3\x9a=7\xbe1\xc5\x96\t\xe8\x07\x97\x07\x99\xca\xa6\x8a\x19\xbf\xaa\x15\x13_\x16P\x85\xe0\x1dA\xa6[\xa1\xe1\xb1F\xae\xb6\xbd\x00\x92\xb4\x9e\xac!L\x10<\xcf\xa3\xa3e\x95K\xbb\xe5/t\xa2\xb3b\x0c\x94'            # same result \o/
ardumont@worker01:~% # revert back swh.model.hashutil
ardumont@worker01:~% # remove python3-pyblake2 package

From my machine (testing):

$ python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib     # no registering as this is native
>>> h = hashlib.new('blake2b512')
>>> h.update(b'hello')
>>> h.digest()
b'\xe4\xcf\xa3\x9a=7\xbe1\xc5\x96\t\xe8\x07\x97\x07\x99\xca\xa6\x8a\x19\xbf\xaa\x15\x13_\x16P\x85\xe0\x1dA\xa6[\xa1\xe1\xb1F\xae\xb6\xbd\x00\x92\xb4\x9e\xac!L\x10<\xcf\xa3\xa3e\x95K\xbb\xe5/t\xa2\xb3b\x0c\x94'

Same result \o/.
This works as entertained.

Rework and clarify commit messages

  • swh.model.hashutil: Open variable length hash algorithm support
  • swh.model.hashutil: Simplify length hash algorithms instantiation
  • swh.model.hashutil: Make unknown variable length algo creation break
  • swh.model.hashutil: Adapt according to latest discussion
  • swh.model.hashutil: Use pyblake2 dependency on python3 <= 3.4