Page MenuHomeSoftware Heritage

merkle: Make MerkleNode.collect return a set of nodes instead of a dict
ClosedPublic

Authored by anlambert on Oct 14 2022, 4:00 PM.

Details

Summary

Previously the MerkleNode.collect method was returning a dict whose keys
are node types and values dict of {<node_hash>: <node_data>}.

In order to give more flexibility to client code for the processing of
collected nodes, prefer to simply return a set of MerkleNode.

As a consequence, MerkleNode objects need to be hashable by Python so
add a new abstract method hash_to_int for converting hash value to
integer that derived classes must implement.

Closes T4633

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8686 (id=31361)

Rebasing onto 9b8beef158...

Current branch diff-target is up to date.
Changes applied before test
commit cb5acb2d3e45f23bb766f2cdd941b1888b8ee651
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 14 15:54:07 2022 +0200

    merkle: Make MerkleNode.collect return a set of nodes instead of a dict
    
    Previously the MerkleNode.collect method was returning a dict whose keys
    are node types and values dict of {<node_hash>: <node_data>}.
    
    In order to give more flexibility to client code for the processing of
    collected nodes, prefer to simply return a set of MerkleNode.
    
    As a consequence, MerkleNode objects need to be hashable by Python so
    add a new abstract method hash_to_int for converting hash value to
    integer that derived classes must implement.
    
    Closes T4633

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/537/ for more details.

Thanks for taking this suggestion in stride!

We'll need a breaking release of swh.model, but I don't think there were any users of these methods (left).

swh/model/merkle.py
108

That way we avoid giving all subclasses a new method.

For the object to be properly hashable (in terms of Python) we need to implement __eq__ too.

swh/model/tests/test_merkle.py
176

When we implement __eq__ on MerkleNode, this might work.

249–250

can now be an assertEqual

swh/model/merkle.py
108

That way we avoid giving all subclasses a new method.

Ah right, I forgot about that builtin.

For the object to be properly hashable (in terms of Python) we need to implement eq too.

There is already an __eq__ method implemented as follow:

def __eq__(self, other):
    return (
        isinstance(other, MerkleNode)
        and super().__eq__(other)
        and self.data == other.data
    )

But comparing hash values is faster that comparing nodes data so better update it.

vlorentz added inline comments.
swh/model/merkle.py
108

It's also prone to collisions, because it's a sha1_git.

swh/model/merkle.py
108

Ah good point, better keeping the current implementation then.

Update:

  • use hash builtin instead of adding a new hash_to_int method
  • update tests

Build is green

Patch application report for D8686 (id=31370)

Rebasing onto 9b8beef158...

Current branch diff-target is up to date.
Changes applied before test
commit 13e7adc3e85424e808189cfb2cf242364d57c928
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Fri Oct 14 15:54:07 2022 +0200

    merkle: Make MerkleNode.collect return a set of nodes instead of a dict
    
    Previously the MerkleNode.collect method was returning a dict whose keys
    are node types and values dict of {<node_hash>: <node_data>}.
    
    In order to give more flexibility to client code for the processing of
    collected nodes, prefer to simply return a set of MerkleNode.
    
    As a consequence, MerkleNode objects need to be hashable by Python so
    the __hash__ method has also been implemented.
    
    Closes T4633

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/538/ for more details.

This revision is now accepted and ready to land.Oct 17 2022, 2:58 PM