Page MenuHomeSoftware Heritage

storage*: Hex encode content hashes in HashCollision exception
ClosedPublic

Authored by ardumont on Tue, Mar 24, 1:34 PM.

Details

Summary

This encodes the collision's referenced colliding hashes into hex hashes.
This does the encoding at HashCollision instanciation time.

This provides a content_colliding_hashes method to retrieve the hashes as a dict of bytes.
This is mainly for client which captures the exception and wants to do something with it.
For example the replayer does capture the colliding hashes, log them and put them aside to reapply the transaction without those.

Note that it also moves the HashCollision exception to the swh.storage.exc module.

Related to T2332#42793

Impact:

  • swh.journal: retry behavior which puts aside colliding hashes D2874
Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Tue, Mar 24, 1:34 PM
vlorentz added a subscriber: vlorentz.EditedTue, Mar 24, 1:38 PM

I understand this is to work around the sentry issue, but I think it's a bad idea. We don't want parts of the API to use hex encoding while everything else uses bytes.

What about using log.error instead (which is caught by sentry_sdk just like exceptions)?

olasd added a subscriber: olasd.Tue, Mar 24, 2:32 PM

I understand this is to work around the sentry issue, but I think it's a bad idea. We don't want parts of the API to use hex encoding while everything else uses bytes.

What about using log.error instead (which is caught by sentry_sdk just like exceptions)?

I think there's general value in keeping exception arguments inambiguous and human-readable.

Having to fish out and copy/paste a repr'd bytes object is really annoying when you want to quickly check whether something is in the database or not, for instance. Also, I don't trust our exception handling utilities (be that logging, sentry, serialization in the RPC protocol, ...) to handle arbitrary bytes unharmed.

As for your proposal, we'd need to call log.error and explicitly decode the arguments in all places which can raise a HashCollision; that doesn't sound very appealing

olasd added a comment.EditedTue, Mar 24, 2:37 PM

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)
ardumont edited the summary of this revision. (Show Details)Tue, Mar 24, 3:11 PM
In D2872#69252, @olasd wrote:

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)

Something like:

class HashCollision(Exception):
    """Exception raised when a content collides in a storage backend

    """
    def __init__(self, algo, hash_id, colliding_contents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.algo = algo
        self.hash_id = hash_to_hex(hash_id)
        self.colliding_contents = [content_hex_hashes(c)
                                   for c in colliding_contents]

    def colliding_contents(self) -> Dict[str, bytes]:
        return [content_bytes_hashes(c) for c in self.colliding_contents]

?

olasd added a comment.Tue, Mar 24, 3:25 PM
In D2872#69252, @olasd wrote:

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)

Something like:

class HashCollision(Exception):
    """Exception raised when a content collides in a storage backend

    """
    def __init__(self, algo, hash_id, colliding_contents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.algo = algo
        self.hash_id = hash_to_hex(hash_id)
        self.colliding_contents = [content_hex_hashes(c)
                                   for c in colliding_contents]

    def colliding_contents(self) -> Dict[str, bytes]:
        return [content_bytes_hashes(c) for c in self.colliding_contents]

?

Yeah, without the attribute name clash :)

ardumont updated this revision to Diff 10237.Tue, Mar 24, 3:56 PM
  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes for clients (e.g. journal)
ardumont edited the summary of this revision. (Show Details)Tue, Mar 24, 4:07 PM
ardumont updated this revision to Diff 10240.Tue, Mar 24, 4:07 PM

Fix missing adaptations on test_retry module

ardumont edited the summary of this revision. (Show Details)Tue, Mar 24, 4:12 PM
ardumont edited the test plan for this revision. (Show Details)
ardumont updated this revision to Diff 10241.Tue, Mar 24, 4:13 PM
ardumont edited the summary of this revision. (Show Details)

Add missing revert in test_utils

In D2872#69251, @olasd wrote:

I think there's general value in keeping exception arguments inambiguous and human-readable.

Having to fish out and copy/paste a repr'd bytes object is really annoying when you want to quickly check whether something is in the database or not, for instance. Also, I don't trust our exception handling utilities (be that logging, sentry, serialization in the RPC protocol, ...) to handle arbitrary bytes unharmed.

sold!

ardumont edited the summary of this revision. (Show Details)Tue, Mar 24, 6:11 PM
olasd accepted this revision.Tue, Mar 24, 6:27 PM

I'm not sure about putting content_hex_hashes/content_bytes_hashes in utils, but I don't really have a better suggestion.

I've suggested one change inline.

swh/storage/exc.py
47–48

That's compatibility with the BaseException class, rather than backwards-compatibility. I guess you could just call super().__init__(algo, hash_id, colliding_contents) instead.

I guess this would make the swh.storage.api.client de-serialization work with no change (except the import path).

This revision is now accepted and ready to land.Tue, Mar 24, 6:27 PM
ardumont added inline comments.Tue, Mar 24, 6:38 PM
swh/storage/exc.py
47–48

Ok, it sounded reasonable but computer says no.

except if i change the constructor to:

self.algo = algo
self.hash_id = hash_to_hex(hash_id)
self.colliding_contents = [content_hex_hashes(c)
                           for c in colliding_contents]

super().__init__(self.algo, self.hash_id, self.colliding_contents)

I guess this would make the swh.storage.api.client de-serialization work with no change (except the import path).

Yes. I found the api_client code change quite clearer though.

ardumont updated this revision to Diff 10244.Tue, Mar 24, 6:41 PM

Adapt according to review

I'm not sure about putting content_hex_hashes/content_bytes_hashes in utils, but I don't really have a better suggestion.

Yes, i don't like it much. But i don't see anything better to do either.
Until a better idea comes up, that does the job, i guess.

swh/storage/exc.py
47–48

I did the inline change.
And reverted the api client code change.
Less moving parts ;)

ardumont edited the summary of this revision. (Show Details)Tue, Mar 24, 6:44 PM