Page MenuHomeSoftware Heritage

storage*: Hex encode content hashes in HashCollision exception
ClosedPublic

Authored by ardumont on Mar 24 2020, 1:34 PM.

Details

Summary

This encodes the collision's referenced colliding hashes into hex hashes.
This does the encoding at HashCollision instanciation time.

This provides a content_colliding_hashes method to retrieve the hashes as a dict of bytes.
This is mainly for client which captures the exception and wants to do something with it.
For example the replayer does capture the colliding hashes, log them and put them aside to reapply the transaction without those.

Note that it also moves the HashCollision exception to the swh.storage.exc module.

Related to T2332#42793

Impact:

  • swh.journal: retry behavior which puts aside colliding hashes D2874
Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11313
Build 17107: tox-on-jenkinsJenkins
Build 17106: arc lint + arc unit

Event Timeline

I understand this is to work around the sentry issue, but I think it's a bad idea. We don't want parts of the API to use hex encoding while everything else uses bytes.

What about using log.error instead (which is caught by sentry_sdk just like exceptions)?

I understand this is to work around the sentry issue, but I think it's a bad idea. We don't want parts of the API to use hex encoding while everything else uses bytes.

What about using log.error instead (which is caught by sentry_sdk just like exceptions)?

I think there's general value in keeping exception arguments inambiguous and human-readable.

Having to fish out and copy/paste a repr'd bytes object is really annoying when you want to quickly check whether something is in the database or not, for instance. Also, I don't trust our exception handling utilities (be that logging, sentry, serialization in the RPC protocol, ...) to handle arbitrary bytes unharmed.

As for your proposal, we'd need to call log.error and explicitly decode the arguments in all places which can raise a HashCollision; that doesn't sound very appealing

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)
In D2872#69252, @olasd wrote:

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)

Something like:

class HashCollision(Exception):
    """Exception raised when a content collides in a storage backend

    """
    def __init__(self, algo, hash_id, colliding_contents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.algo = algo
        self.hash_id = hash_to_hex(hash_id)
        self.colliding_contents = [content_hex_hashes(c)
                                   for c in colliding_contents]

    def colliding_contents(self) -> Dict[str, bytes]:
        return [content_bytes_hashes(c) for c in self.colliding_contents]

?

In D2872#69252, @olasd wrote:

Instead of changing all calls to HashCollision to convert the arguments to hex, we could:

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes if we need it later (e.g. for processing, in the journal)

Something like:

class HashCollision(Exception):
    """Exception raised when a content collides in a storage backend

    """
    def __init__(self, algo, hash_id, colliding_contents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.algo = algo
        self.hash_id = hash_to_hex(hash_id)
        self.colliding_contents = [content_hex_hashes(c)
                                   for c in colliding_contents]

    def colliding_contents(self) -> Dict[str, bytes]:
        return [content_bytes_hashes(c) for c in self.colliding_contents]

?

Yeah, without the attribute name clash :)

  • convert the hashes on initialization of the exception
  • add a method to pull the "bytes" version of the hashes for clients (e.g. journal)

Fix missing adaptations on test_retry module

ardumont edited the test plan for this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)

Add missing revert in test_utils

In D2872#69251, @olasd wrote:

I think there's general value in keeping exception arguments inambiguous and human-readable.

Having to fish out and copy/paste a repr'd bytes object is really annoying when you want to quickly check whether something is in the database or not, for instance. Also, I don't trust our exception handling utilities (be that logging, sentry, serialization in the RPC protocol, ...) to handle arbitrary bytes unharmed.

sold!

I'm not sure about putting content_hex_hashes/content_bytes_hashes in utils, but I don't really have a better suggestion.

I've suggested one change inline.

swh/storage/exc.py
47–48 ↗(On Diff #10241)

That's compatibility with the BaseException class, rather than backwards-compatibility. I guess you could just call super().__init__(algo, hash_id, colliding_contents) instead.

I guess this would make the swh.storage.api.client de-serialization work with no change (except the import path).

This revision is now accepted and ready to land.Mar 24 2020, 6:27 PM
swh/storage/exc.py
47–48 ↗(On Diff #10241)

Ok, it sounded reasonable but computer says no.

except if i change the constructor to:

self.algo = algo
self.hash_id = hash_to_hex(hash_id)
self.colliding_contents = [content_hex_hashes(c)
                           for c in colliding_contents]

super().__init__(self.algo, self.hash_id, self.colliding_contents)

I guess this would make the swh.storage.api.client de-serialization work with no change (except the import path).

Yes. I found the api_client code change quite clearer though.

I'm not sure about putting content_hex_hashes/content_bytes_hashes in utils, but I don't really have a better suggestion.

Yes, i don't like it much. But i don't see anything better to do either.
Until a better idea comes up, that does the job, i guess.

swh/storage/exc.py
47–48 ↗(On Diff #10241)

I did the inline change.
And reverted the api client code change.
Less moving parts ;)