Page MenuHomeSoftware Heritage

storage(s): Identify and provide the collision hash(es) in HashCollision exception
ClosedPublic

Authored by ardumont on Mar 6 2020, 2:09 PM.

Details

Summary

Storage clients may need to have details about what's colliding during adding
contents. To determine how to overcome the issue for example [1]

Furthermore, this unifies the inconsistent behavior in the 3 storage backends [2]

Now all storage backends provides the HashCollision exception with the following parameter signatures:
(algo: str, hash_id: bytes, content_hashes: Dict[str, bytes])

I'm not too happy about the implementation though.
I did not find a better way than to parse a specific exception message from
psycopg2 (e.diag.message_detail)... ouch.

[1] https://forge.softwareheritage.org/D2777#66365

[2] Sample output for the collided content prior to this diff (pdb breakpoint during integration test)

in-memory:
(('blake2s256', b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d"), ('sha1', b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89'), ('sha1_git', b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd'), ('sha256', b'h6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:'))

cassandra:
[Row(sha1_git=b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd', sha1=b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89', sha256=b'g6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:', blake2s256=b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d"), Row(sha1_git=b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd', sha1=b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89', sha256=b'h6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:', blake2s256=b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d")]

pgstorage:
Nothing
Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11008
Build 16573: tox-on-jenkinsJenkins
Build 16572: arc lint + arc unit

Event Timeline

swh/storage/utils.py
86

This may go away altogether if we use the BaseContent.hashes method which returns the equivalent Dict[str, bytes] of content hashes.

ardumont edited the summary of this revision. (Show Details)

Why List[Tuple[str, bytes]] instead of Dict[str, bytes]?

This is very hacky but I don't see any alternative (not even using D2248), so let's do that.

Why List[Tuple[str, bytes]] instead of Dict[str, bytes]?

Because i don't really know what's acceptable as type there.
And also, i only thought of checking the model *after* implementing this ;)

I'm all for doing the dict implementation instead.
And at the same time unifying the other storage backends.
Thanks.

This is very hacky but I don't see any alternative (not even using D2248), so let's do that.

yup
but now at least we have information on the content colliding ;)

Unify behavior to all backend storage to Provide the colliding content hashes

ardumont retitled this revision from storage: Identify and provide the collision hash(es) in HashCollision exception to storage(s): Identify and provide the collision hash(es) in HashCollision exception.Mar 6 2020, 11:54 PM
ardumont edited the summary of this revision. (Show Details)
ardumont added a project: Storage manager.

lgtm, but I'd like someone else to review it as well

swh/storage/storage.py
167

c.get_hash(hash_name)

This revision is now accepted and ready to land.Mar 7 2020, 9:23 AM
ardumont edited the summary of this revision. (Show Details)

Adapt according to review

Build is green

Cancel the previous build stuck since yesterday for some reason.
Trigger back the build and it's all good now.

swh/storage/utils.py
86

comment deprecated (for the first diff implementation on a method no longer existing now).

olasd requested changes to this revision.Mar 9 2020, 11:35 AM

As mentioned inline in the pg storage diff, in general we should return /all/ colliding contents that we can find, rather than a single one. So in the end, the exception argument should be a List[Dict[str, bytes]].

swh/storage/cassandra/storage.py
92–102

Am I right that this logic means that we add colliding contents to the storage, /then/ report a collision? We should probably flip the main table add and the index check.

swh/storage/storage.py
168

Why pick a single one instead of all colliding contents?

swh/storage/utils.py
61–62

I suggest using re.search on r'\((?P<type>[^)]+)\)=\(\\x(?P<id>[a-f0-9]+)\), then you can drop the replace lower down.

This revision now requires changes to proceed.Mar 9 2020, 11:35 AM
swh/storage/cassandra/storage.py
92–102

Correct. That's intentional to not give the illusion the "test and insert" is atomic.
IMO it shouldn't raise the exception at all, it's only to mimic the pgstorage's behavior.

swh/storage/cassandra/storage.py
92–102

Even when the test + insert behavior doesn't exhibit a race condition, this choice:

  • voluntarily lets data through that creates an ambiguity between storage and objstorage (by allowing the insertion of contents with colliding sha1s).
  • voluntarily lets data through that creates an ambiguity in edges of the archive graph (by allowing the insertion of contents with colliding sha1_gits).

That feels... quite bad.

Given that we're not even close to the point of migrating to another hash for either of these things, we're very far away from not raising the exception at all, and we should consider avoiding inserting colliding data /at the very least/ to the main table, if we can help it.

swh/storage/storage.py
168

Right, I'll fix that.

swh/storage/utils.py
61–62

I remember trying to match the '\\x' and fail to make it work.
I could not find the right amount of escaping.

I'll try again ;)

swh/storage/utils.py
61–62

Ok, those tests were done prior to me remembering about the r'' stanza.
Now it works \m/

  • pgstorage: Return the list of colliding content hashes
  • improve regexp extraction

Align storages to return the list of colliding hashes

swh/storage/cassandra/storage.py
92–102

Ack, i'm keeping this out of the scope of the diff though.

swh/storage/cassandra/converters.py
75

jsyk i did a test but that types error

$ pytest -x -s -k test_row_to_content_hashes
...
TypeError: tuple() takes no keyword arguments

The test in question:

+# Copyright (C) 2020  The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from swh.storage.cassandra.common import Row
+from swh.storage.cassandra import converters
+
+from swh.model.hashutil import DEFAULT_ALGORITHMS
+
+
+def test_row_to_content_hashes():
+    for row in [Row(
+            sha1=b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89',
+            sha1_git=b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd',  # noqa
+            sha256=b'g6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:',  # noqa
+            blake2s256=b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d",  # noqa
+    ),
+                Row(
+                    sha1=b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89',  # noqa
+                    sha1_git=b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd',  # noqa
+                    sha256=b'h6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:',  # noqa
+                    blake2s256=b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d",  # noqa
+                ),
+    ]:
+        actual_hashes = converters.row_to_content_hashes(row)
+
+        assert len(actual_hashes) == len(DEFAULT_ALGORITHMS)
+        for algo in DEFAULT_ALGORITHMS:
+            assert actual_hashes[algo] == getattr(row, algo)

It complains about tuple (Row is a type alias for tuple) that cannot have named fields.
Somehow, in the current execution context, the cassandra Row in question is more than a tuple, it seems:

context:

$ pytest -x -s -k test_content_add_collision
(Pdb++) pks[0]
Row(sha1=b'4\x972t\xcc\xefj\xb4\xdf\xaa\xf8e\x99y/\xa9\xc3\xfeF\x89', sha1_git=b'\xd8\x1c\xc0q\x0e\xb6\xcf\x9e\xfd[\x92\n\x84S\xe1\xe0qW\xb6\xcd', sha256=b'g6P\xf96\xcb;\n/\x93\xce\t\xd8\x1b\xe1\x07H\xb1\xb2\x03\xc1\x9e\x81v\xb4\xee\xfc\x19d\xa0\xcf:', blake2s256=b"\xd5\xfe\x199We'\xe4,\xfdv\xa9EZ$2\xfe\x7fVf\x95dW}\xd9<B\x80\xe7mf\x1d")
(Pdb++) r = pks[0]
(Pdb++) from swh.storage.cassandra.common import Row
(Pdb++) isinstance(r, Row)
True
(Pdb++) r.__class__
<class 'cassandra.io.libevreactor.Row'>

so i don't know how to actually test this...

swh/storage/cassandra/converters.py
75

The Row objects returned by cassandra-driver are namedtuples generated on the fly; I guess swh.storage.cassandra.common's Row = tuple is just a typechecking helper. For your tests, I guess you'll want to create a new namedtuple class with the proper attributes.

swh/storage/cassandra/storage.py
92–102

Of course, I just noticed it while reviewing this code.

swh/storage/cassandra/converters.py
75

Correct. I recomment row._asdict(), I already use it everywhere.

Add coverage on extra conversion step

swh/storage/cassandra/converters.py
75

That works but that does not typecheck...

swh/storage/cassandra/converters.py:73: error: "Tuple[Any, ...]" has no attribute "_asdict"

so i'll keep that version.

vlorentz requested changes to this revision.Mar 9 2020, 4:43 PM

Could you add more assertions to test_content_add_collision and test_content_add_metadata_collision, to check for the new common behavior?

This revision now requires changes to proceed.Mar 9 2020, 4:43 PM

Could you add more assertions to test_content_add_collision and test_content_add_metadata_collision, to check for the new common behavior?

Sure

Improve collision scenario checks

This revision was not accepted when it landed; it landed in state Needs Review.Mar 10 2020, 10:54 AM
This revision was automatically updated to reflect the committed changes.