
Analyze hash collisions
Open, Normal, Public

Description

Currently we store contents.
When hitting a hash collision, we do not reference the new content colliding with a previously stored one.

All in all, this task serves the purpose of making sure those collisions exist.
And what can we do to avoid losing information?

Related to T2019

Event Timeline

ardumont triaged this task as Normal priority. Edited Mar 24 2020, 12:01 PM
ardumont created this task.

sampled collisions extracted from sentry and storage [1]

[1] P622

Related to D2783

olasd added a subscriber: olasd. Mar 24 2020, 1:00 PM

I'll write my remarks down here for tracking purposes

  • it looks like there are a few actual collisions; they seem to be the known-colliding Google PDFs
  • there are a few of what look like false positives; most hashes are duplicates, save for one.

For these false positives, the hash in the database contains two 0x25 bytes next to one another; the "colliding" object only has a single 0x25 byte (and therefore a hash of the wrong length).

Interpreted as ascii, [0x25, 0x25] is "%%"

"%%" is an escape that can turn into "%" in a few situations:

  • python %-encoding
  • sql LIKE operations

"%"s are also special in some HTTP operations; maybe something wrongly handles "%%" encoded as "%25%25" and ends up storing only "%".
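The "%%" collapse is easy to reproduce in plain Python (a small illustration of the mechanism, not the actual loader code):

```python
from urllib.parse import unquote

# Interpreted as ASCII, the bytes [0x25, 0x25] are "%%".
assert bytes([0x25, 0x25]).decode("ascii") == "%%"

# Python %-formatting treats "%%" as the escape for one literal "%",
# so a string passing through an extra formatting step loses a byte:
assert "%%" % () == "%"

# URL-decoding "%25%25" yields "%%"; if that result is later fed through
# %-formatting (e.g. by a logging or DB layer), only "%" remains.
assert unquote("%25%25") == "%%"
assert unquote("%25%25") % () == "%"
```

Any component that applies one decode or format step too many would produce exactly the single-0x25 objects observed in the false positives.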

Right now, I'm not convinced that the storage ever sees the objects with the mismatched hashes. To be more sure of that, I think we should make sure that all hash data in all exception arguments is hex-encoded unicode strings, rather than bytes objects left for python to repr(); this would circumvent a lot of places where encoding or decoding the data in transfer can go wrong.
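A minimal sketch of that idea (names and signatures are hypothetical; the actual change is D2872):

```python
def hash_to_hex(h):
    """Render hash data as a hex string so exception args stay transport-safe."""
    return h.hex() if isinstance(h, (bytes, bytearray)) else h

class HashCollision(Exception):
    """Raised when an object's hash collides with an already-stored one."""
    def __init__(self, algo, hash_id, colliding_contents):
        self.algo = algo
        self.hash_id = hash_to_hex(hash_id)
        super().__init__(algo, self.hash_id, colliding_contents)

# The exception args are now readable hex rather than raw bytes left to repr():
exc = HashCollision(
    "sha1", bytes.fromhex("34973274ccef6ab4dfaaf86599792fa9c3fe4689"), []
)
assert exc.hash_id == "34973274ccef6ab4dfaaf86599792fa9c3fe4689"
```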

I'd also like to compare the time of HashCollision with the ctime of the object in the database; if both timestamps are close to one another, it might be a sign that these collisions are the common insert/insert postgresql race condition.

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.

it looks like there are a few actual collisions; they seem to be the known-colliding Google PDFs

Yes, 3.

there are a few of what look like false positives; most hashes are duplicates, save for one.

Yes, I have updated the paste to separate the false ones from the real ones.
That should be clearer now ;)

ardumont added a comment. Edited Mar 24 2020, 3:04 PM

to be more sure of that, I think we should make sure that all hash data in all exception arguments is hex-encoded unicode strings, rather than bytes objects left for python to repr(); this would circumvent a lot of places where encoding or decoding the data in transfer can go wrong.

That's D2872.

I'd also like to compare the time of HashCollision with the ctime of the object in the database; if both timestamps are close to one another, it might be a sign that these collisions are the common insert/insert postgresql race condition.

cooking...
cooked...

They are quite close indeed (on the order of minutes apart).
P622 updated: the stored content has its ctime displayed, and the sentry one has a field 'date-reported-by-sentry' which corresponds to the dateCreated field in the sentry event API call output.
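The closeness check boils down to comparing those two timestamps (the values below are made up for illustration, not taken from P622):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values: ctime from the content row in postgres,
# dateCreated from the sentry event API output.
ctime = datetime(2020, 3, 24, 10, 2, 11, tzinfo=timezone.utc)
reported = datetime(2020, 3, 24, 10, 5, 40, tzinfo=timezone.utc)

# Minutes apart: consistent with the insert/insert race-condition theory.
assert abs(reported - ctime) < timedelta(minutes=10)
```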

ardumont added a comment. Edited Mar 24 2020, 7:03 PM

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.

That's the case.

It's declared in the pg schema [1]

and it's installed in the db.

$ psql service=swh
softwareheritage=> \dT+ blake2s256
                                          List of data types
 Schema |    Name    | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------------+---------------+------+----------+------------+-------------------+-------------
 public | blake2s256 | blake2s256    | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1
                                       List of data types
 Schema | Name | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------+---------------+------+----------+------------+-------------------+-------------
 public | sha1 | sha1          | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1_git
                                         List of data types
 Schema |   Name   | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+----------+---------------+------+----------+------------+-------------------+-------------
 public | sha1_git | sha1_git      | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha256
                                        List of data types
 Schema |  Name  | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+--------+---------------+------+----------+------------+-------------------+-------------
 public | sha256 | sha256        | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \d content
                                          Table "public.content"
   Column   |           Type           | Collation | Nullable |                  Default
------------+--------------------------+-----------+----------+--------------------------------------------
 sha1       | sha1                     |           | not null |
 sha1_git   | sha1_git                 |           | not null |
 sha256     | sha256                   |           | not null |
 length     | bigint                   |           | not null |
 ctime      | timestamp with time zone |           | not null | now()
 status     | content_status           |           | not null | 'visible'::content_status
 object_id  | bigint                   |           | not null | nextval('content_object_id_seq'::regclass)
 blake2s256 | blake2s256               |           |          |
Indexes:
    "content_pkey" PRIMARY KEY, btree (sha1)
    "content_object_id_idx" UNIQUE, btree (object_id)
    "content_sha1_git_idx" UNIQUE, btree (sha1_git)
    "content_blake2s256_idx" btree (blake2s256)
    "content_ctime_idx" btree (ctime)
    "content_sha256_idx" btree (sha256)
Publications:
    "softwareheritage"

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/30-swh-schema.sql$22-32
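The domains above carry length check constraints in the schema; an application-side equivalent of what the database enforces could be sketched like this (an illustration, not the actual swh-storage code):

```python
# Expected digest sizes in bytes, matching the pg domain check constraints.
EXPECTED_LENGTHS = {"sha1": 20, "sha1_git": 20, "sha256": 32, "blake2s256": 32}

def validate_hashes(hashes: dict) -> None:
    """Reject any hash whose byte length does not match its algorithm."""
    for algo, value in hashes.items():
        expected = EXPECTED_LENGTHS[algo]
        if len(value) != expected:
            raise ValueError(
                f"{algo}: expected {expected} bytes, got {len(value)}"
            )

validate_hashes({"sha1": b"\x00" * 20})      # ok
try:
    # One byte short, like a "%%" hash that lost a 0x25 byte:
    validate_hashes({"sha1": b"\x00" * 19})
except ValueError as e:
    assert "expected 20 bytes, got 19" in str(e)
```

This is why the mismatched-length "colliding" objects from the false positives could never have been inserted as-is.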

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.

That's the case.

*phew* thanks for checking ;)

All in all, this task serves the purpose of making sure those collisions exist.

They do exist.

They are not frequent: only 3 so far.
I have updated the P622 summary.

And what can we do to avoid losing information?

There have been discussions about it; I don't remember everything, but here is my summary attempt:

  • storage: add a new "skipped_content" entry with status 'collision' (does not deal with the objstorage)
  • objstorage: completely change the hashes used to store content objects
  • objstorage: use sha1dc as the key
  • objstorage: use a "multiplexer" pattern to store under a different key algorithm in case of collision

Feel free to correct and/or update the list ;)
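For the last bullet, the "multiplexer" idea might look roughly like this (purely illustrative: FallbackKeyStore and its key scheme are made up for this sketch, not the swh objstorage API):

```python
import hashlib

class FallbackKeyStore:
    """Store content under a sha1 key; fall back to sha256 on collision."""

    def __init__(self):
        self._store = {}  # key string -> content bytes

    def add(self, content: bytes) -> str:
        key = "sha1:" + hashlib.sha1(content).hexdigest()
        existing = self._store.get(key)
        if existing is not None and existing != content:
            # Actual sha1 collision (e.g. the known-colliding PDFs):
            # keep both objects by switching to a sha256-based key.
            key = "sha256:" + hashlib.sha256(content).hexdigest()
        self._store[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._store[key]
```

With such a scheme the 3 real sha1 collisions would remain retrievable, since no two of the colliding objects also share a sha256.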

An interesting experiment: disabling the proxy buffer storage in the nixguix loader configuration.
The number of hash collision events then dropped to 0 (no new event for that loader since yesterday around 6pm our time).

(forgot to submit)

ardumont updated the task description. Apr 8 2020, 4:15 PM

So our high number of false hash collisions is now fixed thanks to D2977 \m/.

ardumont added a comment. Edited Apr 14 2020, 2:08 PM

Remains open because there remain decisions to be made
about the few real ones (3) we have so far [1]

[1] T2332#43125