
Analyze hash collisions
Closed, Migrated

Description

Currently we store contents.
When we hit a hash collision, we do not reference the new content that collides with an already stored one.

All in all, this task serves the purpose of making sure those collisions actually exist.
It is also about figuring out what we can do to avoid losing information.

Related to T2019

Event Timeline

ardumont triaged this task as Normal priority. Edited · Mar 24 2020, 12:01 PM
ardumont created this task.

Sampled collisions extracted from Sentry and storage [1]

[1] P622

Related to D2783

I'll write my remarks down here for tracking purposes

  • It looks like there are a few actual collisions; they seem to be the known-colliding Google PDFs.
  • There are a few of what look like false positives; most hashes are duplicates, save for one.

For these false positives, the hash in the database contains two 0x25 bytes next to one another; the "colliding" object only has a single 0x25 byte (and therefore a hash of the wrong length).

Interpreted as ASCII, [0x25, 0x25] is "%%".

"%%" is an escape that can turn into "%" in a few situations:

  • Python %-formatting
  • SQL LIKE operations

"%"s are also special in some HTTP operations; maybe something wrongly handles "%%" encoded as "%25%25" and ends up storing only "%".

Right now, I'm not convinced that the storage ever sees the objects with the mismatched hashes. To be more certain of that, I think we should make sure that all hash data in all exception arguments is hex-encoded unicode strings rather than bytes objects left for Python to repr(); this would sidestep a lot of the places where encoding or decoding the data in transit can go wrong.
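
As a hedged sketch of what that would look like (the class below is an illustrative stand-in, not the exact swh.storage exception):

class HashCollision(Exception):
    """Illustrative collision exception carrying only hex strings."""

    def __init__(self, algo, hash_id, colliding_hashes):
        # hex-encoded str survives any serialization boundary (RPC,
        # Sentry, logging) unambiguously; repr() of bytes does not
        super().__init__(
            algo,
            hash_id.hex(),
            [h.hex() if isinstance(h, bytes) else h for h in colliding_hashes],
        )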

I'd also like to compare the time of HashCollision with the ctime of the object in the database; if both timestamps are close to one another, it might be a sign that these collisions are the common insert/insert PostgreSQL race condition.
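
For context, a condensed sketch of that race (illustrative psycopg2 code; the real content_add path is more involved):

from psycopg2 import errors

def insert_content(cur, row):
    # check-then-insert without a guard: two workers can both pass the
    # SELECT, then race on the INSERT; the loser hits a unique violation
    # on content_pkey even though its bytes are identical to the winner's
    cur.execute("select 1 from content where sha1 = %s", (row["sha1"],))
    if cur.fetchone() is None:
        try:
            cur.execute(
                "insert into content (sha1, sha1_git, sha256, length)"
                " values (%(sha1)s, %(sha1_git)s, %(sha256)s, %(length)s)",
                row,
            )
        except errors.UniqueViolation:
            # surfaced to callers as a collision: a spurious report
            raise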

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
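
A minimal version of such a guard, assuming the usual digest lengths (the function name is hypothetical):

HASH_LENGTHS = {"sha1": 20, "sha1_git": 20, "sha256": 32, "blake2s256": 32}

def check_hash_lengths(hashes):
    # reject any hash whose byte length does not match its algorithm
    for algo, value in hashes.items():
        expected = HASH_LENGTHS[algo]
        if len(value) != expected:
            raise ValueError(
                "%s: expected %d bytes, got %d" % (algo, expected, len(value))
            )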

It looks like there are a few actual collisions; they seem to be the known-colliding Google PDFs.

Yes, 3.

There are a few of what look like false positives; most hashes are duplicates, save for one.

Yes, I have updated the paste to separate the false positives from the real ones.
That should be clearer now ;)

To be more certain of that, I think we should make sure that all hash data in all exception arguments is hex-encoded unicode strings rather than bytes objects left for Python to repr(); this would sidestep a lot of the places where encoding or decoding the data in transit can go wrong.

That's D2872.

I'd also like to compare the time of HashCollision with the ctime of the object in the database; if both timestamps are close to one another, it might be a sign that these collisions are the common insert/insert PostgreSQL race condition.

cooking...
cooked...

They are quite close indeed (on the order of minutes apart).
P622 updated: the stored content has its ctime displayed, and the Sentry one has a field 'date-reported-by-sentry' which corresponds to the dateCreated field in the Sentry event API call output.
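
Roughly the comparison that was done, as a sketch (field names follow the paste; fetching the Sentry event is left out):

def collision_delay(cur, sha1, date_reported_by_sentry):
    # seconds between the stored row's ctime and the Sentry report;
    # a small delta hints at the insert/insert race, not a real collision
    cur.execute("select ctime from content where sha1 = %s", (sha1,))
    (ctime,) = cur.fetchone()
    return (date_reported_by_sentry - ctime).total_seconds()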

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.

That's the case.

It's declared in the pg schema [1], and it's installed in the db:

$ psql service=swh
softwareheritage=> \dT+ blake2s256
                                          List of data types
 Schema |    Name    | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------------+---------------+------+----------+------------+-------------------+-------------
 public | blake2s256 | blake2s256    | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1
                                       List of data types
 Schema | Name | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------+---------------+------+----------+------------+-------------------+-------------
 public | sha1 | sha1          | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1_git
                                         List of data types
 Schema |   Name   | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+----------+---------------+------+----------+------------+-------------------+-------------
 public | sha1_git | sha1_git      | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha256
                                        List of data types
 Schema |  Name  | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+--------+---------------+------+----------+------------+-------------------+-------------
 public | sha256 | sha256        | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \d content
                                          Table "public.content"
   Column   |           Type           | Collation | Nullable |                  Default
------------+--------------------------+-----------+----------+--------------------------------------------
 sha1       | sha1                     |           | not null |
 sha1_git   | sha1_git                 |           | not null |
 sha256     | sha256                   |           | not null |
 length     | bigint                   |           | not null |
 ctime      | timestamp with time zone |           | not null | now()
 status     | content_status           |           | not null | 'visible'::content_status
 object_id  | bigint                   |           | not null | nextval('content_object_id_seq'::regclass)
 blake2s256 | blake2s256               |           |          |
Indexes:
    "content_pkey" PRIMARY KEY, btree (sha1)
    "content_object_id_idx" UNIQUE, btree (object_id)
    "content_sha1_git_idx" UNIQUE, btree (sha1_git)
    "content_blake2s256_idx" btree (blake2s256)
    "content_ctime_idx" btree (ctime)
    "content_sha256_idx" btree (sha256)
Publications:
    "softwareheritage"

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/30-swh-schema.sql$22-32
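
For the record, one way to confirm from Python that those domains carry a length check in the live db (this only queries the standard PostgreSQL catalogs; nothing swh-specific):

import psycopg2

with psycopg2.connect(service="swh") as conn, conn.cursor() as cur:
    cur.execute(
        """
        select t.typname, pg_get_constraintdef(c.oid)
        from pg_constraint c join pg_type t on c.contypid = t.oid
        where t.typname in ('sha1', 'sha1_git', 'sha256', 'blake2s256')
        """
    )
    for typname, constraintdef in cur.fetchall():
        print(typname, constraintdef)  # e.g. sha1  CHECK (length(VALUE) = 20)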

Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.

That's the case.

*phew* thanks for checking ;)

All in all, this task serves the purpose of making sure those collisions actually exist.

They do exist.

They are not frequent, only 3.
I have updated the P622 summary.

It is also about figuring out what we can do to avoid losing information.

There have been discussions about it; I don't remember everything, but here is my summary attempt:

  • storage: add a new "skipped_content" entry with status 'collision' (does not deal with the objstorage)
  • objstorage: completely change the hashes used to store content objects
  • objstorage: use sha1dc as the key
  • objstorage: use a "multiplexer" pattern to store under a different key algorithm in case of collision (see the sketch below)

Feel free to correct and/or update the list ;)
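
To make that last option concrete, a very rough sketch of the multiplexer idea (all names and the store API are hypothetical, not the actual swh.objstorage interface):

import hashlib

class MultiplexObjStorage:
    # store under sha1; on a real sha1 collision, fall back to a second
    # store keyed by sha256, and disambiguate on read

    def __init__(self, primary, fallback):
        self.primary = primary    # hypothetical store keyed by sha1
        self.fallback = fallback  # hypothetical store keyed by sha256

    def add(self, content, hashes):
        existing = self.primary.get(hashes["sha1"])
        if existing is not None and existing != content:
            # same sha1, different bytes: a real collision
            self.fallback.add(hashes["sha256"], content)
        else:
            self.primary.add(hashes["sha1"], content)

    def get(self, hashes):
        content = self.primary.get(hashes["sha1"])
        if content is not None and hashlib.sha256(content).digest() == hashes["sha256"]:
            return content
        return self.fallback.get(hashes["sha256"])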

An interesting experiment: disabling the proxy buffer storage in the nixguix loader configuration.
The number of hash collisions dropped to 0 (no new event for that loader since yesterday around 6pm our time).

(forgot to submit)

So our high number of false hash collisions is now fixed thanks to D2977 \m/.

This remains open because decisions remain to be made
about the few real ones (3) we have so far [1]

[1] T2332#43125