
Dealing with repositories with contents that produces hash conflicts (example included from GitLab)
Open, High, Public

Description

The (in)famous pair of different files with the same length and the same SHA1 (SHAttered) is being included as test data in cryptography-related projects. One example showed up when we failed to load the https://gitlab.com/sequoia-pgp/sequoia repository, which contains these files.

$ git clone https://gitlab.com/sequoia-pgp/sequoia
[...]
$ cd sequoia/openpgp/tests/data/messages
$ sha1sum shattered-[12].pdf 
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf

It turns out that this does not pose a problem for git, nor for our SWHIDv1, as the SHA1 conflicting files do not produce a SHA1-git conflict: indeed, these files are properly stored in the sequoia project.

$ git hash-object shattered-[12].pdf
ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
b621eeccd5c7edac9b7dcba35a8d5afd075e24f2
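The difference between the two commands comes from what gets hashed: sha1sum hashes the raw bytes, while git hashes a "blob <length>" header, a NUL byte, and then the bytes. Since the two PDFs collide only as raw files, the extra header is enough to make their git hashes diverge. A minimal Python sketch of the two computations (the function names are illustrative, not taken from any swh module):

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain SHA-1 of the raw bytes, as printed by sha1sum."""
    return hashlib.sha1(data).hexdigest()


def sha1_git_hex(data: bytes) -> str:
    """SHA-1 as git computes it for a blob: header, NUL byte, then the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    sample = b"hello\n"
    # The two digests differ because the hashed byte streams differ.
    print("sha1    ", sha1_hex(sample))
    print("sha1_git", sha1_git_hex(sample))
```

Running both functions on the contents of shattered-1.pdf and shattered-2.pdf should reproduce the sha1sum and git hash-object outputs shown above.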

But our current pipeline detects the SHA1 conflict and prevents their ingestion.

We need to design a way to archive such repositories, instead of skipping them as we do today.

Related Objects

Mentioned In
T3134: SWHID v2
Mentioned Here
T3134: SWHID v2

Event Timeline

rdicosmo created this task.

Loading the repository in the docker environment gives me the following traceback:

swh-loader_1                        | [2021-12-07 12:08:06,876: INFO/MainProcess] Task swh.loader.git.tasks.UpdateGitRepository[a1aa28c0-1cb0-4e2a-8ae2-720ba6ca439e] received
swh-loader_1                        | [2021-12-07 12:08:06,877: INFO/MainProcess] loader@b11bfd448510 ready.
swh-loader_1                        | [2021-12-07 12:08:06,957: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1                        | [2021-12-07 12:08:09,904: INFO/ForkPoolWorker-1] Load origin 'https://gitlab.com/sequoia-pgp/sequoia' with type 'git'
swh-loader_1                        | [2021-12-07 12:08:09,908: DEBUG/ForkPoolWorker-1] Transport url to communicate with server: https://gitlab.com/sequoia-pgp/sequoia
swh-loader_1                        | [2021-12-07 12:08:09,909: DEBUG/ForkPoolWorker-1] Client Urllib3HttpGitClient('https://gitlab.com/sequoia-pgp/sequoia/', dumb=None) to fetch pack at /sequoia-pgp/sequoia
swh-loader_1                        | [2021-12-07 12:08:10,422: DEBUG/ForkPoolWorker-1] local_heads_count=0
swh-loader_1                        | [2021-12-07 12:08:10,422: DEBUG/ForkPoolWorker-1] remote_heads_count=1821
swh-loader_1                        | [2021-12-07 12:08:10,422: DEBUG/ForkPoolWorker-1] wanted_refs_count=1821
swh-loader_1                        | [2021-12-07 12:09:17,112: ERROR/ForkPoolWorker-1] Loading failure, updating to `failed` status
swh-loader_1                        | Traceback (most recent call last):
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 29, in raise_for_status
swh-loader_1                        |     super().raise_for_status(response)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 344, in raise_for_status
swh-loader_1                        |     raise exception from None
swh-loader_1                        | swh.core.api.RemoteException: <RemoteException 500 HashCollision: ['sha1', '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', [{'blake2s256': '30e4bd16c3f98e74429d237c19ca9def702e5720cb124cb4b92e74f989aaf116', 'sha1': '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', 'sha1_git': 'b621eeccd5c7edac9b7dcba35a8d5afd075e24f2', 'sha256': 'd4488775d29bdef7993367d541064dbdda50d383f89f0aa13a6ff2e0894ba5ff'}, {'blake2s256': '8f677e3214ca8b2acad91884a1571ef3f12b786501f9a6bedfd6239d82095dd2', 'sha1': '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', 'sha1_git': 'ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0', 'sha256': '2bb787a73e37352f92383abe7e2902936d1059ad9f1ba6daaa9c1e58ee6970d0'}]]>
swh-loader_1                        | 
swh-loader_1                        | During handling of the above exception, another exception occurred:
swh-loader_1                        | 
swh-loader_1                        | Traceback (most recent call last):
swh-loader_1                        |   File "/src/swh-loader-core/swh/loader/core/loader.py", line 339, in load
swh-loader_1                        |     self.store_data()
swh-loader_1                        |   File "/src/swh-loader-core/swh/loader/core/loader.py", line 458, in store_data
swh-loader_1                        |     self.storage.directory_add([directory])
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 171, in directory_add
swh-loader_1                        |     stats = self.object_add(directories, object_type="directory", keys=["id"])
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 224, in object_add
swh-loader_1                        |     return self.flush()
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 286, in flush
swh-loader_1                        |     stats = add_fn(list(batch))
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 58, in content_add
swh-loader_1                        |     [x for x in content if x.sha256 in contents_to_add]
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 45, in content_add
swh-loader_1                        |     return self.post("content/add", {"content": content})
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 278, in post
swh-loader_1                        |     return self._decode_response(response)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 354, in _decode_response
swh-loader_1                        |     self.raise_for_status(response)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 39, in raise_for_status
swh-loader_1                        |     raise HashCollision(*e.args[0]["args"])
swh-loader_1                        | swh.storage.exc.HashCollision: ('sha1', '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', [{'sha256': 'd4488775d29bdef7993367d541064dbdda50d383f89f0aa13a6ff2e0894ba5ff', 'sha1': '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', 'sha1_git': 'b621eeccd5c7edac9b7dcba35a8d5afd075e24f2', 'blake2s256': '30e4bd16c3f98e74429d237c19ca9def702e5720cb124cb4b92e74f989aaf116'}, {'sha256': '2bb787a73e37352f92383abe7e2902936d1059ad9f1ba6daaa9c1e58ee6970d0', 'sha1': '38762cf7f55934b34d179ae6a4c80cadccbb7f0a', 'sha1_git': 'ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0', 'blake2s256': '8f677e3214ca8b2acad91884a1571ef3f12b786501f9a6bedfd6239d82095dd2'}])

So we have a hash collision between two contents.

If we look at the content of the openpgp/tests/data/messages directory in the repository, we can see that it contains the files shattered-1.pdf and shattered-2.pdf which are examples of the concrete collision attack against SHA-1.
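To illustrate the kind of check that produces this error, here is a toy sketch, not the actual swh.storage code: an index keyed by sha1 that raises as soon as two contents share a sha1 but differ on any other hash. ToyContentIndex, ToyHashCollision and compute_hashes are hypothetical names introduced for this example only.

```python
import hashlib
from typing import Dict


class ToyHashCollision(Exception):
    """Hypothetical stand-in for swh.storage.exc.HashCollision."""


def compute_hashes(data: bytes) -> Dict[str, str]:
    """Compute a subset of the hashes that appear in the traceback."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data).hexdigest(),
    }


class ToyContentIndex:
    """In-memory index that treats sha1 as a unique key."""

    def __init__(self) -> None:
        self._by_sha1: Dict[str, Dict[str, str]] = {}

    def add(self, hashes: Dict[str, str]) -> bool:
        """Return True if newly added, False if already known.

        Raise ToyHashCollision if the sha1 is known but other hashes differ.
        """
        known = self._by_sha1.get(hashes["sha1"])
        if known is None:
            self._by_sha1[hashes["sha1"]] = dict(hashes)
            return True
        if known != hashes:
            raise ToyHashCollision("sha1", hashes["sha1"], [known, hashes])
        return False


if __name__ == "__main__":
    index = ToyContentIndex()
    first = compute_hashes(b"first content")
    second = compute_hashes(b"second content")
    # Simulate a SHA-1 collision by forcing the same sha1 on both records.
    second["sha1"] = first["sha1"]
    index.add(first)
    try:
        index.add(second)
    except ToyHashCollision as exc:
        print("collision on", exc.args[0], exc.args[1])
```

The exception arguments mirror the shape seen in the traceback (algorithm name, colliding hash, list of the conflicting hash sets), but that layout is copied from the log above rather than from the library's source.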

Thanks a lot @anlambert for looking into this.
It is possible that more key cryptographic software will include these files.
We need a strategy to handle this situation. Could you add this example to the SWHID v2 task?

> It is possible that more key cryptographic software will include these files.

Yes, this is not the first time we have stumbled across this issue; a good number of repositories contain those files.

> We need a strategy to handle this situation. Could you add this example to the SWHID v2 task?

Sure, I linked this issue in that task (T3134#75077).

rdicosmo renamed this task from Check failures in save code now requests for GitLab to Dealing with repositories with contents that produces hash conflicts (example included from GitLab). Dec 8 2021, 3:16 PM
rdicosmo updated the task description.
rdicosmo added subscribers: olasd, zack.

Updated the task name and description to reflect the findings from @anlambert.

The main issue that prevents us from archiving these objects today is that our object storage still uses a plain sha1 as primary key (hence the current uniqueness constraint on the sha1 field of the content table in our primary graph storage).

(We also have a uniqueness constraint on sha1_git, because that hash is used in the "swhid v1" of contents and as the target of outbound edges from the other layers of the graph that refer to unique contents. That constraint wouldn't break loading these origins, as you've noticed.)
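The effect of these two constraints on the SHAttered pair can be reproduced with a small sketch. It uses an in-memory SQLite table as a stand-in; the real storage is not SQLite, and the three-column schema below is a simplification assumed for illustration. The hash values are the ones from the traceback above.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[str, str, str]  # (sha1, sha1_git, sha256)

# Hashes of shattered-2.pdf and shattered-1.pdf, copied from the traceback.
SHATTERED_ROWS: List[Row] = [
    (
        "38762cf7f55934b34d179ae6a4c80cadccbb7f0a",
        "b621eeccd5c7edac9b7dcba35a8d5afd075e24f2",
        "d4488775d29bdef7993367d541064dbdda50d383f89f0aa13a6ff2e0894ba5ff",
    ),
    (
        "38762cf7f55934b34d179ae6a4c80cadccbb7f0a",
        "ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0",
        "2bb787a73e37352f92383abe7e2902936d1059ad9f1ba6daaa9c1e58ee6970d0",
    ),
]


def count_accepted(rows: List[Row], unique_sha1: bool) -> int:
    """Insert rows into a toy content table and count how many are accepted."""
    conn = sqlite3.connect(":memory:")
    sha1_constraint = "UNIQUE" if unique_sha1 else ""
    conn.execute(
        f"CREATE TABLE content (sha1 TEXT NOT NULL {sha1_constraint},"
        " sha1_git TEXT NOT NULL UNIQUE,"
        " sha256 TEXT NOT NULL)"
    )
    accepted = 0
    for row in rows:
        try:
            conn.execute("INSERT INTO content VALUES (?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            # The row violates one of the UNIQUE constraints.
            pass
    conn.close()
    return accepted


if __name__ == "__main__":
    print("with sha1 UNIQUE:   ", count_accepted(SHATTERED_ROWS, unique_sha1=True))
    print("without sha1 UNIQUE:", count_accepted(SHATTERED_ROWS, unique_sha1=False))
```

With the sha1 constraint in place only one of the two rows is accepted; without it both go in, since their sha1_git values differ.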

For the ceph-based objstorage, we currently plan to use the (intrinsic) sha256 as key.

The current usage of SHA1 is hardcoded at the swh.storage level: the content-related APIs in swh.storage always use the sha1 field to query the underlying swh.objstorage instance. There are a few bits and bobs of hardcoded sha1 at the swh.objstorage level too.

We won't be blanket-migrating all our storages from a sha1 primary key to a sha256 primary key in any (sensible) amount of time, so we'll need to support a (multi-month) transition period where the objstorage servers will need to respond to queries where clients know both object identifiers. This needs an API change on the objstorage side, and probably layout adaptations for the cloud object storages?
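One possible shape for such a transition-period lookup, as a rough sketch only: the client passes every identifier it knows, the new sha256-keyed backend is tried first, and the legacy sha1-keyed backend is used as a fallback. ToyObjStorage and TransitionObjStorage are hypothetical classes, not the swh.objstorage API, and the sha256 re-check on returned data is an assumption of this sketch rather than a settled design decision.

```python
import hashlib
from typing import Dict, Optional


class ToyObjStorage:
    """In-memory object store keyed by a single hash algorithm."""

    def __init__(self, key_algo: str) -> None:
        self.key_algo = key_algo
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes, hashes: Dict[str, str]) -> None:
        self._objects[hashes[self.key_algo]] = data

    def get(self, hashes: Dict[str, str]) -> Optional[bytes]:
        key = hashes.get(self.key_algo)
        return None if key is None else self._objects.get(key)


class TransitionObjStorage:
    """Try the new sha256-keyed backend first, then the legacy sha1 one."""

    def __init__(self, new: ToyObjStorage, legacy: ToyObjStorage) -> None:
        self.new = new
        self.legacy = legacy

    @staticmethod
    def _matches(data: bytes, hashes: Dict[str, str]) -> bool:
        # If the client knows the sha256, make sure the bytes agree with it,
        # so a sha1 collision in the legacy backend cannot return wrong data.
        expected = hashes.get("sha256")
        return expected is None or hashlib.sha256(data).hexdigest() == expected

    def get(self, hashes: Dict[str, str]) -> Optional[bytes]:
        for backend in (self.new, self.legacy):
            data = backend.get(hashes)
            if data is not None and self._matches(data, hashes):
                return data
        return None


if __name__ == "__main__":
    legacy, new = ToyObjStorage("sha1"), ToyObjStorage("sha256")
    old = b"object written before the migration"
    old_ids = {
        "sha1": hashlib.sha1(old).hexdigest(),
        "sha256": hashlib.sha256(old).hexdigest(),
    }
    legacy.add(old, old_ids)
    storage = TransitionObjStorage(new, legacy)
    print(storage.get(old_ids) == old)
```

Passing a dictionary of identifiers rather than a single hash keeps the client-facing call the same while backends are migrated one at a time.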

Once that API and the new layout on the objstorage side are settled, we can start writing objects to new sha256-based object storages, which will allow us to lift the sha1 uniqueness constraint on the main storage (turning it into a sha256 uniqueness constraint?).

I note that this completely sidesteps the question of whether a standalone sha256 primary key for the object storage is sensible and future-proof, but that seems to be the design we're going for in the new object storage.