Page MenuHomeSoftware Heritage

loader core should not rely on SHA1 only to decide whether some content is missing or not
Closed, MigratedEdits Locked

Description

The culprit lines seem to use only sha1 to decide whether some content is missing. So, in case of sha1 collision, we might end up not adding something to the storage, but believing we have done so.
Instead, we should always check for the presence of something using all checksums at once, as we have discussed many times during the initial design of loaders.

This is not problematic yet, as it has not been deployed in production for any VCS yet. (The Git loader always uses the add API entrypoint of storage, which uses all checksums at once and will bail in case of collisions.) But it is a blocker for deploying loaders that use the base loader (e.g., SVN).

Also, we need to review if that code was around at the time we did the initial test injection of GNU tarballs and Debian Source packages. If so, we should double check our local mirrors for SHA1 collisions.

Related Objects

StatusAssignedTask
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration
Migratedgitlab-migration

Event Timeline

The culprit lines seem to use only sha1 to decide whether some content is missing...

No, we use all the checksums.
I think the key_hash parameter of content_missing's function is misleading.
content_missing uses all the content's checksums to determine if a content is missing or not and return the key_hash column of those missing (here the sha1).

Maybe renaming key_hash to return_keyhash (or missing_keyhash or something) seems reasonable?

Also, we need to review if that code was around at the time we did the initial test injection of GNU tarballs and Debian Source packages. If so, we should double check our local mirrors for SHA1 collisions.

This was not around for those initial injections.