loader core should not rely on SHA1 only to decide whether some content is missing or not
Closed, Migrated; Edits Locked

Authored By

	zack
	Jul 28 2016, 12:49 PM

Description

The culprit lines seem to use only sha1 to decide whether some content is missing. So, in case of a sha1 collision, we might end up not adding a content to the storage while believing we have done so.
Instead, we should always check for the presence of a content using all of its checksums at once, as we have discussed many times during the initial design of the loaders.
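
To make the concern concrete, here is a minimal sketch contrasting a sha1-only presence check with a check on all checksums. It is not the loader or storage code: the in-memory index, the function names, the set of checksum columns (sha1, sha1_git, sha256), and the short fake hash values are all assumptions chosen for illustration.

```python
# Hypothetical in-memory index: one dict of checksums per stored content.
# The hash values are short fake strings, not real digests.
STORED = [
    {"sha1": "aa11", "sha1_git": "bb22", "sha256": "cc33"},
]

# Assumed set of checksum columns.
CHECKSUM_KEYS = ("sha1", "sha1_git", "sha256")


def is_missing_sha1_only(content):
    """Report a content as missing only if no stored entry shares its sha1."""
    return all(entry["sha1"] != content["sha1"] for entry in STORED)


def is_missing_all_checksums(content):
    """Report a content as missing unless a stored entry matches every checksum."""
    return all(
        any(entry[key] != content[key] for key in CHECKSUM_KEYS)
        for entry in STORED
    )


# A different content that (hypothetically) collides on sha1 only.
colliding = {"sha1": "aa11", "sha1_git": "dd44", "sha256": "ee55"}

# The sha1-only check wrongly treats the colliding content as already stored,
# so it would never be added. The all-checksums check reports it as missing.
print(is_missing_sha1_only(colliding))      # False
print(is_missing_all_checksums(colliding))  # True
```

With the sha1-only check the colliding content is silently skipped, which is exactly the "believing we have done so" scenario described above.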

This is not problematic yet, as this code has not been deployed in production for any VCS. (The Git loader always uses the add API entrypoint of storage, which uses all checksums at once and will bail in case of collisions.) But it is a blocker for deploying loaders that use the base loader (e.g., SVN).

Also, we need to review if that code was around at the time we did the initial test injection of GNU tarballs and Debian Source packages. If so, we should double check our local mirrors for SHA1 collisions.

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T617 ingest Google Code Subversion repositories
		Unknown Object (Maniphest Task)
		Unknown Object (Maniphest Task)
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T328 svn / subversion loader
Migrated	gitlab-migration	T519 loader core should not rely on SHA1 only to decide whether some content is missing or not

Event Timeline

zack created this task. Jul 28 2016, 12:49 PM

zack added a parent task: T328: svn / subversion loader.

The culprit lines seem to use only sha1 to decide whether some content is missing...

No, we use all the checksums.
I think the key_hash parameter of the content_missing function is misleading.
content_missing uses all of a content's checksums to determine whether that content is missing, and returns the key_hash column (here, the sha1) of the contents that are missing.

Maybe renaming key_hash to return_keyhash (or missing_keyhash or something similar) would be reasonable?
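
The behaviour described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the storage implementation: the function signature, the `stored` argument, the checksum columns, and the fake hash values are hypothetical. Only the role of key_hash (selecting which column is returned, not which column is matched on) comes from the comment.

```python
# Assumed set of checksum columns used for matching.
CHECKSUM_KEYS = ("sha1", "sha1_git", "sha256")


def content_missing(stored, contents, key_hash="sha1"):
    """Yield the key_hash value of each content absent from `stored`.

    Matching uses every checksum in CHECKSUM_KEYS; key_hash only selects
    which checksum is returned to identify the missing contents.
    """
    if key_hash not in CHECKSUM_KEYS:
        raise ValueError(f"unknown key_hash: {key_hash!r}")
    known = {tuple(entry[key] for key in CHECKSUM_KEYS) for entry in stored}
    for content in contents:
        if tuple(content[key] for key in CHECKSUM_KEYS) not in known:
            yield content[key_hash]


# Fake hash values for illustration only.
stored = [{"sha1": "aa11", "sha1_git": "bb22", "sha256": "cc33"}]
present = {"sha1": "aa11", "sha1_git": "bb22", "sha256": "cc33"}
colliding = {"sha1": "aa11", "sha1_git": "dd44", "sha256": "ee55"}
new = {"sha1": "ff66", "sha1_git": "0077", "sha256": "1188"}

missing = list(content_missing(stored, [present, colliding, new]))
print(missing)  # ['aa11', 'ff66']
```

Note that the colliding content is reported as missing even though its sha1 matches a stored entry, because the comparison is made on the full tuple of checksums.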

Also, we need to review if that code was around at the time we did the initial test injection of GNU tarballs and Debian Source packages. If so, we should double check our local mirrors for SHA1 collisions.

This was not around for those initial injections.

ardumont closed this task as Invalid. Aug 17 2016, 10:44 AM

ardumont mentioned this in T1603: kafka storage backfiller. Apr 2 2019, 12:10 PM

This task has been migrated to GitLab.
