Migrate the content store to a new (internal) primary key scheme
Closed, Migrated; Edits Locked

Assigned To

Authored By

	olasd
	Mar 3 2017, 3:37 PM

Description

Our content store, as well as all the associated data, uses the sha1 of the data as a primary key.

This content store primary key is used in several places, notably:

1. As an accessor in our common object storage API
2. To construct on-disk paths and storage API objects
3. As a primary key on the (internal) tables that store metadata associated with each object
4. As a primary key in the archiver database

There are some advantages to the current scheme:

It's intrinsic: you can check the integrity of your data store without storing any metadata about the contents
It shards naturally thanks to statistical properties of the hashes
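
As a rough illustration of both properties, here is a minimal Python sketch. The directory layout (two levels of two-hex-digit prefixes), the root path and the function names are assumptions made for this example; they are not taken from the actual object storage code.

```python
import hashlib
from pathlib import PurePosixPath


def content_id(data: bytes) -> str:
    """Return the hex SHA1 of the data, used here as the object key."""
    return hashlib.sha1(data).hexdigest()


def object_path(root: str, hex_id: str) -> PurePosixPath:
    """Build a sharded path from the key (hypothetical layout).

    Hash prefixes are close to uniformly distributed, so objects
    spread evenly across the prefix directories.
    """
    return PurePosixPath(root) / hex_id[0:2] / hex_id[2:4] / hex_id


def check_integrity(hex_id: str, data: bytes) -> bool:
    """Recompute the hash from the stored bytes; no metadata is needed."""
    return content_id(data) == hex_id
```

With this layout, an integrity check only needs the file name and the file contents, which is what makes the store checkable without any database.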

The scheme has a few drawbacks:

If there's a collision on your hash of choice, you lose
A hash is a big value (20 bytes for SHA1, probably 32+ bytes if we migrate to a successor such as SHA2, SHA3 or Blake2b), which makes any new metadata table instantly and inherently huge.
We want the next scheme to be as future-proof as possible, as migrating is not going to get easier in the future.
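
To put the key size drawback in numbers, here is a small Python sketch comparing the raw size of the key column for different key types. The row count used in the usage note is a made-up figure for illustration, and the estimate ignores per-row and index overhead.

```python
import hashlib

# Key sizes in bytes; digest sizes are read from hashlib rather than hard-coded.
KEY_SIZES = {
    "sha1": hashlib.sha1().digest_size,      # 20 bytes
    "sha256": hashlib.sha256().digest_size,  # 32 bytes
    "bigint": 8,                             # 64-bit sequential integer, for comparison
}


def key_column_bytes(rows: int, key: str) -> int:
    """Raw size of the key column alone, ignoring row and index overhead."""
    return rows * KEY_SIZES[key]
```

For a hypothetical table of one billion rows, `key_column_bytes(1_000_000_000, "sha1")` gives 20 GB of keys, against 32 GB for a 32-byte hash and 8 GB for a 64-bit integer, and that cost is paid again in every metadata table and index that repeats the key.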

Of course, this is only an "internal" matter: our data model and deduplication functionality depend heavily on purely intrinsic hashes, which allow us to retain the properties of a Merkle DAG, as well as letting anyone recompute the intrinsic identifier of any object.

This task is a meta-task to decide the new primary key scheme, with further subtasks to track the individual migration items.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T835 Migrate away from using sha1s as foreign keys in the database
Migrated	gitlab-migration	T698 Migrate the content store to a new (internal) primary key scheme

Event Timeline

olasd created this task. Mar 3 2017, 3:37 PM

zack added a subscriber: zack. Mar 3 2017, 3:40 PM

zack updated the task description. Mar 6 2017, 2:56 PM

Ack on the principle. But noting down a caveat for use case (3).

The tables associating metadata (or indexing terms) to content objects will be useful in general, even when extracted from the rest of the Software Heritage database. There is hence some value in having "meaningful" keys for those tables; it's just dumb to do so in the form those tables take when stored in our main database, because we waste space for no good reason.
This is to say that if/when we end up hosting those tables on 3rd party hosting, isolated from the rest of the DB, we will need to make sure that they use "meaningful" keys there.

I agree that metadata exports need to keep meaningful intrinsic identifiers as well.

The obvious, trivial option for our new primary key would be just using a sequential identifier.

However, I really like the "intrinsic hash" property of our current object store primary key scheme, which gives us the opportunity to detect some corruption even if all the copies of the main Software Heritage database disappear: it makes the object store kind of standalone.

Some combination of a (short) hash, e.g. Blake2b with 8- or 16-byte output, with a sequential integer would IMO be a reasonable compromise.
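
One way such a combined key could look, as a minimal Python sketch: the `ContentKey` type, the function names and the in-process counter (a stand-in for a database sequence) are all hypothetical, and nothing here reflects a decided design. It relies on `hashlib.blake2b`, whose `digest_size` parameter selects the output length directly.

```python
import hashlib
import itertools
from typing import NamedTuple


class ContentKey(NamedTuple):
    seq: int           # sequential identifier, compact and unique
    short_hash: bytes  # short-output Blake2b of the content


# Stand-in for a database sequence (assumption for this sketch).
_counter = itertools.count(1)


def short_hash(data: bytes, size: int = 8) -> bytes:
    """Blake2b digest of the content with a `size`-byte output."""
    return hashlib.blake2b(data, digest_size=size).digest()


def new_key(data: bytes, size: int = 8) -> ContentKey:
    """Allocate the next sequential id and pair it with the short hash."""
    return ContentKey(next(_counter), short_hash(data, size))


def looks_intact(key: ContentKey, data: bytes) -> bool:
    """Detect some corruption from the key and the stored bytes alone."""
    return short_hash(data, len(key.short_hash)) == key.short_hash
```

In this sketch the sequential integer guarantees uniqueness, so a collision on the short hash only weakens corruption detection; it does not make two objects share a key.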

olasd mentioned this in D200: Deal with new checksum blake2s256 in storage. Mar 27 2017, 1:40 PM

olasd added a parent task: T835: Migrate away from using sha1s as foreign keys in the database. Nov 6 2017, 2:38 PM

This task has been migrated to GitLab.
