
Migrate the content store to a new (internal) primary key scheme
Closed, Migrated (Edits Locked)

Description

Our content store, as well as all the associated data, uses the SHA1 of the data as a primary key.

This content store primary key is used in several places, notably:

  1. As an accessor in our common object storage API
  2. To construct on-disk paths and storage API objects (see the sketch after this list)
  3. As a primary key on the (internal) tables that store metadata associated with each object.
  4. As a primary key in the archiver database
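
A concrete illustration of item 2: a minimal sketch, in Python, of how a content's SHA1 can be mapped to a sharded on-disk path. The two-level slicing and the root directory are assumptions made up for the example, not a description of the actual production layout.

```python
import hashlib


def object_path(content: bytes, root: str = "/srv/storage/objects") -> str:
    """Derive an on-disk path from the SHA1 of the content.

    The slicing (two 2-hex-digit directory levels) is an assumption for
    illustration purposes only.
    """
    hex_id = hashlib.sha1(content).hexdigest()
    return f"{root}/{hex_id[0:2]}/{hex_id[2:4]}/{hex_id}"


print(object_path(b"hello world"))
# /srv/storage/objects/2a/ae/2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
```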

There are some advantages to the current scheme:

  • It's intrinsic: you can check the integrity of your data store without storing any metadata about the contents (see the sketch below)
  • It shards naturally thanks to statistical properties of the hashes
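
A minimal sketch of the first advantage, assuming (as in the path example above) that each object is stored under its hex SHA1: any copy of the object store can be checked on its own, without consulting any database.

```python
import hashlib
from pathlib import Path


def verify_object(path: Path) -> bool:
    """Recompute the SHA1 of the stored bytes and compare it with the
    identifier encoded in the file name; no external metadata is needed."""
    expected = path.name  # the primary key, i.e. the hex SHA1
    actual = hashlib.sha1(path.read_bytes()).hexdigest()
    return actual == expected
```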

The scheme has a few drawbacks:

  • If there's a collision on your hash of choice, you lose
  • A hash is a big value (20 bytes for SHA1, probably 32+ bytes if we migrate to a successor such as SHA2, SHA3 or Blake2b), which makes any new metadata table instantly and inherently huge (a rough size estimate follows this list).
  • We want the next scheme to be as future-proof as possible, as migrating is not going to get easier in the future.
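
To make the size argument concrete, a back-of-the-envelope estimate. The row count below is a hypothetical order of magnitude, not an actual Software Heritage figure, and index and page overhead are ignored; every additional metadata table repeats the key, so the difference multiplies across the whole schema.

```python
# Raw size of the key values alone, for a hypothetical number of content rows.
n_contents = 5_000_000_000  # made-up order of magnitude

for name, key_bytes in [("sha1", 20), ("sha256 / blake2b", 32), ("bigint", 8)]:
    total_gb = n_contents * key_bytes / 1e9
    print(f"{name:>16}: ~{total_gb:,.0f} GB per key column")
```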

Of course, this is only an "internal" matter: our data model and deduplication functionality heavily depend on purely intrinsic hashes, which allow us to retain the properties of a Merkle DAG, as well as letting anyone recompute the intrinsic identifier of any object.

This task is a meta-task to decide the new primary key scheme, with further subtasks to track the individual migration items.

Event Timeline

Ack on the principle. But noting down a caveat for use case (3).

The tables associating metadata (or indexing terms) with content objects will be useful in general, even when extracted from the rest of the Software Heritage database. There is hence some value in having "meaningful" keys for those tables; it's just wasteful to use them in the form those tables take when stored in our main database, because we waste space for no good reason.
This is to say that if/when we end up hosting those tables on 3rd-party hosting, isolated from the rest of the DB, we will need to make sure that they use "meaningful" keys there.

I agree that metadata exports need to keep meaningful intrinsic identifiers as well.

The obvious, trivial option for our new primary key would be to just use a sequential identifier.

However, I really like the "intrinsic hash" property of our current object store primary key scheme, which gives us the opportunity to detect some corruption even if all copies of the main Software Heritage database disappear: it makes the object store kind of standalone.

Some combination of a (short) hash - e.g. Blake2b with an 8- or 16-byte output - with a sequential integer would IMO be a reasonable compromise.
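
A minimal sketch of what such a composite key could look like, using Python's hashlib (blake2b accepts a configurable digest_size). The in-process counter merely stands in for whatever database sequence would actually allocate the integer part.

```python
import hashlib
from itertools import count

_sequence = count(1)  # stand-in for a real database sequence


def make_key(content: bytes) -> tuple[int, bytes]:
    """Pair a sequential integer with a short intrinsic hash of the content.

    digest_size=8 yields an 8-byte Blake2b digest; 16 would work the same way.
    """
    short_hash = hashlib.blake2b(content, digest_size=8).digest()
    return next(_sequence), short_hash


print(make_key(b"hello world"))
```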