Our content store, as well as all the associated data, uses the SHA1 of the data as its primary key.
This content store primary key is used in several places, notably:
- As an accessor in our common object storage API
- To construct on-disk paths and storage API objects
- As a primary key on the (internal) tables that store metadata associated with each object.
- As a primary key in the archiver database
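As an illustration of the first two usages, here is a minimal sketch of how a content hash can double as an accessor key and an on-disk path. The `storage_path` helper and the `/srv/objects` root are hypothetical, not our actual storage code:

```python
import hashlib

def storage_path(root: str, hex_sha1: str, depth: int = 2, width: int = 2) -> str:
    """Derive an on-disk path from a hex SHA1 by splitting off prefix
    directories, e.g. 'abcdef...' -> root/ab/cd/abcdef..."""
    parts = [hex_sha1[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join([root, *parts, hex_sha1])

content = b"hello world"
key = hashlib.sha1(content).hexdigest()  # primary key and accessor
print(storage_path("/srv/objects", key))
# -> /srv/objects/2a/ae/2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
```

The same hex string would then serve as the primary key column in the metadata tables.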
There are some advantages to the current scheme:
- It's intrinsic: you can check the integrity of your data store without storing any metadata about the contents
- It shards naturally thanks to statistical properties of the hashes
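The natural sharding follows from the digest bytes being uniformly distributed, so any prefix of the hash can be used as a shard index. A quick sketch (the `shard_of` helper and shard count are illustrative, not part of our scheme):

```python
import hashlib

def shard_of(hex_digest: str, num_shards: int = 16) -> int:
    # The leading bytes of a cryptographic hash are uniformly
    # distributed, so taking them modulo num_shards balances the shards.
    return int(hex_digest[:4], 16) % num_shards

# Hash 10000 distinct contents and count how many land in each shard.
counts = [0] * 16
for i in range(10000):
    digest = hashlib.sha1(str(i).encode()).hexdigest()
    counts[shard_of(digest)] += 1

print(min(counts), max(counts))  # every shard stays close to 10000/16 = 625
```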
The scheme has a few drawbacks:
- If there's a collision on your hash of choice, you lose: two distinct contents map to the same key, and the store cannot hold both
- A hash is a big value (20 bytes for SHA1, probably 32+ bytes if we migrate to a successor such as SHA2, SHA3 or Blake2b), which makes any new metadata table instantly and inherently huge.
- Migrating is costly and is not going to get easier as the archive grows, so we want the next scheme to be as future-proof as possible.
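To put numbers on the size drawback, the digest sizes of the current hash and of the candidate successors can be read straight off `hashlib` (BLAKE2b's output length is configurable; 32 bytes is assumed here):

```python
import hashlib

# Digest sizes, in bytes, of the current hash and some candidate successors.
sizes = {
    "sha1": hashlib.sha1().digest_size,                         # 20
    "sha256": hashlib.sha256().digest_size,                     # 32
    "sha3_256": hashlib.sha3_256().digest_size,                 # 32
    "blake2b_256": hashlib.blake2b(digest_size=32).digest_size, # 32, tunable
}
print(sizes)
```

Every metadata table keyed on the hash pays that per-row cost again, plus the index overhead on top.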
Of course, this is only an "internal" matter: our data model and deduplication functionality heavily depend on purely intrinsic hashes, which allow us to retain the properties of a Merkle DAG, as well as letting anyone recompute the intrinsic identifier of any object.
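That recomputation property is what the first advantage above relies on: anyone holding an object can re-derive its identifier and compare. A minimal sketch, assuming plain SHA1 over the raw bytes (the `check_integrity` helper is hypothetical):

```python
import hashlib

def check_integrity(key: str, content: bytes) -> bool:
    """An object is intact iff its recomputed hash matches its primary key;
    no stored metadata is needed to perform the check."""
    return hashlib.sha1(content).hexdigest() == key

content = b"some archived object"
key = hashlib.sha1(content).hexdigest()
print(check_integrity(key, content))                   # True
print(check_integrity(key, content + b"bit-rot"))      # False
```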
This task is a meta-task to decide the new primary key scheme, with further subtasks to track the individual migration items.