create and lookup a Read Shard with a perfect hash

This package is intended to be used by the new object storage, as
a low-level dependency to create and look up a Read Shard.

It is implemented in C and based on the cmph library for better
performance. It will be used when a Read Shard must be created with
around fifty million objects, totaling around 100 GB.

The objects and their keys (their cryptographic signatures) will be
retrieved, in Python, from the PostgreSQL database where the Write
Shard lives. They will be inserted in the Read Shard one after the
other using the write method. In the end, the save method will create
the perfect hash table using the cmph library and store it in the
file (it typically takes a few seconds).
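
The write-then-save flow described above can be sketched in Python as
follows. This is only a toy model, not the package's API: the class name
ToyShardWriter, the size-prefixed layout and the sorted (key, offset)
table written by save are all assumptions made for illustration. The
real package builds a cmph perfect hash at that point instead.

```python
import hashlib
import io
import struct


class ToyShardWriter:
    """Hypothetical model: append objects sequentially, index them on save."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = {}  # key -> offset where the object's size header starts

    def write(self, key, obj):
        # Remember where the object starts, then append size header + payload.
        self.offsets[key] = self.stream.tell()
        self.stream.write(struct.pack("<Q", len(obj)))
        self.stream.write(obj)

    def save(self):
        # Stand-in for the perfect hash table: append a sorted
        # (key, offset) table and return the position where it starts.
        index_position = self.stream.tell()
        for key in sorted(self.offsets):
            self.stream.write(key + struct.pack("<Q", self.offsets[key]))
        return index_position


stream = io.BytesIO()
writer = ToyShardWriter(stream)
for payload in (b"first object", b"second"):
    # SHA-256 is assumed here as the cryptographic signature used as key.
    writer.write(hashlib.sha256(payload).digest(), payload)
index_position = writer.save()
```

Every byte goes to the stream exactly once and in order, which is the
property the real implementation relies on.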

There is no write amplification during the creation of the Read Shard:
each byte is written exactly once, sequentially. There is no read
operation. The memory footprint is 2*n*32 where n is the number of
inserted keys.
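
As a rough worked example of the formula above, for the fifty million
objects mentioned earlier, and assuming the result is expressed in bytes
(the unit is not stated here, so this is an assumption):

```python
KEY_SIZE = 32   # the constant from 2*n*32, assumed to be bytes per key
n = 50_000_000  # around fifty million objects

footprint = 2 * n * KEY_SIZE      # assumed to be in bytes
footprint_gb = footprint / 10**9  # decimal gigabytes
print(footprint, footprint_gb)
```

Under that assumption the footprint is about 3.2 GB for a Read Shard of
around 100 GB.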

The lookup method relies on the hash function, which is loaded in
memory when the load function is called. It obtains the offset of the
object in the file from an index which may have up to 2x as many
entries as there are keys (the perfect hash is not minimal).
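
The idea of a non-minimal perfect hash with an index of up to 2x the
number of keys can be illustrated with a small Python sketch. It is not
the algorithm cmph uses: it simply tries seeds until every key lands in
its own slot, which is only practical for a handful of keys. The names
slot, build_index and lookup are hypothetical.

```python
import hashlib


def slot(key, seed, size):
    # Seeded hash of the key, reduced to a position in the index.
    digest = hashlib.sha256(seed.to_bytes(4, "little") + key).digest()
    return int.from_bytes(digest[:8], "little") % size


def build_index(offsets):
    """Return (seed, index) where index has 2x as many slots as keys."""
    size = 2 * len(offsets)
    for seed in range(100_000):
        index = [None] * size
        for key, offset in offsets.items():
            position = slot(key, seed, size)
            if index[position] is not None:
                break  # two keys collide with this seed, try the next one
            index[position] = offset
        else:
            return seed, index  # no collision: the hash is perfect
    raise RuntimeError("no collision-free seed found")


def lookup(key, seed, index):
    # One hash computation and one index read, no probing.
    return index[slot(key, seed, len(index))]


# Toy keys mapped to made-up offsets.
offsets = {hashlib.sha256(bytes([i])).digest(): i * 100 for i in range(8)}
seed, index = build_index(offsets)
```

Half of the slots stay empty, which is the price of not being minimal.
A key that was never inserted still hashes to some slot, so in this
sketch a lookup for it returns either None or the offset of another
object.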

Signed-off-by: Loïc Dachary <>