
Persistent readonly perfect hash table
Open, Normal, Public

Description

A Python module implemented in C that ingests key/value pairs, writes them to a file, and builds a hash table, stored in the same file, to be used as an index. The file is a Shard in the Read Storage.

Format

  • Format version
  • An index which is a hash table
  • The content of the objects

Let's assume the shard stores the following objects:

  • foo (len=3, sha256=2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae)
  • bar (len=3, sha256=fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9)
  • baz (len=3, sha256=baa5a0964d3320fbc0c6a922140453c8513ea24ab8fd0577034804a967248096)
  • quux (len=4, sha256=053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)
  • (the empty object, len=0, sha256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)
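These lengths and digests can be reproduced with Python's standard library:

import hashlib

for obj in [b"foo", b"bar", b"baz", b"quux", b""]:
    print(len(obj), hashlib.sha256(obj).hexdigest())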

The content of the file would look like this:

header       | 00 |
             +----+                                                  
        0000 | 01 |
             +----+                                                  

index        |         sha256         |      offset       |        size       |
             | 00   01   ..   30   31 | 32   ..   38   39 | 40   ..   46   47 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
        0001 | 2c | 26 | .. | e7 | ae | 00 | .. | 00 | f1 | 00 | .. | 00 | 03 |
        0031 | fc | de | .. | 8f | b9 | 00 | .. | 00 | f4 | 00 | .. | 00 | 03 |
        0061 | ba | a5 | .. | 09 | 96 | 00 | .. | 00 | f7 | 00 | .. | 00 | 03 |
        0091 | 05 | 30 | .. | a0 | a2 | 00 | .. | 00 | fa | 00 | .. | 00 | 04 |
        00c1 | e3 | b0 | .. | b8 | 55 | 00 | .. | 00 | 00 | 00 | .. | 00 | 00 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
			 
payload      +----+----+----+
        00f1 | 66 | 6f | 6f |
        00f4 | 62 | 61 | 72 |  
        00f7 | 62 | 61 | 7a |
        00fa | 71 | 75 | 75 | 78 |
             +----+----+----+----+
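Each index entry is thus a fixed 48-byte record: a 32-byte SHA256 followed by an 8-byte offset and an 8-byte size, apparently big-endian given the example bytes. A minimal sketch of packing and unpacking one entry under that assumption:

import struct

ENTRY = struct.Struct(">32sQQ")  # 32-byte sha256, 8-byte offset, 8-byte size

sha256 = bytes.fromhex(
    "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae")
entry = ENTRY.pack(sha256, 0xF1, 3)      # the "foo" row of the index above
key, offset, size = ENTRY.unpack(entry)  # round-trips to (sha256, 0xf1, 3)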

Writing

It is assumed that writing is done in batch, sequentially.
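A minimal sketch of such a batch write, assuming the layout above (1-byte version header, 48-byte index entries). The perfect hash function is taken as a given position() callable, since its construction is discussed later in this thread:

import struct

ENTRY = struct.Struct(">32sQQ")  # sha256, offset, size

def write_shard(path, objects, position):
    # objects: dict mapping a sha256 (bytes) to the object content (bytes)
    # position: perfect hash function mapping each key one-to-one
    #           onto range(len(objects))
    payload_base = 1 + len(objects) * ENTRY.size   # header byte + index
    index = [None] * len(objects)
    payload = bytearray()
    for key, content in objects.items():
        offset = payload_base + len(payload) if content else 0
        index[position(key)] = ENTRY.pack(key, offset, len(content))
        payload += content
    with open(path, "wb") as f:
        f.write(b"\x01")          # format version
        f.writelines(index)       # the hash table, one 48-byte entry per slot
        f.write(payload)          # object contents, appended sequentially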

Reading

  • HASH(SHA256) gives the position of the object's entry in the index, which holds its offset and size
  • Seek to the object content and stream it to the caller in chunks of a given size
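A matching read sketch under the same assumptions (position() is again the perfect hash function; the chunk size is arbitrary):

import struct

ENTRY = struct.Struct(">32sQQ")

def read_object(path, key, position, chunk_size=65536):
    # Stream the content of the object whose sha256 is `key`.
    with open(path, "rb") as f:
        f.seek(1 + position(key) * ENTRY.size)   # one seek into the index
        stored, offset, size = ENTRY.unpack(f.read(ENTRY.size))
        assert stored == key                     # guards against absent keys
        f.seek(offset)                           # one seek into the payload
        while size > 0:
            chunk = f.read(min(chunk_size, size))
            size -= len(chunk)
            yield chunk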

Event Timeline

dachary changed the task status from Open to Work in Progress. Mar 8 2021, 10:08 PM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.
dachary renamed this task from Using a custom Hash Table format to Persistent readonly perfect hash table. Jul 19 2021, 2:29 PM
dachary updated the task description.

With the available explanations I don't understand how this format works.

Let's assume the shard would store the following objects:

  • foo (len=3, sha256=2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae)
  • bar (len=3, sha256=fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9)
  • baz (len=3, sha256=baa5a0964d3320fbc0c6a922140453c8513ea24ab8fd0577034804a967248096)
  • quux (len=4, sha256=053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)
  • (the empty object, len=0, sha256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)

What would

  • the write operation for all objects
  • the resulting on-disk format
  • a read operation for an object of a given hash (let's say 053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)

look like?

My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, sorted, at the beginning of the file, and to do the read of an index entry as a bisection of this list of tuples. This would mean that the write operation can stream all the objects, but needs to keep the list of hash/offset/length tuples in memory so it can sort them and write them to the file at the end.
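For illustration, the lookup side of this naive proposal is a standard bisection (a sketch, assuming the sha256 column has been extracted from the sorted tuples):

import bisect

def index_lookup(sorted_keys, key):
    # sorted_keys: the sha256 column of the sorted (sha256, offset, length) tuples
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i              # row holding the matching (offset, length)
    return None               # not in this shard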

The content of a file:

  • Byte 0: 1 # the format which includes the parameters to the perfect hash function
  • I0 Byte +1: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
  • Byte +32: \O1 # encoded on 8 bytes
  • L0 Byte +8: 3 # length of foo encoded on 8 bytes
  • I1 Byte +8: fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
  • Byte +32: \O2 # encoded on 8 bytes
  • L1 Byte +8: 3 # length of bar encoded on 8 bytes
  • \O1 Byte +32: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
  • Byte +32: foo
  • \O2 Byte +3: fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
  • Byte +32: bar

Hopefully that clarifies how the file is created, but I can go into details if that's still unclear.

Reading fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 is:

  • hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1 # entry one
  • 1 * size of an entry in the hash table (32 + 8 + 8) + size of the header == I1
  • seek I1
  • read the offset \O1 and the length L1 from I1
  • seek \O1
  • read SHA256, read L1 bytes
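Concretely, the second step is fixed-size arithmetic:

HEADER_SIZE = 1                          # the format version byte
ENTRY_SIZE = 32 + 8 + 8                  # sha256 + offset + length

entry = 1                                # hash(fcde2b2e...) => 1
I1 = HEADER_SIZE + entry * ENTRY_SIZE    # == 49, the position to seek to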

My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, ...

Which the index is.

do the read of an index entry as a bisection of this list of tuples.

A bisection would be O(log(N)) where the proposed format is O(1), and that will have a significant impact as the storage grows: for the 100GB Shard with 25M objects discussed below, a bisection needs up to log2(25M) ≈ 25 index probes. With the proposed format, reading an object requires only one additional read in the index instead of several.

sorry I don't understand everything here:

  • what are "the parameters to the perfect hash functions"? what are the possible formats?
  • what are \O<n>, I<n> and L<n> in this example? are they positions of actual objects?

Is this the resulting table from @olasd's example:

header       | 00 |
             +----+                                                  
        0000 | 01 |
             +----+                                                  

index        |         sha256         |      offset       |        size       |
             | 00   01   ..   30   31 | 32   ..   38   39 | 40   ..   46   47 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
        0001 | 2c | 26 | .. | e7 | ae | 00 | .. | 00 | f1 | 00 | .. | 00 | 03 |
        0031 | fc | de | .. | 8f | b9 | 00 | .. | 00 | f4 | 00 | .. | 00 | 03 |
        0061 | ba | a5 | .. | 09 | 96 | 00 | .. | 00 | f7 | 00 | .. | 00 | 03 |
        0091 | 05 | 30 | .. | a0 | a2 | 00 | .. | 00 | fa | 00 | .. | 00 | 04 |
        00c1 | e3 | b0 | .. | b8 | 55 | 00 | .. | 00 | 00 | 00 | .. | 00 | 00 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
			 
payload      +----+----+----+
        00f1 | 66 | 6f | 6f |
        00f4 | 62 | 61 | 72 |  
        00f7 | 62 | 61 | 7a |
        00fa | 71 | 75 | 75 | 78 |
             +----+----+----+----+

Then for the reading, I don't understand this item "hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1"
What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?

what are "the parameters to the perfect hash functions"? what are the possible formats?

Given a set of keys, calculating the perfect hash function that will avoid any collision yields a set of parameters. The lookup function needs these parameters in addition to the key to figure out the position of the entry in the hash table. That is a few bits per entry (see https://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf for instance). The possible formats depend on the actual function being implemented.
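To make these parameters concrete, here is a toy hash-and-displace construction in the same spirit (a deliberate simplification, not the CHD algorithm from the paper): the build step searches, for each bucket, a displacement under which its keys collide with nothing, and the resulting displacement array is exactly the parameter set the lookup function needs alongside the key.

import hashlib

def h(key, seed, mod):
    # seeded hash for illustration; any fast hash family works here
    digest = hashlib.sha256(seed.to_bytes(4, "big") + key).digest()
    return int.from_bytes(digest[:8], "big") % mod

def build(keys):
    m = len(keys)                        # minimal: one slot per key
    r = max(1, m // 2)                   # number of buckets
    buckets = [[] for _ in range(r)]
    for key in keys:
        buckets[h(key, 0, r)].append(key)
    displacements = [0] * r              # these are the "parameters"
    used = set()
    for b in sorted(range(r), key=lambda i: -len(buckets[i])):
        d = 0
        while True:
            slots = {h(key, 1 + d, m) for key in buckets[b]}
            if len(slots) == len(buckets[b]) and not used & slots:
                break
            d += 1
        displacements[b] = d
        used |= slots
    return displacements

def lookup(key, displacements, m):
    b = h(key, 0, len(displacements))
    return h(key, 1 + displacements[b], m)

Given the keys of a shard, lookup(key, build(keys), len(keys)) returns a distinct slot in [0, len(keys)) for every key, in O(1). A key absent from the set still maps to some slot, which is one reason the index entries above store the full SHA256: the read path can verify it matches.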

what are \O<n>, I<n> and L<n> in this example? are they positions of actual objects?

Let's forget about that and use @olasd's example and your very nice table drawn from it. I thought it would be useful to have some redundancy of the SHA256 so that both the index and the content are self-contained, but it is a detail.

What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?

It is the hash function that, given the SHA256 of an object, will return the position in the index of the entry that has the offset of the corresponding object.

dachary updated the task description.

I just realized that since a perfect hash function needs parameters that may require additional sequential reads at the beginning of the file, it would actually make more sense to have a regular hash function with a format that allows for collisions. Even if collisions are relatively frequent, the colliding entries may be stored adjacent to each other and will not require an additional read. They are likely to be in the same block most of the time. That would save the trouble of implementing a perfect hash function.

the colliding entries may be stored adjacent to each other...

This cannot be done. Bad idea...

A 100GB file can have 25M objects (4KB median size). If a perfect hash function requires 4 bits per entry, that's reading ~12MB (25M × 4 bits ≈ 12.5MB) for every lookup.

There is also the option of storing the offset of the object in the global index. The per-shard index is only necessary when there is no global index.

The Compress, Hash and Displace (CHD) algorithm described in http://cmph.sourceforge.net/papers/wads07.pdf generates a hash function under 4MB for ~30M keys of 32 bytes each.

$ dd if=/dev/urandom bs=32 count=25000000 | base64 --wrap 32 > /tmp/r
$ ls -lh /tmp/r
-rw-rw-r-- 1 loic loic 1,1G juil. 19 19:11 /tmp/r
$ sudo apt install cmph-tools
$ time cmph -a chd_ph -t 2 -g /tmp/r
space lower bound is 0.000 bits per key
Starting mapping step for mph creation of 33333334 keys with 33333347 bins
Starting ordering step
Starting searching step
Starting compressing step
Successfully generated minimal perfect hash function

real	0m16,081s
user	0m15,256s
sys	0m0,816s
$ ls -lh /tmp/r.mph
-rw-rw-r-- 1 loic loic 3,0M juil. 19 20:39 /tmp/r.mph

Pre-loading all of them in memory would cost at most ~4MB for each 100GB Shard, i.e. 40GB RAM for 1PB (10,000 Shards × 4MB). It may not even be necessary to explicitly cache them since they will be read frequently.

In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store). This way, the per-shard index only gets used for standalone operation, which would (probably?) be an edge case. As I don't really grasp the specifics of the hashing algorithm, I have a hard time understanding what the actual performance will look like. log2(number of objects in a bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm?

In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store)

This is tempting, I agree. The only problem is that the global index does not scale out. There may be a solution involving a different method to store the global index in the Write Storage (Postgres) and the Read Storage (hash table). Or a mix of both. The goal would be to make it so the probability that looking up an object in the Read Storage requires a lookup in the global index of the Write Storage decreases as the size of the Read Storage increases.

log2(number objects in bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm

The CPU requirements of the hash function are marginal for lookups, and it is quite fast at build time (seconds in a global process that requires minutes). Even a single additional read to retrieve one object is expensive because I/O is the bottleneck. The index of a 100GB Shard is about 1GB, and 2+ lookups in the index will require 2+ I/Os: they are unlikely to be in cache.

dachary changed the task status from Work in Progress to Open. Sun, Aug 29, 1:08 PM