
Persistent readonly perfect hash table
Closed, Migrated

Description

A Python module implemented in C that ingests key/value pairs, writes them to a file, and creates a hash table, stored in the same file, to be used as an index. The file is a Shard in the Read Storage.

Format

  • Format version
  • An index which is a hash table
  • The content of the objects

Let's assume the shard stores the following objects:

  • foo (len=3, sha256=2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae)
  • bar (len=3, sha256=fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9)
  • baz (len=3, sha256=baa5a0964d3320fbc0c6a922140453c8513ea24ab8fd0577034804a967248096)
  • quux (len=4, sha256=053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)
  • (the empty object, len=0, sha256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)

The content of the file would look like this:

header       | 00 |
             +----+                                                  
        0000 | 01 |
             +----+                                                  

index        |         sha256         |      offset       |        size       |
             | 00   01   ..   30   31 | 32   ..   38   39 | 40   ..   46   47 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
        0001 | 2c | 26 | .. | e7 | ae | 00 | .. | 00 | f1 | 00 | .. | 00 | 03 |
        0031 | fc | de | .. | 8f | b9 | 00 | .. | 00 | f4 | 00 | .. | 00 | 03 |
        0061 | ba | a5 | .. | 09 | 96 | 00 | .. | 00 | f7 | 00 | .. | 00 | 03 |
        0091 | 05 | 30 | .. | a0 | a2 | 00 | .. | 00 | fa | 00 | .. | 00 | 04 |
        00c1 | e3 | b0 | .. | b8 | 55 | 00 | .. | 00 | 00 | 00 | .. | 00 | 00 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
			 
payload      +----+----+----+
        00f1 | 66 | 6f | 6f |
        00f4 | 62 | 61 | 72 |  
        00f7 | 62 | 61 | 7a |
        00fa | 71 | 75 | 75 | 78 |
             +----+----+----+----+

Writing

It is assumed that writing is done in batch, sequentially.
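
A minimal sketch of such a batch write, under the assumption that the bucket assignment is simply the insertion order (in reality the perfect hash function decides where each index entry lands); write_shard is a hypothetical name, not the actual API:

import hashlib
import struct

HEADER = struct.Struct(">B")
ENTRY = struct.Struct(">32sQQ")

def write_shard(path, objects):
    # Hypothetical batch writer: the index is written first, then the
    # payloads, sequentially. `objects` is a list of byte strings.
    payload_start = HEADER.size + len(objects) * ENTRY.size
    with open(path, "wb") as f:
        f.write(HEADER.pack(1))  # format version
        offset = payload_start
        for data in objects:
            key = hashlib.sha256(data).digest()
            # the empty object gets offset 0, as in the table above
            f.write(ENTRY.pack(key, offset if data else 0, len(data)))
            offset += len(data)
        for data in objects:
            f.write(data)

With the five example objects above, this writes the index at 0x01 and the first payload at 0xf1, as pictured.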

Reading

  • Locate the entry via HASH(SHA256) in the index
  • Seek to the object content and stream it to the caller in chunks of a given size (sketched below)
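
A sketch of that read path, assuming a bucket_of(key) callable standing in for the perfect hash function:

import struct

HEADER = struct.Struct(">B")
ENTRY = struct.Struct(">32sQQ")

def read_object(f, key, bucket_of, chunk_size=65536):
    # Hypothetical reader: one index read, one seek, then stream the
    # payload in chunks. `f` is the shard file opened in binary mode.
    f.seek(HEADER.size + bucket_of(key) * ENTRY.size)
    sha256, offset, size = ENTRY.unpack(f.read(ENTRY.size))
    if sha256 != key:  # the key was not in the shard
        raise KeyError(key.hex())
    f.seek(offset)
    remaining = size
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        yield chunk
        remaining -= len(chunk)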


Event Timeline

dachary changed the task status from Open to Work in Progress. (Mar 8 2021, 10:08 PM)
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.
dachary renamed this task from "Using a custom Hash Table format" to "Persistent readonly perfect hash table". (Jul 19 2021, 2:29 PM)
dachary updated the task description.

With the available explanations I don't understand how this format works.

Let's assume the shard would store the following objects:

  • foo (len=3, sha256=2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae)
  • bar (len=3, sha256=fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9)
  • baz (len=3, sha256=baa5a0964d3320fbc0c6a922140453c8513ea24ab8fd0577034804a967248096)
  • quux (len=4, sha256=053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)
  • (the empty object, len=0, sha256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)

What would

  • the write operation for all objects
  • the resulting on-disk format
  • a read operation for an object of a given hash (let's say 053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)

look like?

My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, sorted, at the beginning of the file, and to do the read of an index entry as a bisection of this list of tuples. This would mean that the write operation can stream all the objects, but needs to store the list of hash/offset/length to sort it and write it to the file eventually.
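
For reference, a sketch of what that bisection over sorted fixed-size records could look like (hypothetical code, O(log N) seeks per lookup):

import struct

ENTRY = struct.Struct(">32sQQ")  # sha256, offset, size

def bisect_lookup(f, key, n_entries, header_size=1):
    # Binary search over the sorted index records at the start of the
    # file; each probe is a seek plus a 48-byte read.
    lo, hi = 0, n_entries
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(header_size + mid * ENTRY.size)
        sha256, offset, size = ENTRY.unpack(f.read(ENTRY.size))
        if sha256 == key:
            return offset, size
        if sha256 < key:
            lo = mid + 1
        else:
            hi = mid
    raise KeyError(key.hex())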

The content of a file:

  • Byte 0: 1 # the format which includes the parameters to the perfect hash function
  • I0 Byte +1: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
  • Byte +32: \O1 # encoded on 8 bytes
  • L0 Byte +8: 3 # length of foo encoded on 8 bytes
  • I1 Byte +8: fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
  • Byte +32: \O2 # encoded on 8 bytes
  • L1 Byte +8: 3 # length of bar encoded on 8 bytes
  • \O1 Byte +32: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
  • Byte +32: foo
  • \O2 Byte +3: fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
  • Byte +32: bar

Hopefully that clarifies how the file is created, but I can go into details if it's still unclear.

Reading fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 is:

  • hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1 # entry one
  • 1 * size of an entry in the hash table (32 + 8 + 8) + size of the header == I1
  • seek I1
  • read the offset \O2 and the length L1 from I1
  • seek \O2
  • read SHA256, read L1 bytes
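
Concretely, with the table from the description: an index entry is 32 + 8 + 8 = 48 bytes and the header is 1 byte, so entry 1 starts at 1 × 48 + 1 = 49 = 0x31. That entry holds offset 0xf4 and length 3, and reading 3 bytes at 0xf4 yields 62 61 72, i.e. bar.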

My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, ...

Which the index is.

do the read of an index entry as a bisection of this list of tuples.

This would be O(log(N)), whereas the proposed format is O(1), and that will have a significant impact as the storage grows. Reading an object then requires only one additional read in the index instead of several.

Sorry, I don't understand everything here:

  • what are "the parameters to the perfect hash functions"? what are the possible formats?
  • what are \O<n>, l<n> and L<n> in this example? are they positions of actual objects?

Is this the resulting table from @olasd's example:

header       | 00 |
             +----+                                                  
        0000 | 01 |
             +----+                                                  

index        |         sha256         |      offset       |        size       |
             | 00   01   ..   30   31 | 32   ..   38   39 | 40   ..   46   47 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
        0001 | 2c | 26 | .. | e7 | ae | 00 | .. | 00 | f1 | 00 | .. | 00 | 03 |
        0031 | fc | de | .. | 8f | b9 | 00 | .. | 00 | f4 | 00 | .. | 00 | 03 |
        0061 | ba | a5 | .. | 09 | 96 | 00 | .. | 00 | f7 | 00 | .. | 00 | 03 |
        0091 | 05 | 30 | .. | a0 | a2 | 00 | .. | 00 | fa | 00 | .. | 00 | 04 |
        00c1 | e3 | b0 | .. | b8 | 55 | 00 | .. | 00 | 00 | 00 | .. | 00 | 00 |
             +----+----+----+----+----+----+----+----+----+----+----+----+----+
			 
payload      +----+----+----+
        00f1 | 66 | 6f | 6f |
        00f4 | 62 | 61 | 72 |  
        00f7 | 62 | 61 | 7a |
        00fa | 71 | 75 | 75 | 78 |
             +----+----+----+----+

Then for the reading, I don't understand this item "hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1"
What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?

what are "the parameters to the perfect hash functions"? what are the possible formats?

Given a set of keys, calculating the perfect hash function that avoids any collision yields a set of parameters. The lookup function needs these parameters, in addition to the key, to figure out the position of the entry in the hash table. That is a few bits per entry (see https://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf for instance). The possible formats depend on the actual function being implemented.
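
As a purely illustrative sketch of the "hash and displace" idea (not the actual CHD construction, whose details are in the papers): the lookup combines ordinary hash functions with a per-bucket displacement parameter chosen at build time, and real implementations compress those parameters down to a few bits per key.

import hashlib

def _h(key, salt):
    # two ordinary (non-perfect) hash functions, obtained by salting
    return int.from_bytes(hashlib.blake2b(key, salt=salt).digest()[:8], "big")

def perfect_lookup(key, displacements, table_size):
    # `displacements` is the parameter set produced at build time: one
    # integer per bucket, chosen so that no two keys end up colliding.
    bucket = _h(key, b"bucket") % len(displacements)
    return (_h(key, b"slot") + displacements[bucket]) % table_size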

what are \O<n>, l<n> ans L<n> in this example? are they positions of actual objects?

Let's forget about that and use @olasd's example and your very nice table drawn from it. I thought it would be useful to have some redundancy of the SHA256, so that both the index and the content are self-contained, but it is a detail.

What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?

It is the hash function that, given the SHA256 of an object, will return the position in the index of the entry that has the offset of the corresponding object.

dachary updated the task description.

I just realized that since a perfect hash function needs parameters, which may require additional sequential reads at the beginning of the file, it would actually make more sense to use a regular hash function with a format that allows for collisions. Even if collisions are relatively frequent, the colliding entries may be stored adjacent to each other and will not require an additional read: they are likely to be in the same block most of the time. That would save the trouble of implementing a perfect hash function.

the colliding entries may be stored adjacent to each other...

This cannot be done. Bad idea...

A 100GB file can have 25M objects (4KB median size). If a perfect hash function requires 4 bits per entry, that's ~12MB of parameters to read for every lookup.

There is also the option of storing the offset of the object in the global index. The per-shard index is then only necessary when there is no global index.

The Compress, Hash and Displace (CHD) algorithm described in http://cmph.sourceforge.net/papers/esa09.pdf generates a hash function of under 4MB for ~30M keys of 32 bytes each.

$ dd if=/dev/urandom bs=32 count=25000000 | base64 --wrap 32 > /tmp/r
$ ls -lh /tmp/r
-rw-rw-r-- 1 loic loic 1,1G juil. 19 19:11 /tmp/r
$ sudo apt install cmph-tools
$ cmph -a chd_ph -t 2 -g /tmp/r
space lower bound is 0.000 bits per key
Starting mapping step for mph creation of 33333334 keys with 33333347 bins
Starting ordering step
Starting searching step
Starting compressing step
Successfully generated minimal perfect hash function

real	0m16,081s
user	0m15,256s
sys	0m0,816s
$ ls -lh /tmp/r.mph
-rw-rw-r-- 1 loic loic 3,0M juil. 19 20:39 /tmp/r.mph

Pre-loading all of them in memory would cost at most ~4MB for each 100GB Shard, i.e. 40GB of RAM for 1PB. It may not even be necessary to explicitly cache them, since they will be read frequently.

In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store). This way, the per-shard index only gets used for standalone operation, which would (probably?) be an edge case. As I don't really grasp the specifics of the hashing algorithm, I have a hard time understanding what the actual performance will look like. log2(number of objects in a bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm?

In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store)

This is tempting, I agree. The only problem is that the global index does not scale out. There may be a solution involving different methods to store the global index in the Write Storage (Postgres) and in the Read Storage (hash table). Or a mix of both. The goal would be to make it so that the probability that looking up an object in the Read Storage requires a lookup in the global index of the Write Storage decreases as the size of the Read Storage increases.

log2(number of objects in a bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm

The CPU cost of the hash function is marginal for lookups, and building it is quite fast (seconds within a global process that takes minutes). Even a single additional read to retrieve an object is expensive because I/O is the bottleneck. The index of a 100GB Shard is about 1GB (25M entries × 48 bytes ≈ 1.2GB), and 2+ lookups in the index will require 2+ I/Os: they are unlikely to be in cache.

dachary changed the task status from Work in Progress to Open. (Aug 29 2021, 1:08 PM)

@olasd @douardda @thomash05: the following passes tox -e py3, therefore it is not complete nonsense. However it raises two questions:

  • Is it ok to have python modules with C code in them?
  • If it is ok, where should the dependencies be installed (i.e. libcmph-dev)?

Thanks for your input :-)

diff --git a/MANIFEST.in b/MANIFEST.in
index 3148ae2..b61989b 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -2,3 +2,6 @@ include Makefile
 include requirements*.txt
 include version.txt
 recursive-include swh py.typed
+include swh/perfecthash.c
+include swh/perfecthash.h
+include perfecthash_build.py
diff --git a/perfecthash_build.py b/perfecthash_build.py
new file mode 100644
index 0000000..9322a41
--- /dev/null
+++ b/perfecthash_build.py
@@ -0,0 +1,24 @@
+from cffi import FFI
+
+ffibuilder = FFI()
+
+# cdef() expects a single string declaring the C types, functions and
+# globals needed to use the shared object. It must be in valid C syntax.
+ffibuilder.cdef(
+    """
+int build(char* path);
+"""
+)
+
+ffibuilder.set_source(
+    "_perfecthash_cffi",
+    """
+                      #include "swh/perfecthash.h"
+                      """,
+    sources=["swh/perfecthash.c"],
+    include_dirs=["."],
+    libraries=["cmph"],
+)  # library name, for the linker
+
+if __name__ == "__main__":
+    ffibuilder.compile(verbose=True)
diff --git a/requirements.txt b/requirements.txt
index 0f20a59..39c8696 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,6 +2,8 @@
 # should match https://pypi.python.org/pypi names. For the full spec or
 # dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html
 
+cffi
+
 # remote storage API server
 aiohttp >= 3
 click
diff --git a/setup.py b/setup.py
index 91c4fd0..3c44259 100755
--- a/setup.py
+++ b/setup.py
@@ -46,10 +46,11 @@ setup(
     url="https://forge.softwareheritage.org/diffusion/DOBJS",
     packages=find_packages(),
     install_requires=parse_requirements() + parse_requirements("swh"),
-    setup_requires=["setuptools-scm"],
+    setup_requires=["setuptools-scm", "cffi"],
     use_scm_version=True,
     extras_require={"testing": parse_requirements("test")},
     include_package_data=True,
+    cffi_modules=["perfecthash_build.py:ffibuilder"],
     entry_points="""
         [swh.cli.subcommands]
         objstorage=swh.objstorage.cli
diff --git a/swh/objstorage/tests/test_perfecthash.py b/swh/objstorage/tests/test_perfecthash.py
new file mode 100644
index 0000000..8d26275
--- /dev/null
+++ b/swh/objstorage/tests/test_perfecthash.py
@@ -0,0 +1,5 @@
+from _perfecthash_cffi import lib
+
+
+def test_build():
+    assert lib.build(b"path") == 0
diff --git a/swh/perfecthash.c b/swh/perfecthash.c
new file mode 100644
index 0000000..b6670ec
--- /dev/null
+++ b/swh/perfecthash.c
@@ -0,0 +1,5 @@
+#include "swh/perfecthash.h"
+
+int build(char *path) {
+    return 0;
+}
diff --git a/swh/perfecthash.h b/swh/perfecthash.h
new file mode 100644
index 0000000..1dbf05b
--- /dev/null
+++ b/swh/perfecthash.h
@@ -0,0 +1 @@
+#include <cmph.h>

Ideally, since the perfecthash feature will be needed only for a specific objstorage backend, it should be an optional dependency.

Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?

Or use this one maybe https://github.com/GregBowyer/cmph-cffi ?

Source for the cmph-cffi package in pypi seems to be https://github.com/venkateshks/cmph-cffi (well at least there are tags in there)

Ideally, since the perfecthash feature will be needed only for a specific objstorage backend, it should be an optional dependency.

Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?

Or use this one maybe https://github.com/GregBowyer/cmph-cffi ?

Source for the cmph-cffi package in pypi seems to be https://github.com/venkateshks/cmph-cffi (well at least there are tags in there)

Humm not sure what the deal is there:

https://github.com/GregBowyer/cmph-cffi/commit/5b797b1d098759a47b47efb379da147b9100093c

Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?

It would, but who would maintain it in the long run?

Humm not sure what the deal is there:

Yeah, there was some kind of split over five years ago, which is why I was not very keen on re-using these. Moreover, they include the sources of cmph, and it seems easier to link with the packaged version.

Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?

It would, but who would maintain it in the long run?

SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.

Humm not sure what the deal is there:

Yeah, there was some kind of split over five years ago, which is why I was not very keen on re-using these. Moreover, they include the sources of cmph, and it seems easier to link with the packaged version.

So does it make sense to use this package instead of reimplementing one? What's the catch?

SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.

Ok.

So does it make sense to use this package instead of reimplementing one? What's the catch?

I was concerned about adding a dependency on a 5+ year old, unmaintained PyPI binary package. I did not even try to use it. If your recommendation is to go for it, that's what I'll do.

@douardda

SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.

If I understand correctly, you're suggesting that I create a package at the same level as https://forge.softwareheritage.org/source/puppet-swh-site/, right? For instance https://forge.softwareheritage.org/source/swh-perfecthash/, by following the instructions from the documentation.

So does it make sense to use this package instead of reimplementing one? What's the catch?

In addition to being unmaintained, the current packages do not cover everything SWH needs implemented in C, and they expose interfaces that are not useful here. The goal is to use the CHD algorithm and no other: there is no need to expose the alternatives. Even if the decision is made to use another one at a later time, it will be a change internal to the module, not exposed via the API. The API of the module will be similar to the existing cmph modules, but with different semantics: the list of keys must not be read from the shard, converted to Python objects, and provided as arguments to the generate_hash function, only to be converted back to C objects to compute the hash function. Instead, the generate_hash function should be given a file descriptor to read the ~30 million keys from and a file descriptor to write the hash function to.
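
To make those semantics concrete, a hedged sketch of what the cffi declaration could look like (the name generate_hash and its signature are hypothetical, not a final API):

from cffi import FFI

ffibuilder = FFI()

# Hypothetical interface: the C side reads the keys straight from one
# file descriptor and writes the resulting hash function to another,
# without round-tripping millions of keys through Python objects.
ffibuilder.cdef(
    """
int generate_hash(int keys_fd, int mph_fd);
"""
)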

@douardda

SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.

If I understand correctly, you're suggesting that I create a package at the same level as https://forge.softwareheritage.org/source/puppet-swh-site/, right? For instance https://forge.softwareheritage.org/source/swh-perfecthash/, by following the instructions from the documentation.

So does it make sense to use this package instead of reimplementing one? What's the catch?

In addition to being unmaintained,

this could be addressed by asking authors to be in charge of the package

the current packages do not cover everything SWH needs implemented in C, and they expose interfaces that are not useful here. The goal is to use the CHD algorithm and no other: there is no need to expose the alternatives. Even if the decision is made to use another one at a later time, it will be a change internal to the module, not exposed via the API. The API of the module will be similar to the existing cmph modules, but with different semantics: the list of keys must not be read from the shard, converted to Python objects, and provided as arguments to the generate_hash function, only to be converted back to C objects to compute the hash function. Instead, the generate_hash function should be given a file descriptor to read the ~30 million keys from and a file descriptor to write the hash function to.

That I did not know. So indeed, if we need a specific wrapper for our needs, then it makes sense to create a dedicated swh-perfecthash package. In any case, I would prefer this package to be standalone rather than embedded in swh-objstorage (at least for packaging/distribution reasons).

In addition to being unmaintained,

this could be addressed by asking authors to be in charge of the package

I'm not sure how to do that? Do you mean, for instance, asking Greg Bowyer, author of https://github.com/GregBowyer/cmph-cffi to be in charge of which package? I'm probably slow today :-)

That I did not know. So indeed, if we need a specific wrapper for our needs, ...

I should have clarified that even if there were a well-maintained cmph module, there would still be a need for a Software Heritage specific C package. The goal is to optimize the creation of the perfect hash table. During benchmarking I noticed that creating, writing and reading millions of binary entries from a file in pure Python takes time, which is why I decided to go for a C implementation.

it makes sense to create a dedicated swh-perfecthash package.

Should this package be included in, or be a child of, swh-environment or not?

I'd like to create a new package (swh-objstorage-hash) and https://docs.softwareheritage.org/devel/tutorials/add-new-package.html is presumably the guide to do that. However, I do not have the required permissions: would someone be so kind as to work with me on this?

I created a package and am working on the code in the meantime. Here is how it looks right now (it's not much ;-) ).

olasd changed the status of subtask T3634: Create swh-perfecthash module from Open to Work in Progress. (Oct 6 2021, 3:52 PM)

For the record I created a "draft" repository for contributions to https://forge.softwareheritage.org/source/swh-perfecthash/ at https://git.easter-eggs.org/biceps/swh-perfecthash. It is only meant to be a publicly available sandbox.

dachary changed the task status from Open to Work in Progress. (Oct 18 2021, 9:00 PM)
dachary changed the status of subtask T3520: Persistent readonly perfect hash table: implementation from Open to Work in Progress.
dachary changed the status of subtask T3521: Persistent readonly perfect hash table: benchmarks from Open to Work in Progress.