Page MenuHomeSoftware Heritage

Perfect hashmap C implementation
ClosedPublic

Authored by dachary on Oct 6 2021, 4:25 PM.

Details

Summary

This is a working C implementation of a perfect hashmap with tests and decent code
coverage (all but error conditions).

Related to T3634

Test Plan

tox (it includes tests on the C part of the implementation)

Diff Detail

Repository
rDOPH Perfect Hash algorithm
Branch
wip-hash (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24386
Build 38062: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 38061: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@ardumont I believe all your comments were addressed. It is lucky that you did not review the actual logic: while running benchmarks I ran into a bug that lead to a significant refactor (using fread/fwrite instead of mmap and addresses). Not to the point where you'll have to re-read everything from scratch though 😅

The sources were documented and reorganized for clarity. Please let me know if some areas are obscure: it is in my short term memory and I may have overlooked the need to explain a few things.

Thanks for your review!

Build is green

Patch application report for D6424 (id=23792)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 1437780b18244d5a5491c2ab066ce4a44ccf77d8
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/12/ for more details.

document the benchmark process

Build has FAILED

Patch application report for D6424 (id=23806)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 3dbe00343987924ec63c2305c1fb1aa557ac475a
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/13/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/13/console

fix documentation bugs

21:18:55 Warning, treated as error:
21:18:55 /var/lib/jenkins/workspace/DOPH/tests-on-diff/docs/benchmarks.rst:24:Title underline too short.
21:18:55

Build is green

Patch application report for D6424 (id=23807)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 81291ec25e5d4611b8ab91ca724ac2fe175147fe
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/14/ for more details.

docs/benchmarks.rst
52 ↗(On Diff #23807)
swh/perfecthash/__init__.py
51

From the docstring.

60

From the docstring ;)

90

Something hit me, can't the lookup fail?

What would be the return if that was the case None?
(if so, the type must be changed to Optional[HashObject]).

swh/perfecthash/build.py
23

Then, the right effort needed is just to update a bit the existing comment, one extra sentence like this would do:

The following is only the necessary part parsed by cffi to generate python bindings.

What do you think?

swh/perfecthash/hash.c
117

Are those error log messages (i see those in all C functions) destined to say as is or
is there some plan to improve them a bit?

Note: I gather those printf instructions with those checks are actually some kind of
debug which serve as heads up "there is something fishy in that function, here is its
name, have a look". I'm not sure whether they are just here temporarily for debug
purposes or if they're gonna stay forever. If it's the latter, I guess they need some
improvments?

269

That looks like an unreachable statement.

conftest.py
18 ↗(On Diff #23807)

Looks good to me. Thanks.

Some typos and possibly one fix inline.

I've changed from group reviewers to "blocking" group reviewers so i can validate it as
myself without others no longer seeing that diff. That might let others have a chance to
look into the diff, maybe @douardda which had a look at some point.

Cheers,

swh/perfecthash/tests/test_hash.py
61

I don't get that comparison, where does the 5 ratio come from?

swh/perfecthash/tests/test_hash.py
61

It seems to be a ratio that satisfies what's explained in the readme earlier (both for large and small files)
but i still don't get where it comes from ;)

dachary added inline comments.
swh/perfecthash/__init__.py
90

I overlooked this question, sorry. The lookup cannot fail: if given a key that does not exist it will return a random object. Which is ok because the global index ensures the Read Shard is only queried with objects that are known to exist in the hash map.

swh/perfecthash/hash.c
117

The debug messags are for debug, the printf messages are here to stay.

They are designed to uniquely identify the location of the error in the source code. When arguments are provided they are also printed: they are all the information that can be provided to the user, instead of just "fail error fail" :-) In the case of this line, arguments are hardcoded and there is no gain in printing them again.

Does that make sense.

269

It does :-D

swh/perfecthash/tests/test_hash.py
61

The idea is to guard against massive performance regressions. When and if creating a shard is five times slower than copying a file... there is cause for concern. It comes from my head and has no real justification other than: this would be bad.

dachary marked 3 inline comments as done.

address ardrumont comments

swh/perfecthash/__init__.py
90

you did not overlook it, it's a new one!

If anything, that's me who overlooked that code last time ;)

Great! "Can't happen!"

Build is green

Patch application report for D6424 (id=23820)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 257c0c7fc1c4d89edfdd1434502a8b3d747d9f34
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/15/ for more details.

swh/perfecthash/hash.c
117

yes, it does, thanks.

swh/perfecthash/tests/test_hash.py
61

As a comment maybe something like:

# According to the docs/benchmarks.rst analysis, the duration is below 5 times the baseline time
# This assertion is here to ensure we do not not regress in the future...

or something

In any case, thanks for the explanation!

Thanks a lot for the reviews!

comment on hardcoded value

Build is green

Patch application report for D6424 (id=23821)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 4a7faf6436d4fb5b2a20931720ff82edf1501797
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/16/ for more details.

adjust lookup speed results and expectations according to today's benchmark results

Build is green

Patch application report for D6424 (id=23822)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 2fe3903c5bb4d416b0d6f67719b152f4aa886c98
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/17/ for more details.

dachary edited the test plan for this revision. (Show Details)
dachary marked 4 inline comments as done.

add two missing type annotations

Build is green

Patch application report for D6424 (id=23957)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 024fe5a8d1da7193a08a6bef3d66caefa3bd5604
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/18/ for more details.

@ardumont I revisited all comments and noticed I missed a few, sorry about that. They should be good now :-) Is there anything else you need me to do for this proposed patch? Or should I just be patient and wait for whatever comes next in the review merge process? Thanks for your guidance!

mypy.ini
14–18

This is a module that we have developed, would it be possible to write a .pyi file for it (maybe not now, but in a further diff)?

swh/perfecthash/build.py
36

Shouldn't this be prefixed by swh.perfecthash (and the module name be swh.perfecthash._hash_cffi)?

I still need to do a pass through the C code, which "feels" very dense. Some of it has debug statements which help navigate, but a lot doesn't so it's not too easy to go through (but that's mostly because it's been a while since I've been serious about reading some C code :D).

dachary added inline comments.
mypy.ini
14–18

I'll make a note about that, good idea.

swh/perfecthash/Makefile
17

For the record the contentious line was removed: it is entirely unecessary whatever the context.

swh/perfecthash/build.py
36

I'm unclear about the requirements of ffibuilder: it works as it is therefore I conclude it should not be different. But I can try with prefixing the module with swh.prefecthash if you think it is important.

swh/perfecthash/build.py
36

I mean, this looks like this is creating a "non-namespaced" extension module called _hash_cffi (which would be installed at the root of the python modules installation path), which sounds wrong.

dachary marked 3 inline comments as done.

move to _hash_cffi swh.perfecthash._hash_cffi

Build is green

Patch application report for D6424 (id=24002)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit f326c93363310fc37db262af511a59a1ff2f1ced
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/19/ for more details.

dachary added inline comments.
swh/perfecthash/build.py
36

Good catch: it is exactly like that and I overlooked it. I'm surprised it did not just break with tox. Fixed!

dachary added inline comments.
swh/perfecthash/build.py
36

Interesting. Since you're asking my preference would be to keep the name as it is but that's because I'm lazy. I'd be happy to change it if you think it's better though: I'm not that lazy ;-)

swh/perfecthash/build.py
36

heh, it doesn't matter

Some generic comments on the C side:

Could you document the expected shard layout in RestructuredText format? This code is setting the (current version of the) layout "forever", so we should make sure to properly document it.

If I understand correctly, the shard_header_t will be serialized at the beginning of the file/device/... containing a shard, correct?

If that is the case, then I think we should make the beginning of this struct more specific, with:

  • a magic string/number ("SWHShard" happens to fit in 64 bits/8 bytes)
  • a version number (probably 8 bytes, set to 1, in network byte order, to begin with)

and make sure that these match when we're reading the header of a shard again? I'm sure that we'll eventually have to move to "shards version 2", and we'll be happy to have a flag in the built-in header to be able to route the logic.

I don't think we should be storing fields defined as size_t on disk, as I believe that's architecture-dependent: I see that you have defined and use ntohq / htonq to, if I understand correctly, make sure that the endianness of the data stored on disk is always in network byte order (although currently we're just always swapping the byte orders, AFAICT). If we end up using an architecture with a 32 (or 128) bit size_t, or even rarer, a big endian architecture, we'll have a bad time.

I think we should:

  • store explicit uint64_t values on disk, with a specified byte order.
  • use off_t instead of size_t on the C side, for types that store file offsets (which should be a signed 64 bits value when compiled with -D_FILE_OFFSET_BITS=64, even on 32 bits systems). Unfortunately that means wasting a bit on offset storage, but I think we can live with that.

I am a bit (heh) nervous about the endianness gymnastics, which mean that shard_header_t struct items can either be in network byte order (when directly hydrated from disk) or in native byte order (within shard_t). Duplicating the struct types between "serialized" and "deserialized" types may be a bit overkill, but if we want uint64_ts on the serialized side, and 64-bit off_ts on the deserialized side, we probably have to do that anyway?

Finally, I'm not sure about the packing characteristics of these struct types. Considering that shard_header_load does a read of the shard header directly into a structure, I think we'll at least want to make it __attribute__((packed)) to ensure that the layout is always consistent regardless of architecture/alignment characteristics?

And a couple comments for the Python API:

Shard could be a context manager: __enter__ to open the file and load the shard index, __exit__ to unload the shard index and close the file. That way, one can use with Shard('/dev/xxx') as s: s.lookup(blabla) and have the initialization/teardown taken care of automatically.

I wonder if lookup should return a (minimalistic) file-like interface (with at least read(size), tell() and seek(offset) methods) instead of just giving you a blob of bytes fully fetched in memory. That's probably something we will want to do in another diff.

I initially considered suggesting giving Shard a mode parameter ('r' or 'w'), but after some consideration, and as the operations are always disjoint (i.e. you always look up objects, or you always write all the objects, but doing both is pretty rare), I think we should have two classes, one for a read-only shard (with only lookup methods), and one for a write-only shard (which takes a count of objects in the initializer, always wipes the shard, and always calls save to finalize the index in the __exit__ method, after the proper number of writes). The write-only class could also raise an exception if the number of writes doesn't match the expectation.

Could you document the expected shard layout in RestructuredText format? This code is setting the (current version of the) layout "forever", so we should make sure to properly document it.

It was aded as docs/format.rst

If I understand correctly, the shard_header_t will be serialized at the beginning of the file/device/... containing a shard, correct?

If that is the case, then I think we should make the beginning of this struct more specific, with:

a magic string/number ("SWHShard" happens to fit in 64 bits/8 bytes)

Done.

a version number (probably 8 bytes, set to 1, in network byte order, to begin with)

Done.

and make sure that these match when we're reading the header of a shard again? I'm sure that we'll eventually have to move to "shards version 2", and we'll be happy to have a flag in the built-in header to be able to route the logic.

Done.

I don't think we should be storing fields defined as size_t on disk, as I believe that's architecture-dependent: I see that you have defined and use ntohq / htonq to, if I understand correctly, make sure that the endianness of the data stored on disk is always in network byte order (although currently we're just always swapping the byte orders, AFAICT). If we end up using an architecture with a 32 (or 128) bit size_t, or even rarer, a big endian architecture, we'll have a bad time.

I think we should:

store explicit uint64_t values on disk, with a specified byte order.

Done.

use off_t instead of size_t on the C side, for types that store file offsets (which should be a signed 64 bits value when compiled with -D_FILE_OFFSET_BITS=64, even on 32 bits systems). Unfortunately that means wasting a bit on offset storage, but I think we can live with that.

I'm not sure what would be the upside so I kept it as u_int64_t. Do you have a specific scenario in mind where it would create a problem?

I am a bit (heh) nervous about the endianness gymnastics, which mean that shard_header_t struct items can either be in network byte order (when directly hydrated from disk) or in native byte order (within shard_t). Duplicating the struct types between "serialized" and "deserialized" types may be a bit overkill, but if we want uint64_ts on the serialized side, and 64-bit off_ts on the deserialized side, we probably have to do that anyway?

Finally, I'm not sure about the packing characteristics of these struct types. Considering that shard_header_load does a read of the shard header directly into a structure, I think we'll at least want to make it attribute((packed)) to ensure that the layout is always consistent regardless of architecture/alignment characteristics?

Good catch: ntohq/ntohq were missing the big endian counterpart (yerk!). Writing/reading structs directly is unecessarily difficult in this case, it was replaced with writing individual u_int64_t in network order instead.

And a couple comments for the Python API:

Shard could be a context manager: enter to open the file and load the shard index, exit to unload the shard index and close the file. That way, one can use with Shard('/dev/xxx') as s: s.lookup(blabla) and have the initialization/teardown taken care of automatically.

I wonder if lookup should return a (minimalistic) file-like interface (with at least read(size), tell() and seek(offset) methods) instead of just giving you a blob of bytes fully fetched in memory. That's probably something we will want to do in another diff.

I initially considered suggesting giving Shard a mode parameter ('r' or 'w'), but after some consideration, and as the operations are always disjoint (i.e. you always look up objects, or you always write all the objects, but doing both is pretty rare), I think we should have two classes, one for a read-only shard (with only lookup methods), and one for a write-only shard (which takes a count of objects in the initializer, always wipes the shard, and always calls save to finalize the index in the exit method, after the proper number of writes). The write-only class could also raise an exception if the number of writes doesn't match the expectation.

This all makes perfect sense to me. However I'd rather table them until the package is actually used. I suspect the code using this module will drive the desired API although I'm unable to figure out what would be the most natural fit right now. I'm sensing refining the API at the moment would slightly be over engineering. But ... that may be me being lazy or lacking vision and I'll defer to your better judgement in any case :-)

Build is green

Patch application report for D6424 (id=24055)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 5897b0dd584b834b3bf8cd3739bcfba5b352552a
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/20/ for more details.

I don't think we should be storing fields defined as size_t on disk, as I believe that's architecture-dependent: I see that you have defined and use ntohq / htonq to, if I understand correctly, make sure that the endianness of the data stored on disk is always in network byte order (although currently we're just always swapping the byte orders, AFAICT). If we end up using an architecture with a 32 (or 128) bit size_t, or even rarer, a big endian architecture, we'll have a bad time.

I think we should:

store explicit uint64_t values on disk, with a specified byte order.

Done.

use off_t instead of size_t on the C side, for types that store file offsets (which should be a signed 64 bits value when compiled with -D_FILE_OFFSET_BITS=64, even on 32 bits systems). Unfortunately that means wasting a bit on offset storage, but I think we can live with that.

I'm not sure what would be the upside so I kept it as u_int64_t. Do you have a specific scenario in mind where it would create a problem?

Is u_int64_t something that comes from cmph ? That's pretty gratuitously inconsistent with the standard C uint64_t (don't bother changing it).

Anyway, there's a couple of issues I can think of:

  • the fseek/ftell APIs only work on native longs. For 32 bit architectures, that's only up to 0x7fffffff (=2GiB). That's what -D_FILE_OFFSET_BITS=64 and using lseek should address: they make off_t a long long (64 bit signed integer) on 32-bit archs.
  • u_int64_t conversion to signed offset should formally check for overflow, even though "it shouldn't happen in practice"™.
  • the converse signed offset > uint64_t conversion needs to check for underflow, but you've done that.

(can you guess why I dislike C?)

I am a bit (heh) nervous about the endianness gymnastics, which mean that shard_header_t struct items can either be in network byte order (when directly hydrated from disk) or in native byte order (within shard_t). Duplicating the struct types between "serialized" and "deserialized" types may be a bit overkill, but if we want uint64_ts on the serialized side, and 64-bit off_ts on the deserialized side, we probably have to do that anyway?

Finally, I'm not sure about the packing characteristics of these struct types. Considering that shard_header_load does a read of the shard header directly into a structure, I think we'll at least want to make it attribute((packed)) to ensure that the layout is always consistent regardless of architecture/alignment characteristics?

Good catch: ntohq/ntohq were missing the big endian counterpart (yerk!). Writing/reading structs directly is unecessarily difficult in this case, it was replaced with writing individual u_int64_t in network order instead.

:-)

And a couple comments for the Python API:

Shard could be a context manager: enter to open the file and load the shard index, exit to unload the shard index and close the file. That way, one can use with Shard('/dev/xxx') as s: s.lookup(blabla) and have the initialization/teardown taken care of automatically.

I wonder if lookup should return a (minimalistic) file-like interface (with at least read(size), tell() and seek(offset) methods) instead of just giving you a blob of bytes fully fetched in memory. That's probably something we will want to do in another diff.

I initially considered suggesting giving Shard a mode parameter ('r' or 'w'), but after some consideration, and as the operations are always disjoint (i.e. you always look up objects, or you always write all the objects, but doing both is pretty rare), I think we should have two classes, one for a read-only shard (with only lookup methods), and one for a write-only shard (which takes a count of objects in the initializer, always wipes the shard, and always calls save to finalize the index in the exit method, after the proper number of writes). The write-only class could also raise an exception if the number of writes doesn't match the expectation.

This all makes perfect sense to me. However I'd rather table them until the package is actually used. I suspect the code using this module will drive the desired API although I'm unable to figure out what would be the most natural fit right now. I'm sensing refining the API at the moment would slightly be over engineering. But ... that may be me being lazy or lacking vision and I'll defer to your better judgement in any case :-)

Yeah, that makes sense.

Anyway, there's a couple of issues I can think of:...

Good points. I made the following changes to address them:

  • use fseeko/ftello to guard against long overflow
  • check uint64_t overflow before using as off_t
  • compile with -D_FILE_OFFSET_BITS=64
  • s/u_int64_t/uint64_t/

(can you guess why I dislike C?)

I think I can :-)

Build is green

Patch application report for D6424 (id=24097)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 46cf1764d84648117ce35b54a2e1475873f964fc
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/21/ for more details.

Build is green

Patch application report for D6424 (id=24098)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 9266eaa6ead429e1e754c539100d7449895fdb14
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal).
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/22/ for more details.

This revision is now accepted and ready to land.Nov 19 2021, 5:54 PM