This is a working C implementation of a perfect hashmap with tests and decent code
coverage (all but error conditions).
Related to T3634
Differential D6424
Perfect hashmap C implementation dachary on Oct 6 2021, 4:25 PM. Authored by
Details
This is a working C implementation of a perfect hashmap with tests and decent code Related to T3634 tox (it includes tests on the C part of the implementation)
Diff Detail
Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes Comment Actions @ardumont I believe all your comments were addressed. It is lucky that you did not review the actual logic: while running benchmarks I ran into a bug that lead to a significant refactor (using fread/fwrite instead of mmap and addresses). Not to the point where you'll have to re-read everything from scratch though 😅 The sources were documented and reorganized for clarity. Please let me know if some areas are obscure: it is in my short term memory and I may have overlooked the need to explain a few things. Thanks for your review! Comment Actions Build is green Patch application report for D6424 (id=23792)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 1437780b18244d5a5491c2ab066ce4a44ccf77d8 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/12/ for more details. Comment Actions Build has FAILED Patch application report for D6424 (id=23806)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 3dbe00343987924ec63c2305c1fb1aa557ac475a Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/13/ Comment Actions fix documentation bugs 21:18:55 Warning, treated as error: Comment Actions Build is green Patch application report for D6424 (id=23807)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 81291ec25e5d4611b8ab91ca724ac2fe175147fe Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/14/ for more details.
Comment Actions Looks good to me. Thanks. Some typos and possibly one fix inline. I've changed from group reviewers to "blocking" group reviewers so i can validate it as Cheers,
Comment Actions Build is green Patch application report for D6424 (id=23820)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 257c0c7fc1c4d89edfdd1434502a8b3d747d9f34 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/15/ for more details.
Comment Actions Build is green Patch application report for D6424 (id=23821)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 4a7faf6436d4fb5b2a20931720ff82edf1501797 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/16/ for more details. Comment Actions Build is green Patch application report for D6424 (id=23822)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 2fe3903c5bb4d416b0d6f67719b152f4aa886c98 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/17/ for more details. Comment Actions Build is green Patch application report for D6424 (id=23957)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 024fe5a8d1da7193a08a6bef3d66caefa3bd5604 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/18/ for more details. Comment Actions @ardumont I revisited all comments and noticed I missed a few, sorry about that. They should be good now :-) Is there anything else you need me to do for this proposed patch? Or should I just be patient and wait for whatever comes next in the review merge process? Thanks for your guidance! Comment Actions I still need to do a pass through the C code, which "feels" very dense. Some of it has debug statements which help navigate, but a lot doesn't so it's not too easy to go through (but that's mostly because it's been a while since I've been serious about reading some C code :D).
Comment Actions Build is green Patch application report for D6424 (id=24002)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit f326c93363310fc37db262af511a59a1ff2f1ced Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/19/ for more details.
Comment Actions Some generic comments on the C side: Could you document the expected shard layout in RestructuredText format? This code is setting the (current version of the) layout "forever", so we should make sure to properly document it. If I understand correctly, the shard_header_t will be serialized at the beginning of the file/device/... containing a shard, correct? If that is the case, then I think we should make the beginning of this struct more specific, with:
and make sure that these match when we're reading the header of a shard again? I'm sure that we'll eventually have to move to "shards version 2", and we'll be happy to have a flag in the built-in header to be able to route the logic. I don't think we should be storing fields defined as size_t on disk, as I believe that's architecture-dependent: I see that you have defined and use ntohq / htonq to, if I understand correctly, make sure that the endianness of the data stored on disk is always in network byte order (although currently we're just always swapping the byte orders, AFAICT). If we end up using an architecture with a 32 (or 128) bit size_t, or even rarer, a big endian architecture, we'll have a bad time. I think we should:
I am a bit (heh) nervous about the endianness gymnastics, which mean that shard_header_t struct items can either be in network byte order (when directly hydrated from disk) or in native byte order (within shard_t). Duplicating the struct types between "serialized" and "deserialized" types may be a bit overkill, but if we want uint64_ts on the serialized side, and 64-bit off_ts on the deserialized side, we probably have to do that anyway? Finally, I'm not sure about the packing characteristics of these struct types. Considering that shard_header_load does a read of the shard header directly into a structure, I think we'll at least want to make it __attribute__((packed)) to ensure that the layout is always consistent regardless of architecture/alignment characteristics? And a couple comments for the Python API: Shard could be a context manager: __enter__ to open the file and load the shard index, __exit__ to unload the shard index and close the file. That way, one can use with Shard('/dev/xxx') as s: s.lookup(blabla) and have the initialization/teardown taken care of automatically. I wonder if lookup should return a (minimalistic) file-like interface (with at least read(size), tell() and seek(offset) methods) instead of just giving you a blob of bytes fully fetched in memory. That's probably something we will want to do in another diff. I initially considered suggesting giving Shard a mode parameter ('r' or 'w'), but after some consideration, and as the operations are always disjoint (i.e. you always look up objects, or you always write all the objects, but doing both is pretty rare), I think we should have two classes, one for a read-only shard (with only lookup methods), and one for a write-only shard (which takes a count of objects in the initializer, always wipes the shard, and always calls save to finalize the index in the __exit__ method, after the proper number of writes). The write-only class could also raise an exception if the number of writes doesn't match the expectation. Comment Actions
It was aded as docs/format.rst
Done.
Done.
Done.
Done.
I'm not sure what would be the upside so I kept it as u_int64_t. Do you have a specific scenario in mind where it would create a problem?
Good catch: ntohq/ntohq were missing the big endian counterpart (yerk!). Writing/reading structs directly is unecessarily difficult in this case, it was replaced with writing individual u_int64_t in network order instead.
This all makes perfect sense to me. However I'd rather table them until the package is actually used. I suspect the code using this module will drive the desired API although I'm unable to figure out what would be the most natural fit right now. I'm sensing refining the API at the moment would slightly be over engineering. But ... that may be me being lazy or lacking vision and I'll defer to your better judgement in any case :-) Comment Actions Build is green Patch application report for D6424 (id=24055)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 5897b0dd584b834b3bf8cd3739bcfba5b352552a Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/20/ for more details. Comment Actions Is u_int64_t something that comes from cmph ? That's pretty gratuitously inconsistent with the standard C uint64_t (don't bother changing it). Anyway, there's a couple of issues I can think of:
(can you guess why I dislike C?)
:-)
Yeah, that makes sense. Comment Actions
Good points. I made the following changes to address them:
I think I can :-) Comment Actions Build is green Patch application report for D6424 (id=24097)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 46cf1764d84648117ce35b54a2e1475873f964fc Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/21/ for more details. Comment Actions Build is green Patch application report for D6424 (id=24098)Rebasing onto 8229acfd16... Current branch diff-target is up to date. Changes applied before testcommit 9266eaa6ead429e1e754c539100d7449895fdb14 Author: Loïc Dachary <loic@dachary.org> Date: Tue Oct 12 16:54:40 2021 +0200 create and lookup a Read Shard with a perfect hash This package is intended to be used by the new object storage, as a low level dependency to create and lookup a Read Shard. It is implemented in C and based on the cmph library for better performances. It will be used when a Read Shard must be created with around fifty millions objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in python from the postgres database where the Write Shard lives. One after the other they will be inserted in the Read Shard using the **write** method. In the end the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds). There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys. The **lookup** method relies on the hash function which is loaded in memory when the **load** function is called. It obtains the offset of the object by looking up its offset in the file from an index which may be up to 2x the number of keys (it is not minimal). Signed-off-by: Loïc Dachary <loic@dachary.org> commit ed6bad56a50530ff091e59ba566cdf45a5349d06 Author: Loïc Dachary <loic@dachary.org> Date: Wed Oct 6 16:23:07 2021 +0200 C stub compiled via cffi and tested in python See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/22/ for more details. |