Page MenuHomeSoftware Heritage

C stub compiled via cffi and tested in python
Needs ReviewPublic

Authored by dachary on Wed, Oct 6, 4:25 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3634: Create swh-perfecthash module
Summary

Related to T3634

Diff Detail

Repository
rDOPH Perfect Hash algorithm
Branch
wip-hash (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24391
Build 38071: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 38070: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6424 (id=23343)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/1/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/1/console

Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Oct 6, 4:26 PM
Harbormaster failed remote builds in B24257: Diff 23343!

Build has FAILED

Patch application report for D6424 (id=23343)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/2/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/2/console

Build is green

Patch application report for D6424 (id=23343)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/3/ for more details.

@dachary It'd be nice if you could describe what this is about in the commit message and
the diff description (if you actually provide a commit description, then when you create
the diff, the commit message is used as a description bootstrap). I know it's more work
for you but it happens that:

  1. it helps the reviewers to have some context directly here (without having to follow

between a multitude of tasks. FYI, I've followed through the task but it's not enough,
i need to also dig in that arborescence of tasks).

  1. is also how we are doing that in every other modules ;)
  1. the curious could learn a thing or 2 even if they don't do a proper review.

Please and thanks in advance.

Cheers,

@dachary It'd be nice if you could describe what this is about in the commit message and
the diff description (if you actually provide a commit description, then when you create
the diff, the commit message is used as a description bootstrap). I know it's more work
for you but it happens that:

  1. it helps the reviewers to have some context directly here (without having to follow

between a multitude of tasks. FYI, I've followed through the task but it's not enough,
i need to also dig in that arborescence of tasks).

  1. is also how we are doing that in every other modules ;)
  1. the curious could learn a thing or 2 even if they don't do a proper review.

Please and thanks in advance.

Cheers,

FWIW I see this mostly as a test to see if the scaffolding for CFFI (and the cmph library) would work on our CI. I assume some more relevant code will be coming in better described diffs.

If this change is to be landed as is, then all the #if 0 code should go away, and the descriptions should be indeed improved.

I'm curious to know where the libcmph-dev dependency was added?

@dachary It'd be nice if you could describe what this is about in the commit message and
the diff description (if you actually provide a commit description, then when you create
the diff, the commit message is used as a description bootstrap). I know it's more work
for you but it happens that:

  1. it helps the reviewers to have some context directly here (without having to follow

between a multitude of tasks. FYI, I've followed through the task but it's not enough,
i need to also dig in that arborescence of tasks).

  1. is also how we are doing that in every other modules ;)
  1. the curious could learn a thing or 2 even if they don't do a proper review.

Please and thanks in advance.

Cheers,

I'll make sure to do that in the upcoming patches. As @olasd wrote, the merit of this patch is merely to set the stage and make sure the module can be compiled against the cmph library. The code between #if 0 is irrelevant and I can remove it to avoid confusion if you think it is better. Just let me know :-)

I'll make sure to do that in the upcoming patches. As @olasd wrote, the merit of this patch is merely to set the stage and make sure the module can be compiled against the cmph library. The code between #if 0 is irrelevant and I can remove it to avoid confusion if you think it is better. Just let me know :-)

Thanks. Great.

jsyk, even a small remark on that diff description to say that at first "debug building purposes" would have been fine ;)

Cheers,

Build has FAILED

Patch application report for D6424 (id=23445)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit c6bb435a0e98d34537d5cb06393251956a24d85a
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 16:08:50 2021 +0200

    skeleton

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/4/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/4/console

skeleton

Still very much in draft stage. I need to figure out how to create a python buffer from the pointer/size returned by the lookup function. The general idea is however sound, I think.

  • Open the Read Shard
    • Open a file (a /dev/rbd/abc device really)
    • mmap the file
    • read the header
  • Creating the Read Shard
    • write each object from python (key + object)
    • collect all keys and the object offset
    • when write is complete, proceed with (i) creating the hash, (ii) creating the index and writing it after the objects, (iii) writing the hash function after the index
  • Reading the Read Shard
    • load the hash function
    • lookup the position in the index using the hash function with the key
    • load the object from the position obtained from the index and return it

There is no object allocation at read time in C, the returned pointer is to a mmap'ed section of the device.

The hash function is loaded once per shard and should be shared between workers (mmap flag does that). A lookup accesses the file 2 times: once to fetch the index content, once when the object is read from python.

Build has FAILED

Patch application report for D6424 (id=23458)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 6a2e8dc73f35b51b7915b875ebfd4035efb8b737
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 22:27:15 2021 +0200

    compiles but crashes

commit c6bb435a0e98d34537d5cb06393251956a24d85a
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 16:08:50 2021 +0200

    skeleton

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/5/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/5/console

working implementation, with tests

Build has FAILED

Patch application report for D6424 (id=23469)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit e4ef0feb30c3c84e4eccd12d139a8194e4aa1824
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 14:48:21 2021 +0200

    working implementation

commit ce93bfd09e83463b3c9f2f111fdb784cc44dd429
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 01:09:03 2021 +0200

    run tests with valgrind and coverage report

commit 0aac43f62756374e1e899fd31378439e15d337ea
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 00:04:12 2021 +0200

    minimal test valgrind clean

commit ca62be2b966bc0a1a90e0728d07571a3f35fd818
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 23:50:01 2021 +0200

    indent 4

commit 363be54f5d7d28763eb923c10c55d52721534911
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 23:36:16 2021 +0200

    clang-format

commit 237f7994fbde9c724d1a1b9112dd1f82230843d4
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 23:32:08 2021 +0200

    temporary directory

commit b744f19529fe59f382ca0b427dd445b6b7565123
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 22:44:31 2021 +0200

    ignore compiled files

commit 6a2e8dc73f35b51b7915b875ebfd4035efb8b737
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 22:27:15 2021 +0200

    compiles but crashes

commit c6bb435a0e98d34537d5cb06393251956a24d85a
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 16:08:50 2021 +0200

    skeleton

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/6/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/6/console

This is a working C implementation with tests and decent code coverage (all but error conditions). The run of the tests is valgrind clean. Next step is to run the python tests. Then wrap up and document so it can be reviewed.

create and lookup a Read Shard with a perfect hash

Build has FAILED

Patch application report for D6424 (id=23473)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit dbc5e3244ac76605417f38c134c49ff55960ab92
Author: Loïc Dachary <loic@dachary.org>
Date:   Mon Oct 11 16:08:50 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This new package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal). The content of
    the Read Shard is mmap(2): the module does not maintain an in memory
    cache.
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

Link to build: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/7/
See console output for more information: https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/7/console

@ardumont @olasd this is ready for review... I think :-D

create and lookup a Read Shard with a perfect hash

Build is green

Patch application report for D6424 (id=23474)

Rebasing onto 8229acfd16...

Current branch diff-target is up to date.
Changes applied before test
commit 65f7d06b2bfe0830bef9c15abeded9128673943c
Author: Loïc Dachary <loic@dachary.org>
Date:   Tue Oct 12 16:54:40 2021 +0200

    create and lookup a Read Shard with a perfect hash
    
    This new package is intended to be used by the new object storage, as
    a low level dependency to create and lookup a Read Shard.
    
    It is implemented in C and based on the cmph library for better
    performances. It will be used when a Read Shard must be created with
    around fifty millions objects, totaling around 100GB.
    
    The objects and their key (their cryptographic signature) will be
    retrieved, in python from the postgres database where the Write Shard
    lives. One after the other they will be inserted in the Read Shard
    using the **write** method. In the end the **save** method will create
    the perfect hash table using the cmph library and store it in the
    file (it typically takes a few seconds).
    
    There is no write amplification during the creation of the Read Shard:
    each byte is written exactly once, sequentially. There is no read
    operation. The memory footprint is 2*n*32 where n is the number of
    inserted keys.
    
    The **lookup** method relies on the hash function which is loaded in
    memory when the **load** function is called. It obtains the offset of
    the object by looking up its offset in the file from an index which
    may be up to 2x the number of keys (it is not minimal). The content of
    the Read Shard is mmap(2): the module does not maintain an in memory
    cache.
    
    Signed-off-by: Loïc Dachary <loic@dachary.org>

commit ed6bad56a50530ff091e59ba566cdf45a5349d06
Author: Loïc Dachary <loic@dachary.org>
Date:   Wed Oct 6 16:23:07 2021 +0200

    C stub compiled via cffi and tested in python

See https://jenkins.softwareheritage.org/job/DOPH/job/tests-on-diff/8/ for more details.