
NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID
ClosedPublic

Authored by seirl on Apr 6 2021, 3:43 PM.

Details

Summary

Right now we are generating two different binary mappings on disk for
the translation between SWHID <-> webgraph node ID:

  1. The node2swh.bin reverse map, which contains a list of binary SWHIDs. It allows O(1) access to the SWHID of a given node n by seek()ing to the position (n * record size).
  2. The swhid2node.bin map, which contains a list of <SWHID, node ID> pairs ordered by SWHID. The node ID of a given SWHID can be found in O(log N) by doing a binary search on this list.
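For illustration, the two lookup schemes can be sketched as follows. This is a simplified sketch: the record layout, sizes, and class names are invented for the example and are not the actual swh-graph on-disk format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class MapLookupSketch {
    static final int SWHID_SIZE = 20;          // hypothetical fixed binary SWHID size
    static final int RECORD = SWHID_SIZE + 8;  // <SWHID, long node ID> pair

    // node2swh.bin style: record n holds the SWHID of node n -> O(1)
    static byte[] nodeToSwhid(ByteBuffer node2swh, long node) {
        byte[] swhid = new byte[SWHID_SIZE];
        node2swh.position((int) (node * SWHID_SIZE));  // "seek" to n * record size
        node2swh.get(swhid);
        return swhid;
    }

    // swhid2node.bin style: <SWHID, node ID> pairs sorted by SWHID -> O(log N)
    static long swhidToNode(ByteBuffer swhid2node, int count, byte[] target) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            byte[] key = new byte[SWHID_SIZE];
            swhid2node.position(mid * RECORD);          // each probe is a seek
            swhid2node.get(key);
            int cmp = Arrays.compare(key, target);
            if (cmp == 0) return swhid2node.getLong();  // node ID follows the key
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;  // not found
    }
}
```

On a real on-disk file each binary-search probe translates to a seek(), which is what makes the second scheme slow compared to the single positioned read of the first.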

Because the swhid -> node binary search requires multiple seek() calls on the
disk, it is very slow and cannot be used in performance-sensitive code.

The alternative route is to compute the node from the minimal perfect
hash function and then look up the equivalent node in the permuted graph
using the .order file containing the permutation, which can be done in
O(1) and is extremely fast. However, this has two caveats:

  • The .order file is ~150 GB, which would be too big to load in memory.
  • MPHs cannot check whether their input is valid. They could do so probabilistically if we signed them, but when replying to a query in the graph service, we want to be absolutely certain that a node is or is not present in the graph.

This code mitigates these problems in two ways. First, it memory-maps
the permutation file to avoid loading it in main memory. Second, it uses
a roundtrip check to detect invalid SWHIDs: we hash + permute the
SWHID, then use the reverse map to check that the obtained node ID is
associated to the original SWHID.

This is a big performance gain (basic benchmarks show a ~3x speedup).
To go even faster, we offer a boolean option to skip the roundtrip
check, for use when we know that the input is always valid.
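Putting the fast path and the roundtrip check together, the lookup amounts to something like the sketch below. This is illustrative only: `mph`, `order`, and `node2swh` stand in for the real MPH function, the mmapped .order permutation, and the reverse map, and all names and signatures here are invented for the example.

```java
import java.util.Arrays;
import java.util.function.LongFunction;
import java.util.function.LongUnaryOperator;
import java.util.function.ToLongFunction;

public class NodeIdLookupSketch {
    private final ToLongFunction<byte[]> mph;    // minimal perfect hash: SWHID -> unpermuted id
    private final LongUnaryOperator order;       // mmapped .order: unpermuted id -> node id, O(1)
    private final LongFunction<byte[]> node2swh; // reverse map: node id -> SWHID
    private final long numNodes;

    NodeIdLookupSketch(ToLongFunction<byte[]> mph, LongUnaryOperator order,
                       LongFunction<byte[]> node2swh, long numNodes) {
        this.mph = mph;
        this.order = order;
        this.node2swh = node2swh;
        this.numNodes = numNodes;
    }

    /** SWHID -> node id; returns -1 for unknown SWHIDs when check is true. */
    long getNodeId(byte[] swhid, boolean check) {
        long h = mph.applyAsLong(swhid);   // an MPH returns *some* value even for bad input
        if (h < 0 || h >= numNodes) return -1;
        long node = order.applyAsLong(h);  // single O(1) read in the mmapped permutation
        if (check && !Arrays.equals(node2swh.apply(node), swhid)) {
            return -1;                     // roundtrip check rejects invalid SWHIDs
        }
        return node;  // with check == false, invalid input yields a bogus node silently
    }
}
```

With check set to false the reverse-map read is skipped, which is faster but, as noted in the review, silently returns an inconsistent node ID for invalid input.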

This will also allow us to remove the swhid2node map completely in the
future; however, it is currently still in use by the Python frontend
to encode the SWHIDs. This will eventually be done directly on the Java
side.

Test Plan

Local benchmarks + running the Java tests.

Diff Detail

Event Timeline

Build is green

Patch application report for D5427 (id=19397)

Rebasing onto f055c4eaf0...

Current branch diff-target is up to date.
Changes applied before test
commit 40a8f4a0844bc00e6b27239dc5842140b6ad1d89
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Apr 6 15:23:14 2021 +0200

    NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID
    

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/72/ for more details.

seirl requested review of this revision. Apr 6 2021, 3:46 PM
vlorentz added a subscriber: vlorentz.

This sounds great!

Could you document it somewhere outside the code/history, eg. by copy-pasting your commit message in docs/?

java/src/main/java/org/softwareheritage/graph/maps/NodeIdMap.java

Line 40: to introduce the acronym

Line 79: if false, then invalid/inconsistent results will be returned without any warning

Line 84: let's not assume the reader of this function also read the previous one

This revision now requires changes to proceed. Apr 7 2021, 10:33 AM

Thanks for the review. I don't think this needs to be documented elsewhere, it just describes why we're doing the change. What should be documented instead is why we're using these data structures in the first place. Right now this is done sparsely in the different source files, and this commit updates the already existing documentation.

(oops, sent too early)

Where are the .order and MPH computed?

> Where are the .order and MPH computed?

In the graph compression pipeline, we already have them available at this step.

> Thanks for the review. I don't think this needs to be documented elsewhere, it just describes why we're doing the change. What should be documented instead is why we're using these data structures in the first place.

Sure

> Right now this is done sparsely in the different source files, and this commit updates the already existing documentation.

I don't think that's good enough. We should have an overview of swh-graph's design that doesn't require reading all the code in an unspecified order.
And reading the code does not give a rationale for the decision.

I'm not saying the current state of the docs is good enough; I'm saying this commit message doesn't explain the design, but why we're moving away from the old binary-search solution. The new way of doing things is a lot more natural since we already have the MPH and the .order file, so there's no need to document in the main docs why the old solution was bad.

> I don't think that's good enough. We should have an overview of swh-graph's design that doesn't require reading all the code in an unspecified order.
> And reading the code does not give a rationale for the decision.

You raise a good point here @vlorentz, but it's outside the scope of this diff. It would be nice to open a separate task for this request; your point of view on what's actually needed, as someone less familiar than @seirl with the code, would be very welcome there.

> The new way of doing things is a lot more natural since we already have the MPH and the .order file

Newcomers aren't aware of this. I had no idea we had those before reading this diff.

And using an MPH + a permutation + a lookup in a reverse map is not more natural than simply a lookup in a map.

> but it's outside the scope of this diff

I'm not asking this diff to document the entire design of swh-graph, just the part that was changed.

> The new way of doing things is a lot more natural since we already have the MPH and the .order file

Newcomers aren't aware of this. I had no idea we had those before reading this diff.

That part is documented here: https://docs.softwareheritage.org/devel/swh-graph/compression.html

> And using an MPH + a permutation + a lookup in a reverse map is not more natural than simply a lookup in a map.

It is more natural, because the MPH + permutation is what allows us to build the map in the first place. We're just removing the map step here.

> I'm not asking this diff to document the entire design of swh-graph, just the part that was changed.

But I can't do "just" that since the design doc of swh-graph where I would put this part doesn't exist yet, so the only appropriate place to put this in the meantime is the code. I'm happy to file a task to document the data model in more detail.

In D5427#138056, @seirl wrote:

> That part is documented here: https://docs.softwareheritage.org/devel/swh-graph/compression.html

> > And using an MPH + a permutation + a lookup in a reverse map is not more natural than simply a lookup in a map.

Great, thanks! Can you add a link to this page from the docstring at the top?

> It is more natural, because the MPH + permutation is what allows us to build the map in the first place. We're just removing the map step here.

It's only natural if you're familiar with swh-graph's design...

This revision is now accepted and ready to land. Apr 7 2021, 3:06 PM
  • Fix reviews
  • Add backward compatibility for loading MPH on strings

Build is green

Patch application report for D5427 (id=19580)

Rebasing onto 15c2da0f08...

Current branch diff-target is up to date.
Changes applied before test
commit 4f751998c69c719173e02d303a14c8b684d650b1
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Fri Apr 9 15:45:00 2021 +0200

    NodeIdMap: add backward compatibility for loading MPH on strings

commit 53bbd5c65cbe6cacc398f8b0a08696f520b2ba0f
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Apr 6 15:23:14 2021 +0200

    NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID
    

See https://jenkins.softwareheritage.org/job/DGRPH/job/tests-on-diff/74/ for more details.