Right now we are generating two different binary mappings on disk for
the translation between SWHID <-> webgraph node ID:
- The node2swh.bin reverse map, which contains a list of binary SWHID. It allows O(1) access to the SWHID of a given node n by seek()ing to the position (n * record size).
- The swhid2node.bin map, which contains a list of <SWHID, node ID> ordered by SWHID. The node ID of a given SWHID can be found in O(log N) by doing a binary search on this list.
Because the swhid -> node binary search requires multiple seek() on the
disk, it is very slow and cannot be used for performance sensitive code.
The alternative route is to compute the node from the minimal perfect
hash function and then lookup the equivalent node in the permuted graph
using the .order file containing the permutation, which can be done in
O(1) and is extremely fast. However, this has two caveats:
- The .order file is ~150 GB, which would be too big to load in memory.
- MPH cannot check whether their input is valid. They could do so probabilistically if we signed them, but when replying to a query in the graph service, we want to be absolutely certain that a node is or is not present in the graph.
This code mitigates these problems in two ways. First, it memory-maps
the permutation file to avoid loading it in main memory. Second, it uses
a roundtrip check to detect invalid SWHIDs: we hash + permute the
SWHID, then use the reverse map to check that the obtained node ID is
associated to the original SWHID.
This is a big performance gain (basic benchmarks show a ~x3 speedup).
To go even faster, we offer a boolean option to skip the roundtrip
check, to use when we know that the input is always valid.
This will also allow us in the future to remove the swhid2node map
completely, however it is currently still in use by the Python frontend
to encode the SWHIDs. This will be done directly in the Java side in the
future.