use a compact, binary format for node ids mapping files
Started, Work in Progress, Normal, Public

Description

We need a more compact way of storing the PID <-> long id mappings (i.e., a binary file format instead of the current textual one).

Should also help with D1802.

Event Timeline

haltode triaged this task as Normal priority. Aug 8 2019, 10:29 AM
haltode created this task.
haltode created this object in space S1 Public.
zack renamed this task from "More compact format for node ids mapping files" to "use a compact, binary format for node ids mapping files". Aug 8 2019, 1:00 PM
zack changed the task status from Open to Work in Progress. Sep 13 2019, 1:19 PM
zack added subscribers: seirl, zack.

Status update: we now have binary serialization formats for the two maps; see the docstrings of PidToIntMap and IntToPidMap in swh.graph.pid.
That means Python code can read the compact maps. It can also write them, but too slowly to be practical for generating them. Conversion of the textual maps generated for the most recent compressed graph is ongoing and almost complete.

The generation of those maps will need to happen on the Java side, and it's still pending.
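
For readers who want a feel for what such a compact format can look like, here is a minimal sketch. It is not the swh.graph.pid implementation (the docstrings of PidToIntMap and IntToPidMap are the reference). The record layout is an assumption made for illustration: a 22-byte binary PID (version, object type code, SHA1) followed by an 8-byte big-endian long, with records sorted by PID so that lookups can use binary search. The type codes and all function names below are hypothetical.

```python
import struct

# Assumed layout, for illustration only (not taken from swh.graph.pid):
#   PID -> int record: 22-byte binary PID + 8-byte big-endian signed long
#   int -> PID record: 22-byte binary PID, stored at index == node id
PID_BIN_FMT = ">BB20s"  # version, object type code, SHA1 digest
INT_BIN_FMT = ">q"
PID_BIN_SIZE = struct.calcsize(PID_BIN_FMT)  # 22 bytes
RECORD_SIZE = PID_BIN_SIZE + struct.calcsize(INT_BIN_FMT)  # 30 bytes

# Hypothetical type codes: the position of the type in this list.
PID_TYPES = ["cnt", "dir", "rel", "rev", "snp"]


def pid_to_bytes(pid: str) -> bytes:
    """Pack a textual PID ('swh:1:cnt:<40 hex digits>') into 22 bytes."""
    scheme, version, obj_type, hex_hash = pid.split(":")
    if scheme != "swh" or len(hex_hash) != 40:
        raise ValueError(f"not a valid PID: {pid!r}")
    return struct.pack(
        PID_BIN_FMT, int(version), PID_TYPES.index(obj_type), bytes.fromhex(hex_hash)
    )


def bytes_to_pid(raw: bytes) -> str:
    """Inverse of pid_to_bytes."""
    version, type_code, digest = struct.unpack(PID_BIN_FMT, raw)
    return f"swh:{version}:{PID_TYPES[type_code]}:{digest.hex()}"


def dump_pid_to_int(pairs) -> bytes:
    """Serialize (pid, node_id) pairs as fixed-size records sorted by binary PID."""
    records = sorted((pid_to_bytes(pid), node_id) for pid, node_id in pairs)
    return b"".join(raw + struct.pack(INT_BIN_FMT, nid) for raw, nid in records)


def lookup_int(data: bytes, pid: str) -> int:
    """Binary search over the sorted fixed-size records; KeyError if absent."""
    target = pid_to_bytes(pid)
    lo, hi = 0, len(data) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        offset = mid * RECORD_SIZE
        key = data[offset:offset + PID_BIN_SIZE]
        if key == target:
            return struct.unpack_from(INT_BIN_FMT, data, offset + PID_BIN_SIZE)[0]
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    raise KeyError(pid)


def dump_int_to_pid(pids_in_id_order) -> bytes:
    """Serialize PIDs so that the record at index i belongs to node id i."""
    return b"".join(pid_to_bytes(pid) for pid in pids_in_id_order)


def lookup_pid(data: bytes, node_id: int) -> str:
    """Direct offset computation: no search needed for int -> PID."""
    offset = node_id * PID_BIN_SIZE
    if node_id < 0 or offset + PID_BIN_SIZE > len(data):
        raise IndexError(node_id)
    return bytes_to_pid(data[offset:offset + PID_BIN_SIZE])
```

Under these assumptions a PID -> int entry takes 30 bytes, whereas the textual PID alone is 50 characters. Fixed-size records also mean the file can be searched in place (for instance through a memory map) rather than loaded into a dictionary.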