We need a more compact way of storing the PID <-> long id mappings (e.g., a binary file).
Should also help with D1802.
rDGRPH Compressed graph representation
rDGRPH998a44353612 switch Java map generation from CSV to binary format
rDGRPH7c40a7d2b722 switch Java map generation from CSV to binary format
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T1950 Reduce RAM usage for generating mapping files
Migrated | gitlab-migration | T1944 use a compact, binary format for node ids mapping files
Status update: we now have binary serialization formats for the two maps; see the docstrings of PidToIntMap and IntToPidMap in swh.graph.pid.
That means Python code can read the compact maps (and also write them, but too slowly to be practical for generation). Conversion of the textual maps generated for the most recent compressed graph is ongoing and almost complete.
Generating those maps will need to happen on the Java side, and that work is still pending.
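To illustrate the idea behind a compact PID -> long map, here is a minimal sketch of a fixed-size-record file that is sorted by PID and searched by bisection. It is not the format documented in swh.graph.pid: the record layout (1 byte version, 1 byte type code, 20 byte hash, 8 byte big-endian long), the textual PID shape `swh:<version>:<type>:<40 hex digits>`, the type code ordering, and all function names are assumptions made for this example.

```python
import struct

# Hypothetical layout: 22-byte binary PID followed by a big-endian signed long.
PID_STRUCT = struct.Struct(">BB20s")  # version, type code, 20-byte hash
RECORD = struct.Struct(">22sq")       # binary PID + node id, 30 bytes in total

# Assumed type codes; the ordering is arbitrary and only for illustration.
TYPES = ["cnt", "dir", "rev", "rel", "snp", "ori"]


def pid_to_bytes(pid: str) -> bytes:
    """Pack a textual PID such as 'swh:1:cnt:<40 hex digits>' into 22 bytes."""
    _prefix, version, obj_type, hex_hash = pid.split(":")
    return PID_STRUCT.pack(int(version), TYPES.index(obj_type), bytes.fromhex(hex_hash))


def bytes_to_pid(raw: bytes) -> str:
    """Inverse of pid_to_bytes."""
    version, type_idx, digest = PID_STRUCT.unpack(raw)
    return f"swh:{version}:{TYPES[type_idx]}:{digest.hex()}"


def dump_pid_to_int(pairs) -> bytes:
    """Serialize (pid, node_id) pairs as fixed-size records sorted by binary PID."""
    records = sorted((pid_to_bytes(pid), node_id) for pid, node_id in pairs)
    return b"".join(RECORD.pack(raw, node_id) for raw, node_id in records)


def lookup(buf, pid: str) -> int:
    """Binary search over the sorted records; raises KeyError if the PID is absent."""
    key = pid_to_bytes(pid)
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, node_id = RECORD.unpack_from(buf, mid * RECORD.size)
        if raw == key:
            return node_id
        if raw < key:
            lo = mid + 1
        else:
            hi = mid
    raise KeyError(pid)
```

Because `lookup` only needs `len()` and `unpack_from`, the same code would work on an `mmap` of a file on disk, so the map would not have to be loaded into RAM. Under the assumed layout each record takes 30 bytes, whereas the textual PID alone is 50 characters.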