Right now a temporary array is used to order elements correctly before writing them sequentially to the file. However, this uses 2TB of RAM on the entire graph. Other solutions were tried in D1802, but they were all way too slow to be usable in practice.
|Resolved||zack||T1950 Reduce RAM usage for generating mapping files|
|Resolved||zack||T1944 use a compact, binary format for node ids mapping files|
- Mentioned In
- rDGRPH545be725d34b: webgraph.py: autoatically generate mappings at the end of compression
- Mentioned Here
- rDGRPH6d2f04b4d5a4: Setup.java: shell out node2pid map generation to sort
- T1944: use a compact, binary format for node ids mapping files
- D1802: [WIP] server: setup: use RandomAccessFile instead of temporary array
Neither end of the spectrum, "fully sort in RAM then write sequentially" or "write randomly", is satisfactory here.
What we want is: in-memory sorting within the limits of available RAM, swapping partially sorted subsets out to disk and back in as needed, and a sequential write at the end.
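This middle ground amounts to an external merge sort. As a rough illustration only, here is a minimal sketch assuming line-oriented text records that sort lexicographically; the class name `ExternalSortSketch`, the `maxLinesInRam` parameter, and the `sink` callback are hypothetical and do not come from Setup.java.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical sketch: sort more lines than fit in RAM by spilling sorted runs to disk.
final class ExternalSortSketch {

    /** One spilled run, opened for reading, with its current smallest line in `head`. */
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        void advance() throws IOException {
            head = reader.readLine();
        }
    }

    /** Sorts `input` holding at most `maxLinesInRam` (>= 1) lines in memory; emits lines in order to `sink`. */
    static void sort(Iterator<String> input, int maxLinesInRam, Consumer<String> sink)
            throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();

        // Phase 1: fill the buffer, sort it in RAM, spill it to a temporary file.
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= maxLinesInRam) {
                runs.add(spill(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            runs.add(spill(buffer));
            buffer.clear();
        }

        // Phase 2: k-way merge of the sorted runs; the output is produced sequentially.
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Path p : runs) {
            Run r = new Run(p);
            if (r.head != null) heap.add(r); else r.reader.close();
        }
        while (!heap.isEmpty()) {
            Run r = heap.poll();
            sink.accept(r.head);
            r.advance();
            if (r.head != null) heap.add(r); else r.reader.close();
        }
        for (Path p : runs) Files.deleteIfExists(p);
    }

    /** Sorts the buffer in place and writes it, one line per record, to a new temporary file. */
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("run-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String s : buffer) {
                w.write(s);
                w.newLine();
            }
        }
        return file;
    }
}
```

Peak memory is bounded by the buffer plus one pending line per run, and every file is read and written sequentially.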
We could implement this in Java in the Setup class but, in fact, that is exactly what /usr/bin/sort is good at. So I propose to shell out to it from Setup and serialize the sorted result through a writer for the binary format of T1944.
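To make the proposal concrete, here is a minimal sketch of shelling out to sort from Java. It is not the code in Setup.java: the class name `SortShellOut` and the `sink` callback are hypothetical, records are assumed to be newline-terminated text lines, and the callback stands in for whatever writer T1944 ends up defining. The sketch resolves `sort` through `PATH` and sets `LC_ALL=C` for bytewise ordering; with GNU sort one could additionally pass `-S` (buffer size) and `-T` (temporary directory) to control memory and spill location.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical sketch: delegate sorting to the external sort(1) utility.
final class SortShellOut {

    /** Pipes `lines` through sort(1) and hands each sorted line to `sink`. */
    static void sortLines(Iterable<String> lines, Consumer<String> sink)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("sort");
        // Bytewise collation: faster and independent of the caller's locale.
        pb.environment().put("LC_ALL", "C");
        // Forward sort's stderr so a full pipe can never block the child.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process sort = pb.start();

        // sort emits nothing until it has read all of its input, so it is safe
        // to write everything first and only then start reading.
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(sort.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        } // closing stdin signals end of input

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(sort.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sink.accept(line); // e.g. forward to a binary mapping writer
            }
        }

        int status = sort.waitFor();
        if (status != 0) {
            throw new IOException("sort exited with status " + status);
        }
    }
}
```

Because the caller only streams lines in and out, RAM usage on the Java side stays constant; spilling and merging are left entirely to sort.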