Right now a temporary array is used to correctly order elements and then write sequentially to the file. However this uses 2TB of RAM on the entire graph. Other solutions were tried in D1802 but they all were way to slow to be doable in practice.
Description
Description
Revisions and Commits
Revisions and Commits
Status | Assigned | Task | ||
---|---|---|---|---|
Migrated | gitlab-migration | T1950 Reduce RAM usage for generating mapping files | ||
Migrated | gitlab-migration | T1944 use a compact, binary format for node ids mapping files |
Event Timeline
Comment Actions
Neither of the two spectrum endpoints "fully sort in RAM then write sequentially" and "write randomly" is satisfactory here.
What we want is: in memory sorting within the limits allowed by available RAM + swapon/swapoff of partially sorted subsets + sequential write at the end.
We can implement this in Java in the Setup class, but, in fact, that is exactly what /usr/bin/sort is good at doing. So I propose to shell out to it from Setup and serialize sort result to a writer for the binary format of T1944.