Reduce RAM usage for generating mapping files
Open, Normal, Public

Description

Right now, a temporary array is used to put the elements in the correct order before writing them sequentially to the file. However, this uses 2 TB of RAM on the entire graph. Other solutions were tried in D1802, but they were all far too slow to be practical.

Event Timeline

haltode triaged this task as Normal priority (Aug 10 2019, 9:22 AM).
haltode created this task.
haltode created this object in space S1 Public.
zack renamed this task from Implement mapping files dumping with less RAM usage to Reduce RAM usage for generating mapping files (Aug 10 2019, 3:45 PM).
zack changed the status of subtask T1944: use a compact, binary format for node ids mapping files from Open to Work in Progress (Sep 13 2019, 1:19 PM).
zack added subscribers: seirl, zack (Sep 13 2019, 1:22 PM).

Neither end of the spectrum, "fully sort in RAM, then write sequentially" or "write randomly", is satisfactory here.
What we want is in-memory sorting within the limits of available RAM, swapping partially sorted subsets out to disk and back in as needed, followed by a sequential write at the end.
We could implement this in Java in the Setup class but, in fact, that is exactly what /usr/bin/sort is good at. So I propose shelling out to it from Setup and serializing the sorted result through a writer for the binary format of T1944.
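
To make the proposal concrete, here is a minimal sketch of what shelling out to sort from Java could look like. It is not the actual Setup code: the class name `SortShellOut`, the method `sortLines`, and the `sink` callback (standing in for the T1944 binary writer) are all hypothetical, and it assumes the input is a text file with one key per line. The `-S` (in-memory buffer size) and `-T` (directory for temporary spill files) options are those of GNU sort.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper, not part of the real code base.
final class SortShellOut {
    /**
     * Sorts the lines of `input` with the system sort(1) and passes them,
     * in sorted order, to `sink` one at a time. The JVM never holds more
     * than one line; sort itself keeps at most `bufferSize` in RAM and
     * spills partially sorted runs to `tmpDir`.
     */
    static void sortLines(Path input, String bufferSize, Path tmpDir,
                          Consumer<String> sink)
            throws IOException, InterruptedException {
        // "sort" is resolved via PATH; -S and -T assume GNU sort.
        ProcessBuilder pb = new ProcessBuilder(
                "sort", "-S", bufferSize, "-T", tmpDir.toString(),
                input.toString());
        // Byte-wise ordering, independent of the user's locale.
        pb.environment().put("LC_ALL", "C");
        // Forward sort's stderr so its pipe can never fill up and block.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);

        Process sort = pb.start();
        try (BufferedReader sorted = new BufferedReader(new InputStreamReader(
                sort.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = sorted.readLine()) != null) {
                sink.accept(line); // e.g. append to the binary mapping file
            }
        }
        int exit = sort.waitFor();
        if (exit != 0) {
            throw new IOException("sort exited with status " + exit);
        }
    }
}
```

Passing the input as a file argument rather than piping it through stdin avoids having to feed and drain the process at the same time. A call such as `SortShellOut.sortLines(path, "4G", tmpDir, writer::add)` (with `writer` being whatever implements the binary format) would then bound RAM usage by the buffer size given to sort.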