Description

Right now a temporary array is used to put the elements in the correct order before writing them sequentially to the file. However, this uses 2TB of RAM on the entire graph. Other solutions were tried in D1802, but they were all far too slow to be usable in practice.

Revisions and Commits

rDGRPH Compressed graph representation
	rDGRPH6d2f04b4d5a4 Setup.java: shell out node2pid map generation to sort

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T1950 Reduce RAM usage for generating mapping files
		Migrated	gitlab-migration	T1944 use a compact, binary format for node ids mapping files

Event Timeline

haltode triaged this task as Normal priority. Aug 10 2019, 9:22 AM

haltode created this task.

haltode created this object in space S1 Public.

haltode added a subtask: T1944: use a compact, binary format for node ids mapping files.

zack renamed this task from Implement mapping files dumping with less RAM usage to Reduce RAM usage for generating mapping files. Aug 10 2019, 3:45 PM

zack changed the status of subtask T1944: use a compact, binary format for node ids mapping files from Open to Work in Progress. Sep 13 2019, 1:19 PM

Neither end of the spectrum, "fully sort in RAM then write sequentially" or "write randomly", is satisfactory here.
What we want is: in-memory sorting within the limits of available RAM + swapping partially sorted subsets out to disk and back in + a sequential write at the end.
We could implement this in Java in the Setup class, but that is exactly what /usr/bin/sort is good at doing. So I propose to shell out to it from Setup and serialize the sorted output to a writer for the binary format of T1944.
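
For illustration only, here is a rough sketch of what shelling out to sort from Java could look like. It is not the code that landed in Setup.java: the class name SortShellOut, the method sortExternally and the line-based sink are all hypothetical, and it assumes a sort binary is reachable on the PATH (e.g. /usr/bin/sort). The sink stands in for the T1944 binary writer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of the actual Setup class.
final class SortShellOut {
    /**
     * Feeds the given lines to an external sort process and hands each
     * sorted line to the sink, in order. Extra arguments are passed
     * through to sort unchanged.
     */
    static void sortExternally(Iterable<String> lines, Consumer<String> sink, String... sortArgs)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("sort"); // assumption: resolved via PATH
        for (String arg : sortArgs) {
            cmd.add(arg);
        }

        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("LC_ALL", "C"); // plain byte-wise ordering
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // never block on stderr
        Process sort = pb.start();

        // sort has to read all of its input before it can emit anything,
        // so writing everything first and reading afterwards cannot deadlock.
        try (BufferedWriter toSort = new BufferedWriter(
                new OutputStreamWriter(sort.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                toSort.write(line);
                toSort.write('\n');
            }
        } // closing stdin tells sort that the input is complete

        // Read the sorted stream sequentially and pass it on to the sink.
        try (BufferedReader fromSort = new BufferedReader(
                new InputStreamReader(sort.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = fromSort.readLine()) != null) {
                sink.accept(line);
            }
        }

        int status = sort.waitFor();
        if (status != 0) {
            throw new IOException("sort exited with status " + status);
        }
    }
}
```

With GNU sort, the in-memory buffer and the spill directory can be bounded by passing for instance "-S", "1G", "-T", "/some/tmp/dir" as extra arguments (both values are placeholders), which is what keeps RAM usage under control while partially sorted runs go to disk.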