Differential D7839 Diff 28444 docs/memory.rst

Changeset View

Standalone View

View Options

docs/memory.rst

This file was added.

.. _swh-graph-memory:

JaredR26Unsubmitted

Not Done

I suggest adding a header for java types and performance suggestions.

If someone is intending to traverse or iterate the big graph for anything meaningful, they almost certainly want to plan to do so multithreaded from the very beginning of their code.

Here's a sample approach I have adopted. I have no idea if it is good or bad or can be improved, but it does function. This code should iterate all of the labelnames with NUM_THREADS and count the number of ".cpp" labelnames that are longer than 50 characters, and return the longest filename found & print both (not something I need to do but a simple example with enough depth to be modified to fit many uses I think).

		class longFileNameReturn { public String longestName; public long numLongNamesFound; }
		class longFileNameFinder implements Callable<longFileNameReturn> {
			public int workerId;
			public int numWorkers;
			public String graphPath;
			public longFileNameReturn call() throws IOException {
				SwhGraphProperties props = SwhGraphProperties.load(graphPath);
				props.loadContentLength();
				String longestFilename = "";
				long numLongFilesFound = 0;
				long numFilenames = getMPHSize(NodeIdMap.loadMph(graphPath + ".labels.mph"));
				// Iterate...
				for (long currNameId = workerId; currNameId < numFilenames; currNameId += numWorkers) {
					String labelName = new String(props.getLabelName(currNameId));
					if (labelName.endsWith(".cpp") && labelName.length() > 50) {
						System.out.println("Found long filename " + labelName);
						numLongFilesFound++;
						if (labelName.length() > longestFilename.length()) {
							longestFilename = labelName;
						}
					}
				}
				var ret = new longFileNameReturn();
				ret.longestName = longestFilename;
				ret.numLongNamesFound = numLongFilesFound;
				return ret;
			}
		}
		ExecutorService executorService = Executors.newFixedThreadPool(NUM_THREADS);
		var results = new ArrayList<Future<longFileNameReturn>>();
		for(int i = 0;i<NUM_THREADS;i++) {
			var worker = new longFileNameFinder();
			worker.workerId = i;
			worker.numWorkers = NUM_THREADS;
			worker.graphPath = graphPath;
			results.add(executorService.submit(worker));
		}
		for(int i = 0;i<NUM_THREADS;i++) {
			var res = results.get(i).get();
			System.out.println("Thread "+i+" found "+res.numLongNamesFound+" long filenames and the longest was: "+res.longestName);
		}
		executorService.shutdown();

The DSI libraries are your best friend. If tracking nodes you've already seen when traversing, a HashSet<Long> stores 16 bytes per node touched and 8 bytes for empty buckets in the HashSet. Traversing the whole full dataset would oversize the array and fail, but even if you used multiple arrays it would take up ~328 gigabytes of ram just to track seen nodes. If you instead make a dsi.bits.LongArrayBitVector, you only need 1 bit per node in the graph, or about 2.5 gigabytes of ram, and a single array version of that will continue working until the SWH dataset has ~137 billion nodes.

Similarly, LongArrayList saves at least 8 bytes per entry over ArrayList<Long> and Long2LongOpenHashMap saves at least 16 bytes over every entry over HashMap<Long, Long>.

When traversing, always always track and curtail repeated visits to the same nodes. The SWH graph provides an enormous number of paths to reach the same set of nodes. A simple single-origin traversal that appears to work fine on a randomly selected origin in the python3k dataset might take days on the full dataset if repeated node visits aren't short-circuited.

Those are the tips I have that I think would be good to add somewhere on this page.

JaredR26: I suggest adding a header for java types and performance suggestions. 1. If someone is…

JaredR26Unsubmitted

Done

Oh! Oh! And make sure to mention the graph.copy() process for multithreaded! My example doesn't show that, but it's super important that that's documented somewhere when multithreaded or graph loading it mentioned.

JaredR26: Oh! Oh! And make sure to mention the graph.copy() process for multithreaded! My example…

Memory & Performance tuning

===========================

This page discusses various considerations related to memory usage and

performance tuning when using the ``swh-graph`` library to load large

compressed graphs.

JVM options

-----------

In production, we tend to use very large servers which have enough RAM to load

the entire graph in RAM. In these setups, the default JVM options are often

suboptimal. We recommend to start the JVM with the following options, which

tend to significantly improve performance::

java \

-ea \

-server \

-XX:PretenureSizeThreshold=512M \

-XX:MaxNewSize=4G \

-XX:+UseLargePages \

-XX:+UseTransparentHugePages \

-XX:+UseNUMA \

-XX:+UseTLAB \

-XX:+ResizeTLAB \

JaredR26Unsubmitted

Done

tend to significantly improve performance::

java \

+ -ea \

-server \

-XX:PretenureSizeThreshold=512M \

-XX:MaxNewSize=4G \

-XX:+UseLargePages \

-XX:+UseTransparentHugePages \

-XX:+UseNUMA \

-XX:+UseTLAB \

-XX:+ResizeTLAB \

These options are documented in the manual of ``java(1)`` the Oracle

Suggest adding -ea to enable assertions by default. I didn't realize Java disabled them by default until later. If they want them disabled, they will figure it out when they have a need to.

JaredR26: Suggest adding -ea to enable assertions by default. I didn't realize Java disabled them by…

These options are documented in the manual of ``java(1)`` the Oracle

documentation.

Temporary directory

-------------------

Many of the graph algorithms (either for compression or traversal) tend to

offload some of their run-time memory to disk. For instance, the `BFS

<https://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_

algorithm in the LAW library uses a temporary directory to write its queue of

nodes to visit.

Because these can be quite large and sometimes overflow the default ``/tmp``

partition, it is advised to systematically specify a path to a local temporary

directory with enough space to accommodate the needs of the Java programs. This

can be done using the ``-Djava.io.tmpdir`` parameter on the Java CLI::

java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp

Memory mapping vs Direct loading

--------------------------------

The main dial you can use to manage your memory usage is to chose between

memory-mapping and direct-loading the graph data. The different loading modes

available when loading the graph are documented in :ref:`swh-graph-java-api`.

Loading in mapped mode will not load any extra data in RAM, but will instead

use the ``mmap(1)`` syscall to put the graph file located on disk in the

virtual address space. The Linux kernel will then be free to arbitrarily cache

the file, either partially or in its entirety, depending on the available

memory space.

In our experiments, memory-mapping a small graph from a SSD only incurs a

relatively small slowdown (about 15-20%). However, when the graph is too big to

fit in RAM, the kernel has to constantly invalidate pages to cache newly

accessed sections, which incurs a very large performance penalty. A full

traversal of a large graph that usually takes about 20 hours when loaded in

main memory could take more than a year when mapped from a hard drive!

When deciding what to direct-load and what to memory-map, here are a few rules

of thumb:

- If you don't need random access to the graph edges, you can consider using

the "offline" loading mode. The offsets won't be loaded which will save

vlorentzUnsubmitted

Done

what use cases does this apply to?

vlorentz: what use cases does this apply to?

seirlAuthorUnsubmitted

Done

Anything where you can just scan the graph sequentially, like if you want to compute stats on indegrees / outdegrees, count nodes, etc.

seirl: Anything where you can just scan the graph sequentially, like if you want to compute stats on…

dozens of gigabytes of RAM.

- If you only need to query some specific nodes or run trivial traversals,

memory-mapping the graph from a HDD should be a reasonable solution that

doesn't take an inordinate amount of time. It might be bad for your disks,

though.

- If you are constrained in available RAM, memory-mapping the graph from an SSD

offers reasonable performance for reasonably complex algorithms.

- If you have a heavy workload (i.e. running a full traversal of the entire

graph) and you can afford the RAM, direct loading will be orders of magnitude

faster than all the above options.

Sharing mapped data across processes

------------------------------------

Often, multiple processes can be working on the same data (mappings or the

graph itself), for instance when running different experiments on the same

graph. This is problematic in terms of RAM usage when using direct memory

loading, as the same data of potentially hundreds of gigabytes is loaded in

memory twice.

As we have seen, memory-mapping can be used to avoid storing redundant data in

RAM, but comes at the cost of potentially slower I/O as the data is no longer

guaranteed to be loaded in main memory and is reliant on kernel heuristics.

To efficiently share data across two different compressed graph processes,

another option is to copy graph data to a ``tmpfs`` not backed by a disk swap,

which forces the kernel to load it entirely in RAM. Subsequent memory-mappings

of the files stored in the tmpfs will simply map the data stored in RAM to

virtual memory pages, and return a pointer to the in-memory structure.

To do so, we create a directory in ``/dev/shm`` in which we **copy** all the

files that we want to direct-load in RAM, and **symlink** all the others. Then,

we load the graph using the memory-mapped loading mode, which makes it use the

shared memory stored in the tmpfs under the hood.

Here is a systemd service that can be used to perform this task automatically:

.. code-block:: ini

[Unit]

Description=swh-graph memory sharing in tmpfs

[Service]

Type=oneshot

RemainAfterExit=yes

ExecStart=mkdir -p /dev/shm/swh-graph/default

ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default"

ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default

vlorentzUnsubmitted

Done

why sh?

vlorentz: why `sh`?

seirlAuthorUnsubmitted

Done

Because I need globbing

seirl: Because I need globbing

seirlAuthorUnsubmitted

Done

(Not here actually, fixing this.)

seirl: (Not here actually, fixing this.)

ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default

ExecStop=rm -rf /dev/shm/swh-graph/default

[Install]

WantedBy=multi-user.target