Changeset View
Changeset View
Standalone View
Standalone View
docs/memory.rst
- This file was added.
.. _swh-graph-memory: | |||||||||||||||||||
JaredR26: I suggest adding a header for java types and performance suggestions.
1. If someone is… | |||||||||||||||||||
Done Inline ActionsOh! Oh! And make sure to mention the graph.copy() process for multithreaded! My example doesn't show that, but it's super important that that's documented somewhere when multithreaded or graph loading it mentioned. JaredR26: Oh! Oh! And make sure to mention the graph.copy() process for multithreaded! My example… | |||||||||||||||||||
Memory & Performance tuning | |||||||||||||||||||
=========================== | |||||||||||||||||||
This page discusses various considerations related to memory usage and | |||||||||||||||||||
performance tuning when using the ``swh-graph`` library to load large | |||||||||||||||||||
compressed graphs. | |||||||||||||||||||
JVM options | |||||||||||||||||||
----------- | |||||||||||||||||||
In production, we tend to use very large servers which have enough RAM to load | |||||||||||||||||||
the entire graph in RAM. In these setups, the default JVM options are often | |||||||||||||||||||
suboptimal. We recommend to start the JVM with the following options, which | |||||||||||||||||||
tend to significantly improve performance:: | |||||||||||||||||||
java \ | |||||||||||||||||||
-ea \ | |||||||||||||||||||
-server \ | |||||||||||||||||||
-XX:PretenureSizeThreshold=512M \ | |||||||||||||||||||
-XX:MaxNewSize=4G \ | |||||||||||||||||||
-XX:+UseLargePages \ | |||||||||||||||||||
-XX:+UseTransparentHugePages \ | |||||||||||||||||||
-XX:+UseNUMA \ | |||||||||||||||||||
-XX:+UseTLAB \ | |||||||||||||||||||
-XX:+ResizeTLAB \ | |||||||||||||||||||
Done Inline Actions
Suggest adding -ea to enable assertions by default. I didn't realize Java disabled them by default until later. If they want them disabled, they will figure it out when they have a need to. JaredR26: Suggest adding -ea to enable assertions by default. I didn't realize Java disabled them by… | |||||||||||||||||||
These options are documented in the manual of ``java(1)`` the Oracle | |||||||||||||||||||
documentation. | |||||||||||||||||||
Temporary directory | |||||||||||||||||||
------------------- | |||||||||||||||||||
Many of the graph algorithms (either for compression or traversal) tend to | |||||||||||||||||||
offload some of their run-time memory to disk. For instance, the `BFS | |||||||||||||||||||
<https://law.di.unimi.it/software/law-docs/it/unimi/dsi/law/big/graph/BFS.html>`_ | |||||||||||||||||||
algorithm in the LAW library uses a temporary directory to write its queue of | |||||||||||||||||||
nodes to visit. | |||||||||||||||||||
Because these can be quite large and sometimes overflow the default ``/tmp`` | |||||||||||||||||||
partition, it is advised to systematically specify a path to a local temporary | |||||||||||||||||||
directory with enough space to accommodate the needs of the Java programs. This | |||||||||||||||||||
can be done using the ``-Djava.io.tmpdir`` parameter on the Java CLI:: | |||||||||||||||||||
java -Djava.io.tmpdir=/srv/softwareheritage/ssd/tmp | |||||||||||||||||||
Memory mapping vs Direct loading | |||||||||||||||||||
-------------------------------- | |||||||||||||||||||
The main dial you can use to manage your memory usage is to chose between | |||||||||||||||||||
memory-mapping and direct-loading the graph data. The different loading modes | |||||||||||||||||||
available when loading the graph are documented in :ref:`swh-graph-java-api`. | |||||||||||||||||||
Loading in mapped mode will not load any extra data in RAM, but will instead | |||||||||||||||||||
use the ``mmap(1)`` syscall to put the graph file located on disk in the | |||||||||||||||||||
virtual address space. The Linux kernel will then be free to arbitrarily cache | |||||||||||||||||||
the file, either partially or in its entirety, depending on the available | |||||||||||||||||||
memory space. | |||||||||||||||||||
In our experiments, memory-mapping a small graph from a SSD only incurs a | |||||||||||||||||||
relatively small slowdown (about 15-20%). However, when the graph is too big to | |||||||||||||||||||
fit in RAM, the kernel has to constantly invalidate pages to cache newly | |||||||||||||||||||
accessed sections, which incurs a very large performance penalty. A full | |||||||||||||||||||
traversal of a large graph that usually takes about 20 hours when loaded in | |||||||||||||||||||
main memory could take more than a year when mapped from a hard drive! | |||||||||||||||||||
When deciding what to direct-load and what to memory-map, here are a few rules | |||||||||||||||||||
of thumb: | |||||||||||||||||||
- If you don't need random access to the graph edges, you can consider using | |||||||||||||||||||
the "offline" loading mode. The offsets won't be loaded which will save | |||||||||||||||||||
Done Inline Actionswhat use cases does this apply to? vlorentz: what use cases does this apply to? | |||||||||||||||||||
Done Inline ActionsAnything where you can just scan the graph sequentially, like if you want to compute stats on indegrees / outdegrees, count nodes, etc. seirl: Anything where you can just scan the graph sequentially, like if you want to compute stats on… | |||||||||||||||||||
dozens of gigabytes of RAM. | |||||||||||||||||||
- If you only need to query some specific nodes or run trivial traversals, | |||||||||||||||||||
memory-mapping the graph from a HDD should be a reasonable solution that | |||||||||||||||||||
doesn't take an inordinate amount of time. It might be bad for your disks, | |||||||||||||||||||
though. | |||||||||||||||||||
- If you are constrained in available RAM, memory-mapping the graph from an SSD | |||||||||||||||||||
offers reasonable performance for reasonably complex algorithms. | |||||||||||||||||||
- If you have a heavy workload (i.e. running a full traversal of the entire | |||||||||||||||||||
graph) and you can afford the RAM, direct loading will be orders of magnitude | |||||||||||||||||||
faster than all the above options. | |||||||||||||||||||
Sharing mapped data across processes | |||||||||||||||||||
------------------------------------ | |||||||||||||||||||
Often, multiple processes can be working on the same data (mappings or the | |||||||||||||||||||
graph itself), for instance when running different experiments on the same | |||||||||||||||||||
graph. This is problematic in terms of RAM usage when using direct memory | |||||||||||||||||||
loading, as the same data of potentially hundreds of gigabytes is loaded in | |||||||||||||||||||
memory twice. | |||||||||||||||||||
As we have seen, memory-mapping can be used to avoid storing redundant data in | |||||||||||||||||||
RAM, but comes at the cost of potentially slower I/O as the data is no longer | |||||||||||||||||||
guaranteed to be loaded in main memory and is reliant on kernel heuristics. | |||||||||||||||||||
To efficiently share data across two different compressed graph processes, | |||||||||||||||||||
another option is to copy graph data to a ``tmpfs`` not backed by a disk swap, | |||||||||||||||||||
which forces the kernel to load it entirely in RAM. Subsequent memory-mappings | |||||||||||||||||||
of the files stored in the tmpfs will simply map the data stored in RAM to | |||||||||||||||||||
virtual memory pages, and return a pointer to the in-memory structure. | |||||||||||||||||||
To do so, we create a directory in ``/dev/shm`` in which we **copy** all the | |||||||||||||||||||
files that we want to direct-load in RAM, and **symlink** all the others. Then, | |||||||||||||||||||
we load the graph using the memory-mapped loading mode, which makes it use the | |||||||||||||||||||
shared memory stored in the tmpfs under the hood. | |||||||||||||||||||
Here is a systemd service that can be used to perform this task automatically: | |||||||||||||||||||
.. code-block:: ini | |||||||||||||||||||
[Unit] | |||||||||||||||||||
Description=swh-graph memory sharing in tmpfs | |||||||||||||||||||
[Service] | |||||||||||||||||||
Type=oneshot | |||||||||||||||||||
RemainAfterExit=yes | |||||||||||||||||||
ExecStart=mkdir -p /dev/shm/swh-graph/default | |||||||||||||||||||
ExecStart=sh -c "ln -s /.../compressed/* /dev/shm/swh-graph/default" | |||||||||||||||||||
ExecStart=cp --remove-destination /.../compressed/graph.graph /dev/shm/swh-graph/default | |||||||||||||||||||
Done Inline Actionswhy sh? vlorentz: why `sh`? | |||||||||||||||||||
Done Inline ActionsBecause I need globbing seirl: Because I need globbing | |||||||||||||||||||
Done Inline Actions(Not here actually, fixing this.) seirl: (Not here actually, fixing this.) | |||||||||||||||||||
ExecStart=cp --remove-destination /.../compressed/graph-transposed.graph /dev/shm/swh-graph/default | |||||||||||||||||||
ExecStop=rm -rf /dev/shm/swh-graph/default | |||||||||||||||||||
[Install] | |||||||||||||||||||
WantedBy=multi-user.target |
I suggest adding a header for java types and performance suggestions.
Here's a sample approach I have adopted. I have no idea if it is good or bad or can be improved, but it does function. This code should iterate all of the labelnames with NUM_THREADS and count the number of ".cpp" labelnames that are longer than 50 characters, and return the longest filename found & print both (not something I need to do but a simple example with enough depth to be modified to fit many uses I think).
Similarly, LongArrayList saves at least 8 bytes per entry over ArrayList<Long> and Long2LongOpenHashMap saves at least 16 bytes over every entry over HashMap<Long, Long>.
Those are the tips I have that I think would be good to add somewhere on this page.