Page MenuHomeSoftware Heritage

swh-graph: loading maps fail when available memory is too low: Cannot allocate memory
Closed, MigratedEdits Locked

Description

As per title, this happens on granet when doing systemctl status swhgraphdev.service (after having successfully started swhgraphshm) here's a log:

lug 15 15:12:12 granet systemd[1]: Started swh graph.
lug 15 15:12:12 granet swh[1535223]: INFO:root:using swh-graph JAR: /opt/swhgraph_venv/share/swh-graph/swh-graph-0.2.7.jar
lug 15 15:12:13 granet swh[1535234]: Loading graph /dev/shm/swh-graph/default/graph ...
lug 15 15:14:47 granet swh[1535234]: Graph loaded.
lug 15 15:14:47 granet swh[1535223]: Traceback (most recent call last):
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/bin/swh", line 10, in <module>
lug 15 15:14:47 granet swh[1535223]:     sys.exit(main())
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 97, in main
lug 15 15:14:47 granet swh[1535223]:     return swh(auto_envvar_prefix='SWH')
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 764, in __call__
lug 15 15:14:47 granet swh[1535223]:     return self.main(*args, **kwargs)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 717, in main
lug 15 15:14:47 granet swh[1535223]:     rv = self.invoke(ctx)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
lug 15 15:14:47 granet swh[1535223]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
lug 15 15:14:47 granet swh[1535223]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 956, in invoke
lug 15 15:14:47 granet swh[1535223]:     return ctx.invoke(self.callback, **ctx.params)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/core.py", line 555, in invoke
lug 15 15:14:47 granet swh[1535223]:     return callback(*args, **kwargs)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
lug 15 15:14:47 granet swh[1535223]:     return f(get_current_context(), *args, **kwargs)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/cli.py", line 284, in serve
lug 15 15:14:47 granet swh[1535223]:     with backend:
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 57, in __enter__
lug 15 15:14:47 granet swh[1535223]:     self.pid2node = PidToNodeMap(self.graph_path + "." + PID2NODE_EXT)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/pid.py", line 192, in __init__
lug 15 15:14:47 granet swh[1535223]:     super().__init__(self.RECORD_SIZE, fname, mode=mode, length=length)
lug 15 15:14:47 granet swh[1535223]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/pid.py", line 129, in __init__
lug 15 15:14:47 granet swh[1535223]:     self.fd, self.size, flags=MAP_SHARED if writable_map else MAP_PRIVATE
lug 15 15:14:47 granet swh[1535223]: OSError: [Errno 12] Cannot allocate memory
lug 15 15:14:47 granet systemd[1]: swhgraphdev.service: Main process exited, code=exited, status=1/FAILURE

I think it's related to the shm trick.

This is stopping starting the graph service on the newly exported graph.

Event Timeline

zack triaged this task as High priority.Jul 15 2020, 5:19 PM
zack created this task.

I think it's related to the shm trick.

Confirmed. With the swhgraphshm service not running the swh-graph service starts just fine:

$ /opt/swhgraph_venv/bin/swh graph rpc-serve -g /srv/softwareheritage/ssd/graph/2020-05-20/compressed/graph
INFO:root:using swh-graph JAR: /opt/swhgraph_venv/share/swh-graph/swh-graph-0.2.7.jar
Loading graph /srv/softwareheritage/ssd/graph/2020-05-20/compressed/graph ...
Graph loaded.
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
zack renamed this task from swh-graph: memory mapping swhid<->node maps fail: Cannot allocate memory to swh-graph: loading maps fail when swhgraphshm is running: Cannot allocate memory.Jul 15 2020, 5:49 PM

The ZFS ARC (zfs's page cache) is set to grow without bounds.

At the point @seirl pointed me at this issue, it was using around 300GB of memory (from the output of arc_summary).

The ARC should shrink when there's memory pressure, but the mmap of the full graph in shm seems to overflow the kernel's overcommit allowance.

Turns out mmap with MAP_PRIVATE does a full allocation even if you're only reading pages.

For future reference, here are the commands I used to force-shrink the ARC:

# allocate 64 GB max to the arc
sudo sh -c 'echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max'

# force-flush all caches
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Just to be clear, the problem here wasn't directly linked to swhgraphshm but simply to the amount of available memory, because the MAP_PRIVATE flag tried to reserve all that memory to be able to perform copy on write. Using MAP_SHARED + PROT_READ avoids having this memory reservation and fixes the issue. swhgraphshm was just a random process taking a lot of the available ram, not specifically the reason why it failed.

seirl renamed this task from swh-graph: loading maps fail when swhgraphshm is running: Cannot allocate memory to swh-graph: loading maps fail when available memory is too low: Cannot allocate memory.Jul 23 2020, 4:06 PM