
Out of memory on granet
Closed, Migrated

Description

Granet has been running out of memory every night for the past couple of days.

When the OOM occurs, the graph backend is killed, interrupting the service (among other things):

Sep 08 01:18:43 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:14 +0000] "GET /graph/leaves/swh:1:cnt:aea17e58c32146ba8ab7cd6db067c8effb7a4161?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:43 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:14 +0000] "GET /graph/leaves/swh:1:cnt:5db7ca17bcd5bc303981444e5d24bb11d9bd9ca1?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:43 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:13 +0000] "GET /graph/leaves/swh:1:cnt:08b12086f6e478d0ab4523cc5468808185e0bec4?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:43 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:24 +0000] "GET /graph/leaves/swh:1:cnt:2fa8bdd4f1f4a75b116ae90ddc31f757325f8b61?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:49 granet systemd[1]: prometheus-node-exporter-ipmitool-sensor.service: Failed to fork: Cannot allocate memory
Sep 08 01:18:49 granet systemd[1]: prometheus-node-exporter-ipmitool-sensor.service: Failed to run 'start' task: Cannot allocate memory
Sep 08 01:18:49 granet systemd[1]: prometheus-node-exporter-ipmitool-sensor.service: Failed with result 'resources'.
Sep 08 01:18:50 granet systemd[1]: Failed to start Collect ipmitool sensor metrics for prometheus-node-exporter.
Sep 08 01:18:50 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:26 +0000] "GET /graph/leaves/swh:1:cnt:c196bd382501941f5fa8bddf1c9e097b75baa64a?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 4236 "-" "python-requests/2
Sep 08 01:18:51 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:30 +0000] "GET /graph/leaves/swh:1:cnt:bdd879ad9a0b41df6f0a9a6435b14567ecc57fa3?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 308 "-" "python-requests/2.
Sep 08 01:18:52 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:30 +0000] "GET /graph/leaves/swh:1:cnt:5c1d96699c8fa3fa3d75859eb90c2c5a9312b93c?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:52 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:30 +0000] "GET /graph/leaves/swh:1:cnt:5b827189f4245d9c485898ab81c69d57703b1658?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:18:55 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:30 +0000] "GET /graph/leaves/swh:1:cnt:1d4b62add77313ef18e87faa34776c3c71c3aba5?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 768 "-" "python-requests/2.
Sep 08 01:19:01 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:34 +0000] "GET /graph/leaves/swh:1:cnt:8cdca2dffa7f1a0ed1afabe81757bc6a3daf886b?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 308 "-" "python-requests/2.
Sep 08 01:19:02 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:34 +0000] "GET /graph/leaves/swh:1:cnt:97ecfd5b6eaf83587eac967a9aa4d0c24d7bf0c0?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:19:11 granet sshd[3958]: error: fork: Cannot allocate memory
Sep 08 01:19:41 granet sshd[3958]: error: fork: Cannot allocate memory
Sep 08 01:19:42 granet swh[3702779]: INFO:aiohttp.access:192.168.100.31 [08/Sep/2022:01:18:40 +0000] "GET /graph/leaves/swh:1:cnt:120dfcd453a1c1e1f4a7c19534f22e66a0de0402?direction=backward&resolve_origins=true&limit=1&max_edges=0 HTTP/1.1" 200 206 "-" "python-requests/2.
Sep 08 01:19:43 granet swh[3702779]: ERROR:root:Cannot write to closing transport
...
Sep 08 01:20:16 granet swh[3702779]:     await self._flush_buffer()
Sep 08 01:20:57 granet kernel: pool-1-thread-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Sep 08 01:20:57 granet kernel: pool-1-thread-1 cpuset=/ mems_allowed=0-1
Sep 08 01:20:57 granet kernel: CPU: 16 PID: 3740405 Comm: pool-1-thread-1 Tainted: P           OE     4.19.0-20-amd64 #1 Debian 4.19.235-1
Sep 08 01:20:57 granet kernel: Hardware name: Dell Inc. PowerEdge R740xd/014X06, BIOS 2.13.3 12/13/2021
Sep 08 01:20:57 granet kernel: Call Trace:
Sep 08 01:20:57 granet kernel:  dump_stack+0x66/0x81
Sep 08 01:20:57 granet kernel:  dump_header+0x6b/0x283
Sep 08 01:20:57 granet kernel:  oom_kill_process.cold.30+0xb/0x1cf
Sep 08 01:20:57 granet kernel:  ? oom_badness+0x23/0x140
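To make the sequence in the excerpt above easier to follow (fork failures first, then the kernel OOM killer), here is a minimal sketch that pulls only the memory-exhaustion lines out of a journal dump. It assumes the syslog-style prefix seen in the excerpt; the function name `memory_events`, the `kind` labels, and the `SAMPLE` list are hypothetical, introduced only for illustration.

```python
import re

# Assumed line shape, taken from the excerpt:
#   "Sep 08 01:20:57 granet kernel: <message>"
#   "Sep 08 01:19:11 granet sshd[3958]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "   # timestamp
    r"(?P<host>\S+) "                            # hostname
    r"(?P<unit>[^:\[\s]+)(?:\[\d+\])?: "         # process name, optional [pid]
    r"(?P<msg>.*)$"                              # rest of the line
)


def memory_events(lines):
    """Yield (timestamp, unit, kind) for lines that signal memory exhaustion."""
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a syslog-style line (e.g. "...")
        msg = m.group("msg")
        if "invoked oom-killer" in msg:
            kind = "oom-killer"      # hypothetical label
        elif "Cannot allocate memory" in msg:
            kind = "alloc-failure"   # hypothetical label
        else:
            continue
        yield m.group("ts"), m.group("unit"), kind


# A few lines copied from the excerpt, used as sample input.
SAMPLE = [
    "Sep 08 01:18:49 granet systemd[1]: prometheus-node-exporter-ipmitool-sensor.service: Failed to fork: Cannot allocate memory",
    "Sep 08 01:19:11 granet sshd[3958]: error: fork: Cannot allocate memory",
    "Sep 08 01:19:43 granet swh[3702779]: ERROR:root:Cannot write to closing transport",
    "Sep 08 01:20:57 granet kernel: pool-1-thread-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0",
]

if __name__ == "__main__":
    for ts, unit, kind in memory_events(SAMPLE):
        print(ts, unit, kind)
```

On the sample lines this lists the systemd and sshd fork failures followed by the kernel OOM-killer invocation, and skips the ordinary swh access and error lines.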

Event Timeline

vsellier triaged this task as High priority. Sep 8 2022, 9:38 AM
vsellier created this task.

@vlorentz I assigned the task to you because, if I'm not mistaken, you are running some experiments on granet.
I don't know what they are, but you should be gentler with the server.

I'll try reducing -Xmx again...

Nope, I can't lower it.

I guess I'll have to rewrite seirl's FindEarliestRevision.java to use the gRPC protocol instead of loading the graph itself.

vlorentz changed the task status from Open to Work in Progress. Sep 9 2022, 2:36 PM
vlorentz moved this task from Backlog to In progress on the Compressed graph service board.

In the end, I ran FindEarliestRevision.java on a different server, which worked nicely.