
Scrubber processes getting killed by OOM killer
Closed, Migrated

Description

On the production server, some scrubber processes are being killed by the OOM killer:

Jul 10 18:37:59 scrubber1 kernel: Out of memory: Kill process 3192 (swh) score 116 or sacrifice child
Jul 10 18:37:59 scrubber1 kernel: Killed process 3192 (swh) total-vm:297888kB, anon-rss:229668kB, file-rss:24kB, shmem-rss:0kB
Jul 10 18:37:59 scrubber1 kernel: oom_reaper: reaped process 3192 (swh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jul 10 18:37:59 scrubber1 swh[28310]: INFO:swh.scrubber.storage_checker:Processing revision range bd30e8 to bd30e9
Jul 10 18:37:59 scrubber1 swh[10665]: INFO:swh.scrubber.storage_checker:Processing revision range 48b4e6 to 48b4e7
Jul 10 18:37:59 scrubber1 systemd[1]: swh-scrubber-checker-postgres@directory-3.service: Main process exited, code=killed, status=9/KILL
 free -h
              total        used        free      shared  buff/cache   available
Mem:          978Mi       756Mi       109Mi       0.0Ki       112Mi        47Mi
Swap:         975Mi       962Mi        13Mi

Despite memory ballooning being allowed up to 4 GB, the server seems to stay at 1 GB.
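
A quick way to cross-check this is to compare what the hypervisor has configured with what the guest actually sees (a minimal sketch, assuming the node is a VM on a Proxmox hypervisor; <vmid> is a placeholder and the exact output depends on the setup):

# on the hypervisor: show the configured maximum memory and balloon target for the VM
qm config <vmid> | grep -E '^(memory|balloon)'
# inside the guest: check that the virtio balloon driver is loaded, then see what the kernel reports
lsmod | grep virtio_balloon
free -h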

Event Timeline

vsellier updated the task description.
ardumont triaged this task as Normal priority. Jul 11 2022, 10:50 AM

After upgrading packages and rebooting, the server's memory seems to have increased a bit, to something more sensible.
Let's see if the problem disappears altogether now.

ardumont changed the task status from Open to Work in Progress. Jul 11 2022, 2:21 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

Still happening:

[  +0.063819] Killed process 1216634 (swh) total-vm:262252kB, anon-rss:206288kB, file-rss:724kB, shmem-rss:0kB
[  +0.077492] oom_reaper: reaped process 1216634 (swh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Jul27 05:29] journalbeat invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  +0.000002] journalbeat cpuset=/ mems_allowed=0
[  +0.000023] CPU: 2 PID: 688 Comm: journalbeat Not tainted 4.19.0-21-amd64 #1 Debian 4.19.249-2
--
[  +0.000002] [1250623]   997 1250623     3349       70    65536        4             0 systemctl
[  +0.000001] Out of memory: Kill process 1199696 (swh) score 82 or sacrifice child
[  +0.063517] Killed process 1199696 (swh) total-vm:201224kB, anon-rss:47784kB, file-rss:0kB, shmem-rss:0kB
[  +0.066909] oom_reaper: reaped process 1199696 (swh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Jul27 06:12] kworker/0:0: page allocation failure: order:0, mode:0x6310ca(GFP_HIGHUSER_MOVABLE|__GFP_NORETRY|__GFP_NOMEMALLOC), nodemask=(null)
[  +0.000002] kworker/0:0 cpuset=/ mems_allowed=0
[  +0.000011] CPU: 0 PID: 1244099 Comm: kworker/0:0 Not tainted 4.19.0-21-amd64 #1 Debian 4.19.249-2
--
[  +0.000001] [1257717]   997 1257717     1369       35    49152        0             0 sudo
[  +0.000005] Out of memory: Kill process 1081929 (swh) score 136 or sacrifice child
[  +0.102631] Killed process 1081929 (swh) total-vm:303484kB, anon-rss:178744kB, file-rss:4kB, shmem-rss:0kB
[  +0.132669] oom_reaper: reaped process 1081929 (swh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Jul27 09:29] swh invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  +0.000002] swh cpuset=/ mems_allowed=0
[  +0.000010] CPU: 2 PID: 1209826 Comm: swh Not tainted 4.19.0-21-amd64 #1 Debian 4.19.249-2
--
[  +0.000001] [1264290]     0 1264290     3544       63    65536        0             0 check_journal
[  +0.000001] Out of memory: Kill process 1112652 (swh) score 70 or sacrifice child
[  +0.060327] Killed process 1112652 (swh) total-vm:195560kB, anon-rss:20828kB, file-rss:436kB, shmem-rss:0kB
[  +0.066241] oom_reaper: reaped process 1112652 (swh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Bumped the ballooning to 2048; terraform applied the change, which restarted the VM.
Hopefully that will be the end of this.
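
For reference, the equivalent change can also be made by hand on the hypervisor with the Proxmox CLI (a minimal sketch of what the terraform-managed change amounts to, assuming a Proxmox hypervisor; <vmid> is a placeholder):

# raise the balloon target for this VM to 2048 MiB
qm set <vmid> --balloon 2048
# confirm the new value in the VM configuration
qm config <vmid> | grep balloon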

I've ended up dropping the ballooning for that node,
as I've deployed twice as many services as before, in order to scrub somerset as well [1].

Closing this now. We can always reopen it if the problem persists.

[1] T4371

ardumont renamed this task from "scrubber process killed by OOM killer" to "Scrubber processes getting killed by OOM killer". Aug 4 2022, 3:54 PM
ardumont closed this task as Resolved.
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.