
Excessive memory usage on storage0.euwest.azure.internal.softwareheritage.org
Closed, ResolvedPublic

Description

Since it began serving a new webapp server, storage0 has been experiencing out-of-memory events.

Memory usage was relatively low until 2018-06-15:
http://munin.internal.softwareheritage.org/euwest.azure.internal.softwareheritage.org/storage0.euwest.azure.internal.softwareheritage.org/memory.html

Since then, various processes have reported being unable to allocate memory:

03:58 < swhbot> icinga PROBLEM: service journalbeat on storage0.euwest.azure.internal.softwareheritage.org is UNKNOWN: Fork failed with error code 12 (Cannot allocate memory)

and the kernel had to kill gunicorn processes:

[Mon Jun 18 11:16:29 2018] Out of memory: Kill process 55922 (gunicorn3) score 14 or sacrifice child
[Mon Jun 18 11:16:29 2018] Killed process 55922 (gunicorn3) total-vm:798648kB, anon-rss:468264kB, file-rss:0kB, shmem-rss:0kB
[Mon Jun 18 11:16:29 2018] oom_reaper: reaped process 55922 (gunicorn3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Event Timeline

ftigeot created this task.Jun 18 2018, 2:43 PM
ftigeot triaged this task as High priority.
olasd added a subscriber: olasd.Jun 18 2018, 8:07 PM

The new webapp server is a red herring: the only client hitting this frontend is icinga, and at 10 queries per minute it wouldn't exercise any memory leaks. The actual load comes from the vault workers.

However, the memory use increase indeed coincides with the deployment that happened on June 15.

Looking at the logs for that deployment shows the gunicorn config was bumped back up to 96 workers / 10k requests per worker spawn (from 48 workers / 1k requests per worker spawn). The lower setting had been introduced on May 13, just after the deployment of an updated swh.storage/swh.objstorage combo on May 12, probably for the same reason.
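For context, the two gunicorn settings at play can be expressed in a config file like the sketch below. This is a hypothetical illustration of the values mentioned above, not the actual Software Heritage deployment config; `max_requests_jitter` is an extra standard gunicorn knob included for completeness.

```python
# Hypothetical gunicorn config sketch (gunicorn.conf.py) showing the settings
# discussed above. With max_requests set, each worker process is recycled
# after serving that many requests, which bounds slow per-worker memory
# growth; raising it from 1k to 10k lets a leaky worker accumulate roughly
# ten times more memory before it is restarted.

workers = 48               # number of pre-forked worker processes
max_requests = 1000        # recycle a worker after this many requests
max_requests_jitter = 50   # stagger restarts so workers don't all recycle at once
```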

This really is another specific instance of T757.