
git loaders are getting oom-killed repeatedly in prod
Closed, Migrated

Description

Even though we try to limit the size of the packfiles we're allowing to archive (to 4GiB), the git loaders are being repeatedly OOM-killed on the production VMs (which all have 12GiB of RAM and some swap space).

This seems to be blocking almost all archival of git repositories.

Event Timeline

olasd changed the task status from Open to Work in Progress. Feb 3 2021, 3:38 PM
olasd triaged this task as High priority.
olasd created this task.

Attempts at mitigating the issue:

The loader is getting OOM-killed during the initial fetch operation, which is supposed to be limited to 4 gigabytes.

However, when fetching from an HTTP remote (which covers 99.99+% of our git origins), dulwich fetches the packfile in full into a memory buffer, and only then hands a read handle on that buffer to our size-limiting do_pack function, so the whole pack is already in memory before our limit can apply. Memory overflows.

https://github.com/dulwich/dulwich/blob/4e70c1becb1254ca5e20cdd7087d83444cfb2227/dulwich/client.py#L1652

Note that, with this code in dulwich, we can't work around the issue by writing the packfile to a tempfile on disk either: it's going to get loaded into memory no matter what (at least once by dulwich in its BytesIO, and supposedly also once by us in our own BytesIO).
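For reference, the size limit is enforced on the loader's write path, i.e. in the callback dulwich feeds pack data into. A minimal sketch of that idea (the names SizeLimitedPackWriter, FetchLimitExceeded and the 100 MiB spool threshold are illustrative, not the loader's actual code):

```python
import tempfile

PACK_SIZE_LIMIT = 4 * 1024 * 1024 * 1024  # 4 GiB


class FetchLimitExceeded(Exception):
    """Raised when the fetched pack grows past the configured limit."""


class SizeLimitedPackWriter:
    """Sink for fetched pack data that aborts once the size limit is hit.

    With dulwich's git:// (TCP) transport, chunks arrive here as they are
    read from the network, so the limit trips early. With the HTTPS
    transport, dulwich has already buffered the whole response in memory
    before this writer sees a single byte, which is the problem above.
    """

    def __init__(self, limit=PACK_SIZE_LIMIT):
        self.limit = limit
        self.written = 0
        # Spill to disk past 100 MiB instead of keeping everything in RAM.
        self.buffer = tempfile.SpooledTemporaryFile(max_size=100 * 1024 * 1024)

    def write(self, data):
        self.written += len(data)
        if self.written > self.limit:
            raise FetchLimitExceeded(
                f"pack exceeded the size limit of {self.limit} bytes"
            )
        return self.buffer.write(data)
```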

My current workaround attempt is switching pack fetches from https://github.com/* to git://github.com/*, transparently, in the git loader; dulwich's git-over-TCP transport doesn't have to do the same "double-buffering" as the HTTPS transport, so it should allow us to fail earlier (hopefully without involving the OOM killer).
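Roughly what the transparent rewrite amounts to (a sketch; the function name and the exact matching logic are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_github_url(origin_url: str) -> str:
    """Rewrite https://github.com/* fetch URLs to git://github.com/* so
    that dulwich uses its TCP transport (which streams the pack to our
    writer) rather than the HTTPS transport (which buffers it in memory).
    """
    parts = urlsplit(origin_url)
    if parts.scheme == "https" and parts.netloc == "github.com":
        return urlunsplit(("git", parts.netloc, parts.path, parts.query, parts.fragment))
    return origin_url


# e.g. rewrite_github_url("https://github.com/dulwich/dulwich")
#      -> "git://github.com/dulwich/dulwich"
```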

It's not a good long-term fix: there's recurring background noise about deprecating git over TCP on large platforms because of its inherent insecurity (it doesn't authenticate the peer at all; see e.g. https://twitter.com/patricktoomey/status/1355202062334767105). We'll really want to improve our behavior for git over HTTPS in general.

After mulling this over with @zack, and looking at the starved worker logs for a while, I suspect we're also being bitten by our (early, early) choice of celery's acks_late, which only acknowledges a task once it's done: when a worker is OOM-killed, it never sends the acknowledgement to rabbitmq, which keeps redelivering the task.

After a while, there's a good chance that a lot of the tasks we're trying to process are actually retries of OOM-killed tasks.

Considering the number of external means we have to retry failed tasks (built into swh.scheduler, or within the overarching swh ingestion feedback loop), we can switch to early acknowledgements and let tasks be rescheduled externally, rather than through rabbitmq/celery's own retry mechanisms.
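Concretely, the change boils down to flipping celery's acknowledgement mode; a sketch under the assumption of a plain celery app config (the app name is illustrative, and in practice this is a config or task-decorator change in the swh worker setup):

```python
from celery import Celery

app = Celery("swh.workers")  # illustrative app name

# Early acknowledgement: the broker considers the task delivered as soon as
# a worker picks it up. If that worker is then OOM-killed, rabbitmq does not
# redeliver the task; retrying is left to swh.scheduler and the ingestion
# feedback loop instead of celery's own redelivery.
app.conf.task_acks_late = False
```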