
git loaders are getting oom-killed repeatedly in prod
Closed, Migrated

Description

Even though we try to limit the size of the packfiles we're allowing to archive (to 4GiB), the git loaders are being repeatedly OOM-killed on the production VMs (which all have 12GiB of RAM and some swap space).

This seems to be blocking almost all archival of git repositories.

Event Timeline

olasd changed the task status from Open to Work in Progress. Feb 3 2021, 3:38 PM
olasd triaged this task as High priority.
olasd created this task.

Attempts at mitigating the issue:

The loader is getting OOM-killed during the initial fetch operation, which is supposed to be limited to 4 gigabytes.

However, when fetching from an HTTP remote (which covers 99.99+% of our git origins), dulwich fetches the packfile in full into a memory buffer, and only then hands a read handle on that buffer to our size-limiting do_pack function, so the whole pack is already in memory before our limit can apply. Memory overflows.

https://github.com/dulwich/dulwich/blob/4e70c1becb1254ca5e20cdd7087d83444cfb2227/dulwich/client.py#L1652

Note that, with this code in dulwich, we can't work around the issue by writing the packfile to a tempfile on disk either: it's going to get loaded into memory no matter what (at least once by dulwich in its BytesIO, and supposedly also once by us in our own BytesIO).
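For reference, the size limit is enforced on the loader's write path, i.e. in the callback dulwich feeds pack data into. A minimal sketch of that idea (the names SizeLimitedPackWriter, FetchLimitExceeded and the 100 MiB spool threshold are illustrative, not the loader's actual code):

```python
import tempfile

PACK_SIZE_LIMIT = 4 * 1024 * 1024 * 1024  # 4 GiB


class FetchLimitExceeded(Exception):
    """Raised when the fetched pack grows past the configured limit."""


class SizeLimitedPackWriter:
    """Sink for fetched pack data that aborts once the size limit is hit.

    With dulwich's git:// (TCP) transport, chunks arrive here as they are
    read from the network, so the limit trips early. With the HTTPS
    transport, dulwich has already buffered the whole response in memory
    before this writer sees a single byte, which is the problem above.
    """

    def __init__(self, limit=PACK_SIZE_LIMIT):
        self.limit = limit
        self.written = 0
        # Spill to disk past 100 MiB instead of keeping everything in RAM.
        self.buffer = tempfile.SpooledTemporaryFile(max_size=100 * 1024 * 1024)

    def write(self, data):
        self.written += len(data)
        if self.written > self.limit:
            raise FetchLimitExceeded(
                f"pack exceeded the size limit of {self.limit} bytes"
            )
        return self.buffer.write(data)
```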

My current workaround attempt is switching pack fetches from https://github.com/* to git://github.com/*, transparently, in the git loader; dulwich's git-over-TCP transport doesn't have to do the same "double-buffering" as the HTTPS transport, so it should allow us to fail earlier (hopefully without involving the OOM killer).
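Roughly what the transparent rewrite amounts to (a sketch; the function name and the exact matching logic are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_github_url(origin_url: str) -> str:
    """Rewrite https://github.com/* fetch URLs to git://github.com/* so
    that dulwich uses its TCP transport (which streams the pack to our
    writer) rather than the HTTPS transport (which buffers it in memory).
    """
    parts = urlsplit(origin_url)
    if parts.scheme == "https" and parts.netloc == "github.com":
        return urlunsplit(("git", parts.netloc, parts.path, parts.query, parts.fragment))
    return origin_url


# e.g. rewrite_github_url("https://github.com/dulwich/dulwich")
#      -> "git://github.com/dulwich/dulwich"
```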

It's not a good long-term fix: there's recurring background noise about deprecating git over TCP on large platforms because of its inherent insecurity (it doesn't authenticate the peer at all; see e.g. https://twitter.com/patricktoomey/status/1355202062334767105). We'll really want to improve our behavior for git over HTTPS in general.

After mulling this over with @zack, and looking at the starved worker logs for a while, I suspect we're also being bitten by our (early, early) choice of celery's acks_late, which only acknowledges a task once it's done: when a worker is OOM-killed, it never sends the acknowledgement to rabbitmq, which keeps redelivering the task.

After a while, there's a good chance that a lot of the tasks we're trying to process are actually retries of OOM-killed tasks.

Considering the number of external means we have to retry failed tasks (built into swh.scheduler, or within the overarching swh ingestion feedback loop), we can switch to early acknowledgements and let tasks be rescheduled externally, rather than through rabbitmq/celery's own retry mechanisms.
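Concretely, the change boils down to flipping celery's acknowledgement mode; a sketch under the assumption of a plain celery app config (the app name is illustrative, and in practice this is a config or task-decorator change in the swh worker setup):

```python
from celery import Celery

app = Celery("swh.workers")  # illustrative app name

# Early acknowledgement: the broker considers the task delivered as soon as
# a worker picks it up. If that worker is then OOM-killed, rabbitmq does not
# redeliver the task; retrying is left to swh.scheduler and the ingestion
# feedback loop instead of celery's own redelivery.
app.conf.task_acks_late = False
```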