
automate handling of hanging/dead/stuck loaders
Started, Work in Progress, High, Public

Description

When they get OOM-killed because their cgroup runs out of memory, the git loaders have a tendency not to come back.

Unfortunately, several different symptoms appear when these workers stop processing tasks:

  • some workers still respond to celery inspect active, but have a task that never goes away
  • some workers stop responding completely, even to celery pings.

I'm not sure whether the workers still answer pings when one of their processes is stuck, but I'll probably start with that, as it's the most generic way we can monitor these processes externally.

We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
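
As a rough sketch (unit drop-in path and timeout are illustrative, and the keep-alive still needs to be wired up), a drop-in like the following would arm the watchdog; the ping job would then have to send WATCHDOG=1 on the notification socket (e.g. via systemd-notify, which with NotifyAccess=all must run inside the service's cgroup) before the timeout expires:

# /etc/systemd/system/swh-worker@.service.d/watchdog.conf (hypothetical)
[Service]
# restart the service if no WATCHDOG=1 keep-alive is received within this window
WatchdogSec=10min
# accept keep-alives from any process in the service's cgroup, not just the main PID
NotifyAccess=all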

Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
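
For reference, a minimal sketch of what that could look like (the drop-in path and the Restart= setting are illustrative, not the current unit configuration):

# /etc/systemd/system/swh-worker@.service.d/oom.conf (hypothetical)
[Service]
# when any process of the service is killed by the kernel OOM killer, take down the whole service...
OOMPolicy=kill
# ...so the restart policy can bring it back up cleanly
Restart=on-failure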

Event Timeline

olasd triaged this task as High priority. Mar 24 2020, 5:54 PM
olasd created this task.

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

More generally, if we do this to all workers, that'd also cleanup the systemd service allocated temporary folder some workers use. Which could be a good thing as well.

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
>
> More generally, if we do this to all workers, that'd also cleanup the systemd service allocated temporary folder some workers use. Which could be a good thing as well.

AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

To find out which loaders are currently hung, I do the following:

Set SWH_CONFIG_FILENAME to a file containing proper credentials for rabbitmq:

celery:
  task_broker: amqp://<login>:<pass>@rabbitmq:5672/%2f
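
For instance (the path is illustrative), before starting a Python shell:

export SWH_CONFIG_FILENAME=/etc/softwareheritage/inspect.yml

Then, in that Python shell: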
import time
from ast import literal_eval

from swh.scheduler.celery_backend.config import app

# celery node names of the sixteen git loader workers
destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]

inspect = app.control.inspect(destination=destination, timeout=3)

# for every worker that answered, keep only the tasks running for more than an hour
long_running_tasks_by_worker = {
    w: [(task["worker_pid"], literal_eval(task["kwargs"])["url"], time.time() - task["time_start"])
        for task in tasks
        if time.time() - task["time_start"] > 3600]
    for w, tasks in sorted(inspect.active().items())
}

# workers that didn't answer the remote control at all; rsplit extracts the hostname for clush
dead_workers = ','.join(w.rsplit('.', 1)[-1] for w in sorted(set(destination) - set(long_running_tasks_by_worker)))

This lists all workers that reply to the celery remote control within 3 seconds (timeout=3 in the inspect call), and, for those workers, the tasks that have been running for more than an hour (> 3600 in the list comprehension).

The dead_workers are the workers not answering the celery remote control at all; they probably got OOM-killed and stopped working.

I then inspect the long-running tasks: most tasks that have been running for more than a few hours are just stuck, so these workers get added to the dead workers list as well.

I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders. Doing a "proper" restart won't work when some of the worker processes are hung, so we have to kill them hard.

zack added a subscriber: zack. Apr 21 2020, 2:30 PM
In T2335#43554, @olasd wrote:

> I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders

As I understand it, what you're doing here is hard-killing all processes of the given service. Hence, pivoting back to the original proposals, this:

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

looks like the most sensible one to try next?
It should be easier to try out than the other options, and it will also let us verify that these hangs only happen due to OOM killing.

My 0.02 €

Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

> AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

Hum, I thought that was one of the main points of those...
I must be misremembering.

> Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

sounds fair ;)

olasd added a comment. Apr 29 2020, 2:42 PM

I've bumped the apt-preferences config to pull systemd and related packages from buster-backports.
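
Roughly along these lines (the exact pinning managed by puppet may differ; file path and package list are illustrative):

# /etc/apt/preferences.d/systemd-backports (hypothetical)
Package: systemd systemd-* libsystemd0 udev libudev1
Pin: release n=buster-backports
Pin-Priority: 990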

I'll look at upgrading/rebooting the workers once that's pulled in.

> I'll look at upgrading/rebooting the workers once that's pulled in.

jsyk, I've upgraded the staging nodes with those changes. They are fine after this. I did not reboot them though, I simply restarted some systemd services.

I no longer see the warning about OOMPolicy, so: win, I think ;)

Thanks.

zack changed the task status from Open to Work in Progress. May 20 2020, 9:42 AM
zack renamed this task from Automate handling hanging or dead loaders to automate handling of hanging/dead/stuck loaders. Jun 8 2020, 2:23 PM
olasd added a comment. Jun 22 2020, 7:34 PM

I've deployed the ping, kill and restart bandaid referenced in puppet.
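
The band-aid itself lives in puppet; as a rough, hypothetical sketch of its logic (the worker list, timeout and hard-restart command follow the ones used earlier in this task, the actual deployed script may differ):

import subprocess

from swh.scheduler.celery_backend.config import app

destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]

# ping() returns a list of {worker_name: reply} mappings for the workers that answered
replies = app.control.ping(destination=destination, timeout=3)
alive = {name for reply in replies for name in reply}

for worker in sorted(set(destination) - alive):
    host = worker.rsplit(".", 1)[-1]
    # a graceful restart hangs while a child process is stuck, so hard-kill the whole
    # service; systemd (or a subsequent start) then brings it back up
    subprocess.run(
        ["clush", "-w", host,
         "systemctl kill --kill-who all --signal 9 swh-worker@loader_git"],
        check=False,
    )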

I've also ended up backporting recent versions of celery, kombu and billiard.

We'll see whether we end up with lots of cronspam or not.