
automate handling of hanging/dead/stuck loaders
Started, Work in Progress, High, Public

Description

When they get OOM-killed because their cgroup runs out of memory, the git loaders have a tendency not to come back.

Unfortunately, several different symptoms appear when these workers stop processing tasks:

  • some workers still respond to celery inspect active, but have a task that never goes away
  • some workers stop responding completely, even to celery pings.

I'm not sure whether the workers still answer pings when one of their processes is stuck, but I'll probably start with that, as it's the most generic way we can monitor these processes externally.

We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
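
As a rough sketch (unit drop-in path and timeout are illustrative, and the keep-alive still needs to be wired up), a drop-in like the following would arm the watchdog; the ping job would then have to send WATCHDOG=1 on the notification socket (e.g. via systemd-notify, which with NotifyAccess=all must run inside the service's cgroup) before the timeout expires:

# /etc/systemd/system/swh-worker@.service.d/watchdog.conf (hypothetical)
[Service]
# restart the service if no WATCHDOG=1 keep-alive is received within this window
WatchdogSec=10min
# accept keep-alives from any process in the service's cgroup, not just the main PID
NotifyAccess=all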

Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
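
For reference, a minimal sketch of what that could look like (the drop-in path and the Restart= setting are illustrative, not the current unit configuration):

# /etc/systemd/system/swh-worker@.service.d/oom.conf (hypothetical)
[Service]
# when any process of the service is killed by the kernel OOM killer, take down the whole service...
OOMPolicy=kill
# ...so the restart policy can bring it back up cleanly
Restart=on-failure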

Event Timeline

olasd triaged this task as High priority. Mar 24 2020, 5:54 PM
olasd created this task.

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

More generally, if we do this to all workers, that'd also cleanup the systemd service allocated temporary folder some workers use. Which could be a good thing as well.

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
>
> More generally, if we do this to all workers, that'd also cleanup the systemd service allocated temporary folder some workers use. Which could be a good thing as well.

AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

To find out which loaders are currently hung, I do the following:

Set SWH_CONFIG_FILENAME to a file containing proper credentials for rabbitmq:

celery:
  task_broker: amqp://<login>:<pass>@rabbitmq:5672/%2f
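
For instance (the path is illustrative), before starting a Python shell:

export SWH_CONFIG_FILENAME=/etc/softwareheritage/inspect.yml

Then, in that Python shell: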
import time
from ast import literal_eval

from swh.scheduler.celery_backend.config import app

# celery node names of the sixteen git loader workers
destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]

inspect = app.control.inspect(destination=destination, timeout=3)

# for every worker that answered, keep only the tasks running for more than an hour
long_running_tasks_by_worker = {
    w: [(task["worker_pid"], literal_eval(task["kwargs"])["url"], time.time() - task["time_start"])
        for task in tasks
        if time.time() - task["time_start"] > 3600]
    for w, tasks in sorted(inspect.active().items())
}

# workers that didn't answer the remote control at all; rsplit extracts the hostname for clush
dead_workers = ','.join(w.rsplit('.', 1)[-1] for w in sorted(set(destination) - set(long_running_tasks_by_worker)))

This lists all workers that reply to the celery remote control within 3 seconds (timeout=3 in the inspect call), and, for those workers, the tasks that have been running for more than an hour (> 3600 in the list comprehension).

The dead_workers are the workers not answering the celery remote control at all; they probably got OOM-killed and stopped working.

I then inspect the long-running tasks: most tasks that have been running for more than a few hours are just stuck, so these workers get added to the dead workers list as well.

I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders. Doing a "proper" restart won't work when some of the worker processes are hung, so we have to kill them hard.

zack added a subscriber: zack. Apr 21 2020, 2:30 PM
In T2335#43554, @olasd wrote:

> I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders

As I understand it, what you're doing here is hard-killing all processes of the given service. Hence, pivoting back to the original proposals, this:

> Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

looks like the most sensible one to try next?
It should be easier to try out than the other options, and it will also let us verify that these hangs only happen due to OOM killing.

My 0.02 €

Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

> AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

Hum, I thought that was one of the main points of those...
I must be misremembering.

> Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

sounds fair ;)

olasd added a comment. Apr 29 2020, 2:42 PM

I've bumped the apt-preferences config to pull systemd and related packages from buster-backports.
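
Roughly along these lines (the exact pinning managed by puppet may differ; file path and package list are illustrative):

# /etc/apt/preferences.d/systemd-backports (hypothetical)
Package: systemd systemd-* libsystemd0 udev libudev1
Pin: release n=buster-backports
Pin-Priority: 990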

I'll look at upgrading/rebooting the workers once that's pulled in.

> I'll look at upgrading/rebooting the workers once that's pulled in.

jsyk, I've upgraded the staging nodes with those changes. They are fine after this. I did not reboot them though, I simply restarted some systemd services.

I no longer see the warning about OOMPolicy, so: win, I think ;)

Thanks.

zack changed the task status from Open to Work in Progress. May 20 2020, 9:42 AM
zack renamed this task from Automate handling hanging or dead loaders to automate handling of hanging/dead/stuck loaders. Jun 8 2020, 2:23 PM
olasd added a comment. Jun 22 2020, 7:34 PM

I've deployed the ping, kill and restart bandaid referenced in puppet.
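
The band-aid itself lives in puppet; as a rough, hypothetical sketch of its logic (the worker list, timeout and hard-restart command follow the ones used earlier in this task, the actual deployed script may differ):

import subprocess

from swh.scheduler.celery_backend.config import app

destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]

# ping() returns a list of {worker_name: reply} mappings for the workers that answered
replies = app.control.ping(destination=destination, timeout=3)
alive = {name for reply in replies for name in reply}

for worker in sorted(set(destination) - alive):
    host = worker.rsplit(".", 1)[-1]
    # a graceful restart hangs while a child process is stuck, so hard-kill the whole
    # service; systemd (or a subsequent start) then brings it back up
    subprocess.run(
        ["clush", "-w", host,
         "systemctl kill --kill-who all --signal 9 swh-worker@loader_git"],
        check=False,
    )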

I've also ended up backporting recent versions of celery, kombu and billiard.

We'll see whether we end up with lots of cronspam or not.