
automate handling of hanging/dead/stuck loaders
Closed, Migrated

Description

When a process gets OOM-killed because the cgroup runs out of memory, the git loaders have a tendency not to come back.

Unfortunately, there are several symptoms when these workers stop processing tasks:

  • some workers still respond to celery inspect active, but keep showing a task that never goes away
  • some workers stop responding completely, even to celery pings.

I'm not sure whether the workers still answer pings when one of their processes is stuck, but I'll probably start there, because that's the most generic way we can monitor these processes externally.

We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
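
As a rough sketch, that could look like the following systemd drop-in (the 10-minute value and the drop-in path are made up; the catch is that the keep-alive has to be delivered as a WATCHDOG=1 notification from within the service's cgroup, e.g. via systemd-notify, which is what NotifyAccess=all allows):

# assumed path: /etc/systemd/system/swh-worker@.service.d/watchdog.conf
[Service]
# fail the unit if no WATCHDOG=1 notification arrives for 10 minutes
WatchdogSec=10min
# accept notifications from any process in the unit's cgroup, not just the main one
NotifyAccess=all
# restart automatically when the watchdog fires
Restart=on-watchdog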

Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
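
For reference, a sketch of what that drop-in could look like (the path and the exact Restart= value are assumptions; OOMPolicy= only exists in recent systemd, which becomes relevant further down in this task):

# assumed path: /etc/systemd/system/swh-worker@.service.d/oom.conf
[Service]
# when the kernel OOM killer picks off any process of this unit, kill the whole unit
OOMPolicy=kill
# and let systemd bring it back up
Restart=always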

Event Timeline

olasd triaged this task as High priority. Mar 24 2020, 5:54 PM
olasd created this task.

Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

More generally, if we do this to all workers, that'd also cleanup the systemd service allocated temporary folder some workers use. Which could be a good thing as well.

AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

To find out what loaders are currently hung, I do the following:

Set SWH_CONFIG_FILENAME to a file containing proper credentials for rabbitmq:

celery:
  task_broker: amqp://<login>:<pass>@rabbitmq:5672/%2f
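
For example (the path is only an illustration, not the real file location):

export SWH_CONFIG_FILENAME=~/inspect-workers.yml

Then, in a Python shell with that environment variable set: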
import time
from ast import literal_eval

from swh.scheduler.celery_backend.config import app

# Node names of the sixteen git loader workers to interrogate.
destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]

# Ask each worker for its currently active tasks, waiting at most 3 seconds.
inspect = app.control.inspect(destination=destination, timeout=3)

# For every worker that answered, keep the tasks running for more than an hour,
# as (pid, origin url, runtime in seconds) tuples.
long_running_tasks_by_worker = {
    w: [
        (task["worker_pid"], literal_eval(task["kwargs"])["url"], time.time() - task["time_start"])
        for task in tasks
        if time.time() - task["time_start"] > 3600
    ]
    for w, tasks in sorted(inspect.active().items())
}

# Workers that didn't answer at all, as a comma-separated list of short names
# (e.g. worker01) ready to be passed to clush -w.
dead_workers = ','.join(
    w.rsplit('.', 1)[-1] for w in sorted(set(destination) - set(long_running_tasks_by_worker))
)

This lists all workers that reply to the celery remote control within 3 seconds (timeout=3 in the inspect call) and, for those workers, the tasks that have been running for more than an hour (> 3600 in the list comprehension).

The dead_workers are the workers not answering the celery remote control; they probably got OOM-killed and stopped working.

I then inspect the long-running tasks. Most tasks that have been running for more than a few hours are just stuck, and those workers get added to the dead workers list.
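
For example, to eyeball them quickly, continuing from the snippet above (this print loop is purely illustrative, not part of the original procedure):

for worker, tasks in long_running_tasks_by_worker.items():
    for pid, url, runtime in tasks:
        print("%s: pid %d has spent %.1fh on %s" % (worker, pid, runtime / 3600, url))

print("not answering the remote control:", dead_workers or "(none)")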

I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders. Doing a "proper" restart won't work when some of the worker processes are hung, so we have to kill them hard.

In T2335#43554, @olasd wrote:

I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders

As I understand it, you're hard-killing all processes of the given service here. Hence, pivoting back to the original proposals, this:

Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.

looks like the most sensible one to try next?
It should be easier to try out than the other options, and it will also let us verify that these hangs only happen because of OOM kills.

My 0.02 €

Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.

Hmm, I thought that was one of the main interests of those...
I must be misremembering.

Looks like that OOMPolicy option doesn't exist in buster systemd. I'm very tempted to upgrade systemd to buster-backports...

sounds fair ;)

I've bumped the apt-preferences config to pull systemd and related packages from buster-backports.
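
For the record, the pinning boils down to an apt preferences stanza along these lines (the exact package list and priority live in puppet; this is only an illustration of the mechanism):

Package: systemd libsystemd0 libpam-systemd systemd-sysv udev
Pin: release a=buster-backports
Pin-Priority: 500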

I'll look at upgrading/rebooting the workers once that's pulled in.

I'll look at upgrading/rebooting the workers once that's pulled in.

jsyk, I've upgraded the staging nodes with those changes. They are fine after this. I did not reboot them, though, just restarted some systemd services.

I no longer see the warning about OOMPolicy, so that's a win, I think ;)

Thanks.

zack changed the task status from Open to Work in Progress. May 20 2020, 9:42 AM
zack renamed this task from Automate handling hanging or dead loaders to automate handling of hanging/dead/stuck loaders. Jun 8 2020, 2:23 PM

I've deployed the ping, kill and restart bandaid referenced in puppet.

I've also ended up backporting recent versions of celery, kombu and billiard.

We'll see whether we end up with lots of cronspam or not.

The deployed cron invocation was buggy (fixed via rSPSITE3317ea30)

What the current cron does is ping the worker (via celery) and restart it if it doesn't respond after a few attempts. This catches the case where the top-level celery process doesn't respond.
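
In rough pseudo-shell, the check amounts to something like this (not the actual puppet-managed script; the node name pattern, broker variable, attempt count and timeout are made up for illustration):

#!/bin/bash
# ping the local git loader's celery node a few times; hard-kill the unit if it never answers
nodename="celery@loader_git.$(hostname -s)"
for attempt in 1 2 3; do
    if celery --broker "$CELERY_BROKER_URL" inspect ping --destination "$nodename" --timeout 10 >/dev/null 2>&1; then
        exit 0  # the worker answered, nothing to do
    fi
    sleep 30
done
systemctl kill --kill-who all --signal 9 swh-worker@loader_git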

However, some workers are currently stuck after a MemoryError while the celery process itself keeps processing messages and responding to them, so the current cron doesn't help much there.

We need to add an actual activity check on workers, e.g. "restart if nothing was processed in the last three hours".

Something like journalctl -u swh-worker@loader_git --since '6 hours ago' -o json showing no output would be a fair sign that something has gone wrong.

if test `journalctl -u 'swh-worker@loader_git' -o json --since '3 hours ago' | wc -l` -eq 0; then systemctl kill --signal 9 --kill-who all swh-worker@loader_git; fi

was run on all workers.
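
Presumably via something like the following clush invocation (the node list is a placeholder, and the \$ escaping makes the command substitution happen on the remote side):

clush -w <worker_nodes> "if test \$(journalctl -u swh-worker@loader_git -o json --since '3 hours ago' | wc -l) -eq 0; then systemctl kill --signal 9 --kill-who all swh-worker@loader_git; fi"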