Description

Infrastructure issues occasionally hit us. For example:

problem running out of disk space on the VM hosting our dbs (P204)
problem connecting to the rabbitmq queues (P205)

These issues affected the listener, which failed to do its job (flushing the task states from the queues into the scheduler db).
We should investigate and make it more resilient (if that is even possible).
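
One possible direction, sketched here under assumptions: retry a failing operation inside the process with an exponential backoff, instead of crashing on the first database or rabbitmq error. The names `run_resilient` and `step` are hypothetical and do not come from swh-scheduler; the exception types the real listener would need to catch are not known from this task, so `OSError` (which covers Python's built-in connection errors) stands in for them.

```python
import time


def run_resilient(step, base_delay=1.0, max_delay=60.0, max_failures=None,
                  sleep=time.sleep):
    """Call step() until it succeeds, backing off after each failure.

    `step` is a hypothetical callable standing in for one unit of listener
    work. Returns the number of failures seen before the first success.
    """
    failures = 0
    delay = base_delay
    while True:
        try:
            step()
            return failures
        except OSError as exc:  # assumption: transient errors surface as OSError
            failures += 1
            if max_failures is not None and failures >= max_failures:
                raise  # give up and let the service manager take over
            print(f"step failed ({exc!r}); retrying in {delay}s")
            sleep(delay)
            # Double the wait after each failure, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

Capping the delay keeps recovery reasonably fast once the database or broker is back, while `max_failures` leaves the final decision to systemd if the outage persists.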

Note:
The systemd service file already sets the Restart=always policy.
Still, this did not prevent systemd from giving up and leaving the service in a failed state.

Revisions and Commits

rSPSITE puppet-swh-site
	rSPSITE69d83abf6b06 Always set RestartSec when setting a service to Restart=always
rSPPROF puppet-swh-profile
	rSPPROF69d83abf6b06 Always set RestartSec when setting a service to Restart=always

Related Objects

Mentioned In:
	rDSCH4b918afaade8: Fix issue when updating task-type without any retry delay defined
Mentioned Here:
	P231 scheduler-listener: Issue during update, service repeats trying starting and failing until giving up
	P204 listener failure on running out of disk in prado (db host vm)
	P205 listener failure on connection issues with the rabbitmq instance

Event Timeline

ardumont created this task. Dec 15 2017, 9:43 AM

ardumont updated the task description. Dec 15 2017, 10:06 AM

A new one: an issue during an update, where the service keeps trying to restart for a long time before finally giving up (P231).

ardumont mentioned this in rDSCH4b918afaade8: Fix issue when updating task-type without any retry delay defined. Mar 8 2018, 11:28 AM

olasd closed this task as Resolved by committing rSPPROF69d83abf6b06: Always set RestartSec when setting a service to Restart=always. Mar 8 2018, 4:13 PM

olasd added a commit: rSPPROF69d83abf6b06: Always set RestartSec when setting a service to Restart=always.

systemd will stop restarting a service if it does so too often (n times in p seconds); RestartSec=10 spaces out restarts so that this behavior doesn't trigger
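
The effect of RestartSec on that limit can be illustrated with a small simulation. This is a simplified model, not systemd's actual implementation: it assumes a service that crashes immediately after every start, and it uses systemd's documented defaults (5 starts per 10 seconds, RestartSec of 100ms), which may differ from the values configured on the affected hosts. The function name `gives_up` is made up for the example.

```python
def gives_up(restart_sec, burst=5, interval=10.0, attempts=100):
    """Return True if the start rate limit would trip for a service that
    crashes immediately after every start.

    Simplified model: a start is refused once `burst` starts have already
    happened within the last `interval` seconds.
    """
    starts = []
    now = 0.0
    for _ in range(attempts):
        # Keep only the starts that fall inside the current window.
        starts = [t for t in starts if now - t < interval]
        if len(starts) >= burst:
            return True  # rate limit hit: the unit is left in a failed state
        starts.append(now)
        now += restart_sec  # service dies at once, next start after RestartSec
    return False


print(gives_up(restart_sec=0.1))   # default RestartSec (100ms)
print(gives_up(restart_sec=10.0))  # the RestartSec=10 from the commit
```

With a 100ms delay, five starts fit inside the 10 second window almost at once and the sixth is refused; with a 10 second delay, each start has left the window by the time the next one happens, so the limit is never reached in this model.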

olasd added a commit: rSPSITE69d83abf6b06: Always set RestartSec when setting a service to Restart=always. Jun 15 2018, 2:30 PM

This task has been migrated to GitLab.

Make swh-scheduler's listener resilient to failure
Closed, Migrated
