
Make swh-scheduler's listener resilient to failure
Closed, Migrated

Description

Infrastructure issues occasionally hit us. For example:

  • running out of disk space on the VM hosting our dbs (P204)
  • failing to connect to the rabbitmq queues (P205)

These issues affected the listener, which failed to do its job (flushing the states of the queues' tasks into the scheduler db).
We should investigate and make it more resilient (if that is even possible).
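
One possible direction, sketched here under assumptions rather than taken from the actual listener code, is to wrap the listener's main loop in a retry with exponential backoff, so that a transient db or rabbitmq outage delays the listener instead of killing it. The names run_with_backoff and run_once are hypothetical, and catching OSError is a stand-in: the real listener would need to catch whatever exceptions its broker and database clients actually raise.

```python
import time
from typing import Callable, Optional


def run_with_backoff(
    run_once: Callable[[], None],
    max_attempts: Optional[int] = None,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call run_once, retrying with exponential backoff on I/O failures.

    run_once is a hypothetical callable standing in for the listener's
    main loop. OSError (which includes ConnectionError) is an assumed
    placeholder for the errors raised on broker or database outages.
    """
    attempt = 0
    while True:
        try:
            run_once()
            return  # run_once finished normally, nothing to retry
        except OSError:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and let the supervisor (systemd) handle it
            # Delay doubles on each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

With the defaults, the wait grows as 1s, 2s, 4s, and so on up to 60s, and the loop retries indefinitely; passing max_attempts bounds the number of tries before the error is re-raised.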

Note:
The systemd service file already sets the Restart=always policy.
Still, this did not prevent systemd from giving up and leaving the service in a failed state.
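
One plausible explanation, not confirmed by the pastes referenced above, is systemd's start rate limiting: even with Restart=always, systemd stops restarting a unit that is started more than StartLimitBurst= times within StartLimitIntervalSec= (by default, 5 times within 10 seconds). A unit fragment along the following lines might avoid that; the values are illustrative assumptions, not the project's actual configuration.

```ini
[Unit]
# Assumption: disable start rate limiting so systemd never gives up restarting.
StartLimitIntervalSec=0

[Service]
Restart=always
# Illustrative value: pause between restarts instead of the 100ms default,
# giving the db or rabbitmq some time to come back.
RestartSec=10
```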