
Allow systemd service status monitoring
Closed, Migrated

Description

Work started on adding an Icinga plugin to check systemd service status.
This task tracks the continuation of that work.

Summary of the work so far:

  • identified the check_systemd.py plugin (not packaged for buster) [1]
  • packaged it and uploaded it to the swh Debian repository [2] [3]

What remains is to actually install the Puppet definitions; a hedged sketch of that step follows the references below.

[1] T3495#68911
[2] T3495#68923
[3] T3495#68927
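
To give a concrete shape to that remaining step, here is a minimal Puppet sketch. The package and profile names are assumptions for illustration, not the actual swh-site manifests:

```puppet
# Hedged sketch: "monitoring-plugins-systemd" is a hypothetical name for
# the package uploaded in [2] [3]; adjust to the real one in the swh
# Debian repository.
class profile::icinga2::check_systemd {
  # On Debian, Nagios-compatible plugins are expected under
  # /usr/lib/nagios/plugins, which is where Icinga looks them up.
  package { 'monitoring-plugins-systemd':
    ensure => installed,
  }
}
```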

Event Timeline

ardumont triaged this task as Normal priority. Aug 24 2021, 9:42 AM
ardumont created this task.
ardumont changed the task status from Open to Work in Progress. Aug 24 2021, 10:34 AM

The plugin got deployed, and more alerts are now being raised. Some are
legitimate; others stem from errors in the deployment.

On my way to fix those.
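
Two flavors of deployment error show up in the log below: hosts where Icinga reports
that the check command does not exist (the command definition had not reached them),
and hosts where the plugin binary itself is missing from /usr/lib/nagios/plugins (the
execvpe failures). For the first flavor, here is a minimal sketch of shipping the
command definition through Puppet, assuming an Icinga 2 setup that reads drop-ins
from /etc/icinga2/conf.d (the path is an assumption, not the actual swh-site layout):

```puppet
# Sketch only, not the actual swh-site manifest.
file { '/etc/icinga2/conf.d/check_systemd.conf':
  ensure  => file,
  content => @(CONF),
    object CheckCommand "check_systemd" {
      command = [ PluginDir + "/check_systemd" ]
    }
    | CONF
  notify  => Service['icinga2'],  # reload so the command becomes known
}
```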

12:56 <+swhbot> icinga PROBLEM: service check_systemd on storage01.euwest.azure.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:56 <+swhbot> icinga PROBLEM: service check_systemd on boatbucket.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on worker2.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on counters1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on search1.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:56 <+swhbot> icinga PROBLEM: service check_systemd on journal0.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - networking.service: failed, smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on jenkins-debian1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on pompidou.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on rp1.internal.admin.swh.network is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on giverny.softwareheritage.org is UNKNOWN: execvpe(/usr/lib/nagios/plugins/check_systemd) failed: No such file or directory
12:57 <+swhbot> icinga PROBLEM: service check_systemd on riverside.internal.softwareheritage.org is UNKNOWN: execvpe(/usr/lib/nagios/plugins/check_systemd) failed: No such file or directory
12:57 <+swhbot> icinga PROBLEM: service check_systemd on search0.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - cloud-init.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on bojimans.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on hypervisor3.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - ceph-mgr@hypervisor.service: failed, startup_time is 3060 (outside range 0:120)
12:57 <+swhbot> icinga PROBLEM: service check_systemd on webapp1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - cloud-init.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on uffizi.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on worker03.euwest.azure.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on vault.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - networking.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service disk /var/lib/vz on hypervisor3.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /var/lib/vz 5175 MB (13% inode=99%);
...

Among those, some illegitimate alerts were raised about SMART errors on VMs (smartd cannot monitor disks on virtual machines).

The Puppet manifests were adapted to remove those checks from those machines; a hedged sketch of the idea follows.
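
One way to express the exclusion, assuming it is keyed on the is_virtual fact and
passed through check_systemd's -e/--exclude option; the profile and parameter names
are illustrative, not the actual swh-site code:

```puppet
# Hedged sketch, not the real swh-site diff.
class profile::icinga2::check_systemd (
  # smartd cannot monitor disks on virtual machines, so skip its unit
  # there; bare-metal hosts keep the full check.
  Array[String] $excluded_units = $facts['is_virtual'] ? {
    true    => ['smartd.service'],
    default => [],
  },
) {
  # Turn the exclusions into repeated -e flags on the plugin invocation;
  # the resulting command line is then handed to the Icinga service
  # definition (omitted here).
  $exclude_flags = $excluded_units.map |$unit| { "-e ${unit}" }
  $command_line  = join(['/usr/lib/nagios/plugins/check_systemd'] + $exclude_flags, ' ')
}
```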

As a result, the alerts went back to green [1] (no alert), and only real alerts
about those services should pop up from now on.

This makes the overall maintenance less noisy \o/

[1]

...
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker1.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker3.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on counters0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker2.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on deposit.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga PROBLEM: service check_systemd on somerset.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on webapp.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on rp0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on search0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on vault.internal.staging.swh.network is OK: SYSTEMD OK - all
16:24 <+swhbot> icinga RECOVERY: service check_systemd on worker13.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:25 <+swhbot> icinga RECOVERY: service check_systemd on vangogh.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:28 <+swhbot> icinga PROBLEM: service check_systemd on kelvingrove.internal.softwareheritage.org is WARNING: SYSTEMD WARNING - startup_time is 63.41 (outside range 0:60)
16:28 <+swhbot> icinga RECOVERY: service check_systemd on webapp1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:28 <+swhbot> icinga RECOVERY: service check_systemd on bardo.internal.admin.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on storage01.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on search-esnode0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:29 <+ardumont> ack
16:29 <+swhbot> icinga RECOVERY: service check_systemd on jenkins-debian1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on search1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on boatbucket.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on worker17.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga PROBLEM: service check_systemd on giverny.softwareheritage.org is WARNING: SYSTEMD WARNING - startup_time is 81.46 (outside range 0:60)
16:29 <+swhbot> icinga RECOVERY: service check_systemd on rp1.internal.admin.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on objstorage0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on somerset.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on counters1.internal.softwareheritage.org is OK: SYSTEMD OK - all
ardumont claimed this task.