
Allow systemd service status monitoring
Closed, Migrated

Description

Work started on adding an Icinga plugin to check systemd service status.
This task tracks the continuation of that work.

Summary of the work so far:

  • identified the check_systemd.py plugin (not packaged for buster) [1]
  • packaged it and uploaded it to the swh Debian repository [2] [3]

What remains is to actually install the Puppet definitions; a hedged sketch of that step follows the references below.

[1] T3495#68911
[2] T3495#68923
[3] T3495#68927
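
To give a concrete shape to that remaining step, here is a minimal Puppet sketch. The package and profile names are assumptions for illustration, not the actual swh-site manifests:

```puppet
# Hedged sketch: "monitoring-plugins-systemd" is a hypothetical name for
# the package uploaded in [2] [3]; adjust to the real one in the swh
# Debian repository.
class profile::icinga2::check_systemd {
  # On Debian, Nagios-compatible plugins are expected under
  # /usr/lib/nagios/plugins, which is where Icinga looks them up.
  package { 'monitoring-plugins-systemd':
    ensure => installed,
  }
}
```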

Event Timeline

ardumont triaged this task as Normal priority. Aug 24 2021, 9:42 AM
ardumont created this task.
ardumont changed the task status from Open to Work in Progress. Aug 24 2021, 10:34 AM

The plugin got deployed, and more alerts are now being raised. Some are
legitimate; others stem from errors in the deployment.

On my way to fix those.
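
Two flavors of deployment error show up in the log below: hosts where Icinga reports
that the check command does not exist (the command definition had not reached them),
and hosts where the plugin binary itself is missing from /usr/lib/nagios/plugins (the
execvpe failures). For the first flavor, here is a minimal sketch of shipping the
command definition through Puppet, assuming an Icinga 2 setup that reads drop-ins
from /etc/icinga2/conf.d (the path is an assumption, not the actual swh-site layout):

```puppet
# Sketch only, not the actual swh-site manifest.
file { '/etc/icinga2/conf.d/check_systemd.conf':
  ensure  => file,
  content => @(CONF),
    object CheckCommand "check_systemd" {
      command = [ PluginDir + "/check_systemd" ]
    }
    | CONF
  notify  => Service['icinga2'],  # reload so the command becomes known
}
```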

12:56 <+swhbot> icinga PROBLEM: service check_systemd on storage01.euwest.azure.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:56 <+swhbot> icinga PROBLEM: service check_systemd on boatbucket.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on worker2.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on counters1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on search1.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:56 <+swhbot> icinga PROBLEM: service check_systemd on journal0.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - networking.service: failed, smartd.service: failed
12:56 <+swhbot> icinga PROBLEM: service check_systemd on jenkins-debian1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on pompidou.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on rp1.internal.admin.swh.network is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on giverny.softwareheritage.org is UNKNOWN: execvpe(/usr/lib/nagios/plugins/check_systemd) failed: No such file or directory
12:57 <+swhbot> icinga PROBLEM: service check_systemd on riverside.internal.softwareheritage.org is UNKNOWN: execvpe(/usr/lib/nagios/plugins/check_systemd) failed: No such file or directory
12:57 <+swhbot> icinga PROBLEM: service check_systemd on search0.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - cloud-init.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on bojimans.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on hypervisor3.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - ceph-mgr@hypervisor.service: failed, startup_time is 3060 (outside range 0:120)
12:57 <+swhbot> icinga PROBLEM: service check_systemd on webapp1.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - cloud-init.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service check_systemd on uffizi.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on worker03.euwest.azure.internal.softwareheritage.org is UNKNOWN: Check command 'check_systemd' does not exist.
12:57 <+swhbot> icinga PROBLEM: service check_systemd on vault.internal.staging.swh.network is CRITICAL: SYSTEMD CRITICAL - networking.service: failed, smartd.service: failed
12:57 <+swhbot> icinga PROBLEM: service disk /var/lib/vz on hypervisor3.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /var/lib/vz 5175 MB (13% inode=99%);
...

Among those, some illegitimate alerts were raised about SMART errors on VMs (smartd cannot monitor disks on virtual machines).

The Puppet manifests were adapted to remove those checks from those machines; a hedged sketch of the idea follows.
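
One way to express the exclusion, assuming it is keyed on the is_virtual fact and
passed through check_systemd's -e/--exclude option; the profile and parameter names
are illustrative, not the actual swh-site code:

```puppet
# Hedged sketch, not the real swh-site diff.
class profile::icinga2::check_systemd (
  # smartd cannot monitor disks on virtual machines, so skip its unit
  # there; bare-metal hosts keep the full check.
  Array[String] $excluded_units = $facts['is_virtual'] ? {
    true    => ['smartd.service'],
    default => [],
  },
) {
  # Turn the exclusions into repeated -e flags on the plugin invocation;
  # the resulting command line is then handed to the Icinga service
  # definition (omitted here).
  $exclude_flags = $excluded_units.map |$unit| { "-e ${unit}" }
  $command_line  = join(['/usr/lib/nagios/plugins/check_systemd'] + $exclude_flags, ' ')
}
```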

As a result, the alerts went back to green [1] (no alert), and only real alerts
about those services should pop up from now on.

This makes the overall maintenance less noisy \o/

[1]

...
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker1.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker3.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on counters0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker2.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on deposit.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga PROBLEM: service check_systemd on somerset.internal.softwareheritage.org is CRITICAL: SYSTEMD CRITICAL - smartd.service: failed
16:23 <+swhbot> icinga RECOVERY: service check_systemd on worker0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on webapp.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on rp0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on search0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:23 <+swhbot> icinga RECOVERY: service check_systemd on vault.internal.staging.swh.network is OK: SYSTEMD OK - all
16:24 <+swhbot> icinga RECOVERY: service check_systemd on worker13.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:25 <+swhbot> icinga RECOVERY: service check_systemd on vangogh.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:28 <+swhbot> icinga PROBLEM: service check_systemd on kelvingrove.internal.softwareheritage.org is WARNING: SYSTEMD WARNING - startup_time is 63.41 (outside range 0:60)
16:28 <+swhbot> icinga RECOVERY: service check_systemd on webapp1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:28 <+swhbot> icinga RECOVERY: service check_systemd on bardo.internal.admin.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on storage01.euwest.azure.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on search-esnode0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:29 <+ardumont> ack
16:29 <+swhbot> icinga RECOVERY: service check_systemd on jenkins-debian1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on search1.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on boatbucket.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on worker17.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga PROBLEM: service check_systemd on giverny.softwareheritage.org is WARNING: SYSTEMD WARNING - startup_time is 81.46 (outside range 0:60)
16:29 <+swhbot> icinga RECOVERY: service check_systemd on rp1.internal.admin.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on objstorage0.internal.staging.swh.network is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on somerset.internal.softwareheritage.org is OK: SYSTEMD OK - all
16:29 <+swhbot> icinga RECOVERY: service check_systemd on counters1.internal.softwareheritage.org is OK: SYSTEMD OK - all
ardumont claimed this task.