
Monitor hypervisors as well
Closed, Resolved (Public)

Description

Ensure critical checks are properly monitored on hypervisors as well.
At least ENOSPC.

Related to T3444

Event Timeline

ardumont created this task.

The icinga2 package is already installed on the hypervisors, so they are already monitored [1].
Checking on pergamon (the icinga master), they are already referenced there as well [2].

[1]

root@pergamon:~# grep "hypervisors:" /etc/clustershell/groups
hypervisors: beaubourg hypervisor3 branly pompidou uffizi
root@pergamon:~# clush -b -w @hypervisors "dpkg -l icinga2" | grep icinga2
ii  icinga2        2.12.5-1.buster amd64        host and network monitoring system

[2]

root@pergamon:~# find /etc/icinga2/zones.d/ -iname "*pompidou*" -o -iname "*uffizi*" -o -iname "*hypervisor3*" -o -iname "*beaubourg*" -o -iname "*branly*"
/etc/icinga2/zones.d/master/uffizi.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/hypervisor3.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/branly.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/beaubourg.softwareheritage.org.conf
/etc/icinga2/zones.d/master/pompidou.internal.softwareheritage.org.conf
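The same verification can be scripted so that a missing host stands out instead of having to be spotted by eye in the `find` output. This is only a minimal sketch: the function name `missing_host_configs` is hypothetical, and it assumes each host is declared in a file named `<host>.<something>conf` directly inside the given directory, as in the listing above.

```shell
# Print every host that has no "<host>.*conf" file in the given directory.
# Usage: missing_host_configs DIR HOST...
# (hypothetical helper, not part of icinga2 or clustershell)
missing_host_configs() {
    dir=$1
    shift
    for host in "$@"; do
        found=no
        for f in "$dir"/"$host".*conf; do
            # When the glob matches nothing, $f is the literal pattern,
            # which does not exist as a file.
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || echo "$host"
    done
}
```

Called as `missing_host_configs /etc/icinga2/zones.d/master beaubourg hypervisor3 branly pompidou uffizi`, it prints only the hosts for which no file was found.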

So, a priori, after discussing with fellow sysadmins on IRC: it did happen, but the disk
filled up so fast that the monitoring did not have time to warn about anything...

Quoting the discussion part [1].

I've duplicated (and edited) the autogenerated Grafana dashboard targeted by olasd [1].
I've added the Grafana "outage" and "maintenance" tags. That way, we can confirm what was
discussed here: the outage happens first, then the logs start growing.

So yeah, this task is invalid. Closing.

[1]

10:26 <+olasd> ardumont: I don't think the disk use ever got above the icinga warning threshold before the outage https://grafana.softwareheritage.org/d/4StZ2qbWz/filesystem-sizes-auto-generated?orgId=1&refresh=10s&var-host=beaubourg&var-target=beaubourg.internal.softwareheritage.org&var-filesystem=All&from=now-7d&to=now
10:27 <+olasd> (the warning threshold is 20% free, 10% free for critical)
10:28 <+olasd> (There are some warnings here /after/ the outage)
10:28 <+olasd> (so they do work)

[2] https://grafana.softwareheritage.org/goto/ZXokQUZ7z?orgId=1
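To make the thresholds quoted by olasd concrete, here is a small sketch that maps a percentage of free space to a check state. The function name `classify_free` is hypothetical, and treating "less than 20% free" as warning and "less than 10% free" as critical (strict comparisons, integer input) is an assumption about how the boundaries are handled, not something taken from the actual icinga configuration.

```shell
# classify_free PERCENT_FREE
# Prints OK, WARNING or CRITICAL for an integer percentage of free space.
# Thresholds from the discussion: warning at 20% free, critical at 10% free.
# (hypothetical helper; boundary handling is an assumption)
classify_free() {
    free=$1
    if [ "$free" -lt 10 ]; then
        echo CRITICAL
    elif [ "$free" -lt 20 ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

With made-up samples such as `for p in 35 34 2; do classify_free "$p"; done`, the states go straight from OK to CRITICAL: a disk that fills quickly between two checks never passes through a visible WARNING, which is consistent with what the dashboard shows.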

ardumont claimed this task.
ardumont changed the task status from Invalid to Resolved. Jul 29 2021, 1:23 PM
ardumont moved this task from Backlog to done on the System administration board.