Ensure critical checks are properly monitored on hypervisors as well.
At least ENOSPC.
Related to T3444
The icinga2 package is already installed on the hypervisors, so they are already monitored [1].
Checking on pergamon (the icinga master), they are already referenced there as well [2].
[1]
root@pergamon:~# grep "hypervisors:" /etc/clustershell/groups
hypervisors: beaubourg hypervisor3 branly pompidou uffizi
root@pergamon:~# clush -b -w @hypervisors "dpkg -l icinga2" | grep icinga2
ii icinga2 2.12.5-1.buster amd64 host and network monitoring system
[2]
root@pergamon:~# find /etc/icinga2/zones.d/ -iname "*pompidou*" -o -iname "*uffizi*" -o -iname "*hypervisor3*" -o -iname "*beaubourg*" -o -iname "*branly*"
/etc/icinga2/zones.d/master/uffizi.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/hypervisor3.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/branly.internal.softwareheritage.org.conf
/etc/icinga2/zones.d/master/beaubourg.softwareheritage.org.conf
/etc/icinga2/zones.d/master/pompidou.internal.softwareheritage.org.conf
After discussing with a fellow sysadmin on IRC: a priori, the disk did fill up, but it
happened so fast that icinga did not have time to raise any warning...
Quoting the discussion part [1].
I've duplicated (and edited) the autogenerated grafana dashboard targeted by olasd [1].
I've added the grafana "outage" and "maintenance" tags. That way, we can confirm what was
discussed here: the outage happened first, then the logs started growing.
So yeah, this task is invalid. Closing.
[1]
10:26 <+olasd> ardumont: I don't think the disk use ever got above the icinga warning threshold before the outage https://grafana.softwareheritage.org/d/4StZ2qbWz/filesystem-sizes-auto-generated?orgId=1&refresh=10s&var-host=beaubourg&var-target=beaubourg.internal.softwareheritage.org&var-filesystem=All&from=now-7d&to=now
10:27 <+olasd> (the warning threshold is 20% free, 10% free for critical)
10:28 <+olasd> (There are some warnings here /after/ the outage)
10:28 <+olasd> (so they do work)
[2] https://grafana.softwareheritage.org/goto/ZXokQUZ7z?orgId=1
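
For reference, here is a minimal sketch of the thresholds quoted above (warning at 20% free, critical at 10% free). This is not the actual icinga check: the function names are hypothetical, and treating the thresholds as strict "less than" comparisons is an assumption on my part.

```python
import shutil

# Thresholds as quoted in the discussion: warning at 20% free, critical at 10% free.
WARNING_FREE_PCT = 20.0
CRITICAL_FREE_PCT = 10.0


def classify_free_space(free_pct: float) -> str:
    """Map a free-space percentage to a state name.

    Assumption: a state triggers when free space drops strictly below
    its threshold; the real check may handle the boundary differently.
    """
    if free_pct < CRITICAL_FREE_PCT:
        return "CRITICAL"
    if free_pct < WARNING_FREE_PCT:
        return "WARNING"
    return "OK"


def free_percent(path: str = "/") -> float:
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    pct = free_percent("/")
    print(f"/ has {pct:.1f}% free: {classify_free_space(pct)}")
```

A filesystem that stays above 20% free until the outage never leaves the "OK" state, which matches what the dashboard shows: the warnings only appear after the outage.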