Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 03:27 on 25/07/2021 about a socket timeout [1]
- Then escalation: most public-facing services went down
- Analysis started the next morning (26/07)
- First: status.softwareheritage.org was manually updated to notify about the issue on our channels
- Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
- Used iDRAC connections and serial consoles to access the hypervisors and investigate
- Around noon, identified Ceph as the culprit: it was spewing lots of logs, which filled up the disk on /, which crashed all Ceph monitors, which in turn made the RBD disks (backing most/all VMs, including the firewalls) unavailable
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation (see the log cleanup sketch below)
- Deleted /var/log/ceph.log from all Ceph monitors to free disk space
- Restarted the Ceph monitors
- Restarted hypervisor3 (around noon), which looked particularly unhealthy (hence the version discrepancy noted later)
- Progressive restart of VMs
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- Logs were still dumping too much information though, dangerously close to reproducing the initial issue
- Stopped the workers (see the worker stop sketch below)
- for host in {branly,hypervisor3,beaubourg} [3] (see the restart loop sketch below)
- Cleaning up voluminous logs
- Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
- Restart ceph-mon@<host>
- Restart ceph-osd@*
- Restart ceph-mgr@<host>
- ... Investigation continues and service restarts are ongoing
- VMs/services restarted progressively over the 26-27/07 period, with extra monitoring of hypervisor status through the Grafana dashboard [5] (see the health check sketch below)
- The investigation has not yet identified the root cause of the issue
- The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
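
Log cleanup sketch: a minimal reconstruction of the disk check and monitor recovery steps above, not a verbatim transcript of what was run; the mon id is assumed to match the short hostname.

    # Identify what is filling the root filesystem on a monitor node
    df -h /
    du -sh /var/log/* | sort -h

    # Archive the huge cluster log on saam for later analysis (path from the notes above)
    gzip -c /var/log/ceph.log | ssh saam 'cat > /srv/storage/space/logs/ceph.log.gz'

    # Free the disk space, then bring the monitor back
    rm /var/log/ceph.log
    systemctl restart "ceph-mon@$(hostname -s)"   # mon id assumed to be the short hostname
    ceph -s                                       # confirm monitor quorum is restored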
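
Worker stop sketch: what stopping the workers might look like, assuming they run as systemd template units; the swh-worker@* unit name is an assumption, not taken from the notes.

    # Unit name is hypothetical; adjust to the actual worker units
    systemctl stop 'swh-worker@*'
    systemctl list-units 'swh-worker@*' --state=running   # verify nothing is still running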
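
Restart loop sketch: the per-hypervisor cleanup and restart sequence described above, assuming the ceph-mon/ceph-mgr instance ids match the short hostnames and that SSH access was usable again at that point.

    for host in branly hypervisor3 beaubourg; do
        ssh "$host" '
            # Drop the voluminous, already-archived log
            rm -f /var/log/ceph.log

            # Restart the Ceph daemons running on this hypervisor
            systemctl restart "ceph-mon@$(hostname -s)"
            systemctl restart "ceph-osd@*"
            systemctl restart "ceph-mgr@$(hostname -s)"
        '
    done

    # Check which daemons run which release (14.2.16 vs 14.2.22 in this incident)
    ceph versions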
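
Health check sketch: CLI-side checks that can complement the Grafana dashboard [5] while VMs are brought back progressively.

    # Cluster state: monitor quorum, OSD up/in counts, PG health
    ceph -s
    ceph health detail

    # Make sure / is not filling up with logs again
    df -h / /var/log
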
[1] P1099
[2] https://branly.internal.softwareheritage.org:8006/
[3] The main hypervisors our infrastructure relies upon
[4] The most likely fix happened when restarting ceph-osd@*, which apparently dropped the replay of instructions that was dumping lots of log errors.
[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1
[6] https://status.softwareheritage.org/