Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 03:27 on 25/07/2021 about a socket timeout [1]
- Then escalation: most public-facing services went down
- Analysis started the next morning, on 26/07
- First: status.softwareheritage.org manually updated to announce the issue on our channels
- Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
- Around noon, Ceph was identified as the culprit: it was spewing logs, which filled up the disk on /, which crashed all Ceph monitors, which made the RBD disks (backing most, if not all, VMs, including the firewalls) unavailable
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation, and deleted it from all monitors
- Fixed the storage issue and progressively restarted the nodes
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- Logs were still dumping too much information, though, dangerously close to triggering the initial issue again
- Stopped the workers
- For each host in {branly,hypervisor3,beaubourg} [3]:
- Cleaning up voluminous logs
- Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
- Restart ceph-mon@<host>
- Restart ceph-osd@*
- Restart ceph-mgr@<host>
- ... Investigation and service restarts still ongoing
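The log offload step above (copy the huge ceph.log to saam, then clear it on the monitors) can be sketched as below. The destination path comes from the notes; the source path /var/log/ceph.log, the `run` helper, and the dry-run guard are assumptions for illustration, and truncation is shown as a safer variant of deletion.

```shell
#!/bin/sh
# Sketch of the ceph.log offload. DRY_RUN=1 (the default) only prints the
# commands so the sequence can be reviewed; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

LOG=/var/log/ceph.log    # assumed source path on each monitor
run gzip -k "$LOG"       # -k keeps the original while producing ceph.log.gz
run scp "$LOG.gz" saam:/srv/storage/space/logs/ceph.log.gz
run truncate -s 0 "$LOG" # truncate rather than rm: running daemons keep a valid fd
```

Truncating instead of deleting avoids the classic pitfall where a removed-but-open log file keeps consuming disk space until the daemon is restarted.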
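The per-host restart sequence above (ceph-mon, then the OSDs, then ceph-mgr, across the hosts in [3]) can be sketched as a loop. The `restart_ceph` helper and the dry-run behaviour are illustrative assumptions; ceph-osd.target is the systemd target covering all ceph-osd@* instances on a host.

```shell
#!/bin/sh
# Dry-run sketch of the restart sequence from the notes; set DO_IT=1 to
# actually execute over ssh. Monitors first (to regain quorum), then OSDs,
# then managers.
restart_ceph() {
  host=$1
  for unit in "ceph-mon@$host" ceph-osd.target "ceph-mgr@$host"; do
    if [ "${DO_IT:-0}" = 1 ]; then
      ssh "$host" systemctl restart "$unit"
    else
      echo "+ ssh $host systemctl restart $unit"
    fi
  done
}

for host in branly hypervisor3 beaubourg; do
  restart_ceph "$host"
done
```

Restarting the monitors before the OSDs matters: without monitor quorum, OSDs cannot rejoin the cluster map, so restarting them first would be wasted work.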
[1] P1099
[2] https://branly.internal.softwareheritage.org:8006/
[3] The main hypervisors our infrastructure relies on
[4] The most likely fix was the restart of ceph-osd@*, which somehow dropped the
replayed operations that were dumping lots of log errors.