Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 03:27 on 25/07/2021 about a socket timeout [1]
- Then escalation: most public-facing services went down
- Analysis started the next morning, on 26/07
- First: status.softwareheritage.org manually updated to announce the issue on our channels
- Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
- Around noon, Ceph was identified as the culprit: it was spewing logs, which filled up the disk on /, which crashed all Ceph monitors, which made the RBD disks (backing most, if not all, VMs, including the firewalls) unavailable
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation, and deleted it from all monitors
- Fixed the storage issue and progressively restarted the nodes
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- Logs were still dumping too much information, though, dangerously close to triggering the initial issue again
- Stopped the workers
- For each host in {branly,hypervisor3,beaubourg} [3]:
- Cleaning up voluminous logs
- Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
- Restart ceph-mon@<host>
- Restart ceph-osd@*
- Restart ceph-mgr@<host>
- ... Investigation and service restarts still ongoing
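The log offload step above (copy the huge ceph.log to saam, then clear it on the monitors) can be sketched as below. The destination path comes from the notes; the source path /var/log/ceph.log, the `run` helper, and the dry-run guard are assumptions for illustration, and truncation is shown as a safer variant of deletion.

```shell
#!/bin/sh
# Sketch of the ceph.log offload. DRY_RUN=1 (the default) only prints the
# commands so the sequence can be reviewed; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

LOG=/var/log/ceph.log    # assumed source path on each monitor
run gzip -k "$LOG"       # -k keeps the original while producing ceph.log.gz
run scp "$LOG.gz" saam:/srv/storage/space/logs/ceph.log.gz
run truncate -s 0 "$LOG" # truncate rather than rm: running daemons keep a valid fd
```

Truncating instead of deleting avoids the classic pitfall where a removed-but-open log file keeps consuming disk space until the daemon is restarted.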
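The per-host restart sequence above (ceph-mon, then the OSDs, then ceph-mgr, across the hosts in [3]) can be sketched as a loop. The `restart_ceph` helper and the dry-run behaviour are illustrative assumptions; ceph-osd.target is the systemd target covering all ceph-osd@* instances on a host.

```shell
#!/bin/sh
# Dry-run sketch of the restart sequence from the notes; set DO_IT=1 to
# actually execute over ssh. Monitors first (to regain quorum), then OSDs,
# then managers.
restart_ceph() {
  host=$1
  for unit in "ceph-mon@$host" ceph-osd.target "ceph-mgr@$host"; do
    if [ "${DO_IT:-0}" = 1 ]; then
      ssh "$host" systemctl restart "$unit"
    else
      echo "+ ssh $host systemctl restart $unit"
    fi
  done
}

for host in branly hypervisor3 beaubourg; do
  restart_ceph "$host"
done
```

Restarting the monitors before the OSDs matters: without monitor quorum, OSDs cannot rejoin the cluster map, so restarting them first would be wasted work.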
[1] P1099
[2] https://branly.internal.softwareheritage.org:8006/
[3] The main hypervisors our infrastructure relies on
[4] The most likely fix was the restart of ceph-osd@*, which somehow dropped the
replayed operations that were dumping lots of log errors.