Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 25/07/2021 03:27 reporting socket timeouts [1]
- Then the situation escalated and most public-facing services went down
- Analysis started the next morning (26/07)
- First: status.softwareheritage.org was manually updated to announce the issue on our communication channels
- Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
- Used iDRAC connections and the serial console to access the hypervisor(s) and investigate
- Around noon, Ceph was identified as the culprit: it was spewing a huge volume of logs, which filled the disk on /, which crashed all the Ceph monitors, which in turn made the RBD disks (used by all/most VMs, including the firewalls) unavailable
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation
- Deleted /var/log/ceph.log from all Ceph monitors to free disk space
- Restarted the Ceph monitors (a command sketch for these steps is given after this list)
- Restarted hypervisor3 (around noon), which looked particularly unhealthy (hence the version discrepancy noted later)
- Progressive restart of VMs
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- Logs were still dumping too much information, though, bringing us dangerously close to the initial issue again
- Stopped the workers
- For each host in {branly,hypervisor3,beaubourg} [3] (see the per-host loop sketched after this list):
- Cleaned up voluminous logs
- Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
- Restarted ceph-mon@<host>
- Restarted ceph-osd@*
- Restarted ceph-mgr@<host>
- ... Investigation continues and service restarts are ongoing (a few health-check commands are sketched after this list)
- VMs/services were restarted progressively over the 26-27/07 period, with extra monitoring of hypervisor status through a Grafana dashboard [5]
- The investigation has not yet identified the root cause of the issue
- The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
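
Command sketch for the monitor log cleanup and restart, to be run on each Ceph monitor node. The file path, the saam destination and the unit name come from the notes above; the exact commands are an assumed reconstruction, not a verbatim record:

    # Assumed reconstruction of the cleanup on a monitor node
    df -h /                                     # confirm / is full
    gzip -c /var/log/ceph.log \
      | ssh saam 'cat > /srv/storage/space/logs/ceph.log.gz'   # keep a compressed copy for investigation
    rm /var/log/ceph.log                        # free disk space on /
    systemctl restart ceph-mon@$(hostname -s)   # space held by the open file is fully released once the daemon restarts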
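
Per-host loop sketched for the {branly,hypervisor3,beaubourg} step. The host names, daemon units and observed versions come from the notes; the ssh loop and the du/version checks are assumptions about how the steps could be scripted:

    # Assumed shape of the per-host cleanup and daemon restart
    for host in branly hypervisor3 beaubourg; do
      ssh "$host" '
        ceph --version                          # showed 14.2.16 (branly, beaubourg) vs 14.2.22 (hypervisor3)
        du -sh /var/log/ceph* 2>/dev/null       # spot the voluminous logs before cleaning them up
        systemctl restart ceph-mon@"$(hostname -s)"
        systemctl restart "ceph-osd@*"          # the glob restarts every OSD instance on the host
        systemctl restart ceph-mgr@"$(hostname -s)"
      '
    done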
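
Health checks that can be run while the investigation continues; these are standard Ceph/GNU tools listed as a suggestion, not a record of what was actually executed:

    ceph -s              # overall cluster status: monitor quorum, OSDs, PGs
    ceph health detail   # details for any current warning or error
    ceph versions        # daemon versions across the cluster (to track the 14.2.x discrepancy)
    df -h /              # ensure logs are not filling the root filesystem again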
[1] P1099
[2] https://branly.internal.softwareheritage.org:8006/
[3] The main hypervisors our infrastructure relies upon
[4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the replaying of instructions that were dumping lots of log errors.
[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1
[6] https://status.softwareheritage.org/