
26/07/2021: Unstuck infrastructure outage then post-mortem
Started, Work in Progress, High, Public

Description

Unstuck infrastructure.

What happened so far:

  • Icinga alerts (IRC notifications) around 25/07/2021 03:27 about a socket timeout [1]
  • Then escalation: most public-facing services went down
  • Analysis started the next morning (26/07)
  • First: status.softwareheritage.org manually updated to notify about the issue on our channels
  • Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
  • Used the iDRAC connection and serial console to access the hypervisor(s) and analyze the trouble
  • Around noon, identified that ceph was spewing lots of logs, which filled the disk on /, which crashed all ceph monitors, which made the RBD disks (used for all/most VMs, including the firewalls) unavailable
  • Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation
  • Deleted /var/log/ceph.log from all ceph monitors to free disk space
  • Restarted the ceph monitors
  • Restarted hypervisor3 (around noon), which looked particularly in pain (hence the version discrepancy noted later)
  • Progressive restart of VMs
  • This unstuck most services
  • Updated status.softwareheritage.org with a partial service disruption notification
  • Logs were still dumping too much information, though, dangerously close to retriggering the initial issue
  • Stopping workers
  • for host in {branly,hypervisor3,beaubourg} [3]
    • Cleaning up voluminous logs
    • Noticed a version discrepancy: 14.2.16 for {branly,beaubourg} vs 14.2.22 for hypervisor3 [4]
    • Restart ceph-mon@<host>
    • Restart ceph-osd@*
    • Restart ceph-mgr@<host>
  • ... Investigation continues and service restarts are ongoing
  • VMs/services restarted progressively over the 26-27/07 period, with extra monitoring of hypervisor statuses through a grafana dashboard [5]
  • The investigation has not yet identified the source of the issue
  • The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
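The log-salvage step above (keep a compressed copy, then delete to free /) can be sketched as a small helper. The paths in the incident were /var/log/ceph.log and saam:/srv/storage/space/logs/ceph.log.gz; the function name and interface below are a hypothetical generalization, not the actual commands used:

```shell
# salvage_log SRC DST: keep a compressed copy of a runaway log file for
# later investigation, then delete the original to free disk space.
# A sketch of what was done by hand on the ceph monitors.
salvage_log() {
    src="$1"; dst="$2"
    gzip -c "$src" > "$dst" && rm -f "$src"
}
```

Something like `salvage_log /var/log/ceph.log /srv/storage/space/logs/ceph.log.gz` (run on or copied to saam) would match the steps listed above.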

[1] P1099

[2] https://branly.internal.softwareheritage.org:8006/

[3] The main hypervisors our infrastructure relies upon

[4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the
replaying of instructions that were dumping lots of log errors.

[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1

[6] https://status.softwareheritage.org/

Event Timeline

ardumont changed the task status from Open to Work in Progress.Jul 26 2021, 2:09 PM
ardumont triaged this task as High priority.
ardumont created this task.
vlorentz updated the task description.
vlorentz updated the task description.
vlorentz updated the task description.

Potential issues/weakness of our current infra:

  • firewalls are virtual machines whose root storage is on ceph, so when ceph is down, we have no easy way of connecting to the hypervisors/ceph machines
  • ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),
  • (some) sysadmin documentation should remain accessible when most of the infrastructure is down, especially adm (iDRAC) IP addresses
  • console access to any bare-metal machine should be possible using cli tooling (ipmi)
  • console access to virtual machines should be possible from the hypervisors (e.g. qm terminal <vmid>; just a matter of starting getty@ttyS0.service on the VMs)
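For reference, the console-access paths listed above map to commands roughly like these (the IP, credentials, and VM id are placeholders; the VM needs a serial device configured before `qm terminal` can attach to it):

```
# Bare metal: serial-over-LAN through the iDRAC, using ipmitool
ipmitool -I lanplus -H <idrac-adm-ip> -U <user> -P <password> sol activate

# VM from the hypervisor (Proxmox): attach to its serial console
qm set <vmid> -serial0 socket        # once, to give the VM a serial port
qm terminal <vmid>

# Inside the VM: provide a login prompt on that serial port
systemctl enable --now getty@ttyS0.service
```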

Thanks for looking into this.
What about sending logs to a separate dedicated logging machine instead of storing them locally?

firewalls are virtual machines whose root storage is on ceph, so when ceph is down, we
have no easy way of connecting to the hypervisors/ceph machines

Right. Discussing this with @vsellier, who helped a lot to untangle this mess, I
recalled we mentioned making those real hardware (and most likely machines not depending on ceph).

ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),

P1099 and earlier logs from around that time do not seem to warn about this... T3445
got created for this.

(some) sysadmin documentation should be accessible when most of the infrastructure is
down, especially adm (idrac) IP addresses

That bumps up the priority of the task to move the sysadmin docs to Sphinx [1] (which is
deployed to docs.softwareheritage.org) [2]. If that goes down, we should still have
access to a local git repository of swh-docs.

console access to any bare metal machine should be possible using cli tooling (ipmi)

Yes, and we have something for this in the (then unavailable) docs [3].
Although, for some reason, it would not work at the time...

console access to virtual machines should be possible from hypervisors (eg. qm
terminal <vmid>, just a matter of starting getty@ttyS0.service on VMs)

Thanks for the heads up on this.

What about sending logs to a separate dedicated logging machine instead of storing them locally?

That'd make sense. We do this for the rest of the infrastructure: machines send their
logs to a logstash0 VM that pushes them to elasticsearch.

Maybe we should extend that scheme to the dedicated machines as well.
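For a non-ES fallback, classic rsyslog forwarding is a one-line client config. A sketch, assuming TCP syslog on the conventional port and the logstash0 host named above (the FQDN and port are assumptions):

```
# /etc/rsyslog.d/remote.conf -- forward everything to a central collector
# @@ = TCP, single @ = UDP; 514 is the conventional syslog port
*.* @@logstash0.internal.softwareheritage.org:514
```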

[1] T3154

[2] docs.s.o remained accessible throughout this incident.

[3] https://intranet.softwareheritage.org/wiki/IDRAC#Connect_to_the_serial_console

ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),

P1099 and earlier logs from around that time do not seem to warn about this... T3945
got created for this.

On this front, it's worth saying that when things started to go sideways, the ENOSPC occurred so quickly that no notification/alert system would have prevented it (short of a 24/7 on-duty sysadmin team).

What about sending logs to a separate dedicated logging machine instead of storing them locally?

That'd make sense. We do this for the rest of the infrastructure: machines send their
logs to a logstash0 VM that pushes them to elasticsearch.

Maybe we should extend that scheme to the dedicated machines as well.

I used to use a central log system (a simple rsyslog config, not ES-based) in my previous life, and it was pretty useful (I have a hard time using logstash TBH, but that system was not as complex as the SWH stack). However, you always want to keep a local copy of the logs on the machines (at least the last x hours of logs), so that does not really address the kind of situation we went through yesterday (unless there is an efficient way of configuring the logging system to only keep x amount of space locally).

Also, in this case, the log file was not generated by the system's logging service but directly by a ceph daemon (if I'm not mistaken).

unless there is an efficient way of configuring the logging system to only keep x amount of space locally

we could configure logrotate to also rotate based on file size (/var/log/ceph.log is already logrotated, but only based on time). It does not guarantee there will always be free space, but rotating on size would have delayed the issue by at least a few hours, and may have triggered monitoring alerts, as it would cause the disk space to spike from time to time.
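A sketch of such a logrotate stanza; `maxsize` rotates at the scheduled interval, or earlier if the file exceeds the size (the 500M threshold and retention count are arbitrary assumptions):

```
/var/log/ceph.log {
    daily
    maxsize 500M      # rotate early if the file grows past 500 MB
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```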

ardumont renamed this task from 26/07/2021: Unstuck infrastructure which went down then post-mortem to 26/07/2021: Unstuck infrastructure outage then post-mortem.Jul 28 2021, 9:08 AM

FTR, I've tried to investigate a bit to find clues about the origin of the outage, but I did not find any obvious culprit.

I have not been able to dig into the fw machines' logs, however (cannot access them).

my understanding of the logs is that three OSDs (osd.3, osd.11, and osd.14 IIRC) were suddenly all stuck "waiting for subops". Unfortunately, while this is easy to debug when the issue is happening (dump_ops_in_flight), I didn't think of it at the time, given the hurry to fix the issue...
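For the record, the in-flight ops dump mentioned above is queried through the OSD's admin socket while the issue is live, e.g.:

```
# Must be run on the host where the OSD daemon lives, while it is stuck
ceph daemon osd.3 dump_ops_in_flight
```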

So we are left to guess.

This can usually mean two things:

  1. disk IO is extremely slow on that OSD
  2. this OSD is the leader of a given PG and waiting for replication from the other OSDs

(note that option 2 on an OSD can be caused by option 1 on a replica)

However, I think we can rule out option 1, because, say, if osd.3 was stuck, it would be stuck for all PGs, so other OSDs than osd.11 and osd.14 that replicate its PGs would be erroring too.

As to what could cause issue 2... my guesses are:

  1. two stuck TCP connections between these OSDs (but other OSDs on the same machines were fine, so probably not a low-level issue??)
  2. bug in one or all three of these OSDs that caused one of their threads to be stuck
  3. (unlikely) bug in a monitor that somehow made these OSDs in inconsistent states?

I have not been able to dig in fw machines logs however

wasn't their issue entirely caused by Ceph being down?

We looked with @ardumont to see if we could find anything relevant to the incident.
The OSD logs were rotated and removed from the servers, so there is nothing left that can help diagnose the problem.
This shows it's important to send all the logs to a third-party system like ELK.

So after digging a little, some actions may be needed (dedicated tasks will be created):

  • deploy filebeat on the hypervisors and send the ceph and proxmox logs to the ELK cluster. They can be sent to dedicated indices if we want to manage data retention finely
  • now that the firewalls are the center of the infrastructure, they should be converted to bare-metal servers to reduce the third-party dependency. We may prepare a quote for 2 small servers.
  • add ceph monitoring, possibly by activating the prometheus module (https://docs.ceph.com/en/nautilus/mgr/prometheus/) and/or using the ceph -s output to create icinga alerts
  • check whether relevant proxmox monitoring can be created (e.g. try to detect stalled VMs)
  • add an external monitoring probe to check the availability of archive.s.o in case icinga is down too
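For the `ceph -s`-based icinga alert, the core of a check plugin is just a mapping from ceph's health keyword to standard nagios/icinga exit codes. A minimal sketch (the function name and the wiring at the bottom are assumptions about how we'd plug it in):

```shell
# Map ceph's health keyword (first word of `ceph health` output) to
# standard nagios/icinga plugin exit codes.
ceph_health_to_exit_code() {
    case "$1" in
        HEALTH_OK)   echo 0 ;;   # OK
        HEALTH_WARN) echo 1 ;;   # WARNING
        HEALTH_ERR)  echo 2 ;;   # CRITICAL
        *)           echo 3 ;;   # UNKNOWN
    esac
}

# In the real plugin, something like:
#   status=$(ceph health | awk '{print $1}')
#   exit "$(ceph_health_to_exit_code "$status")"
```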

Not directly related, but also important:

  • there is a problem on the firewall with an openvpn internal rule blocking access to the GUI of the standby firewall; it should be fixed.
  • tag the logs per environment to make searching easier in kibana (staging / production + location rocq / azure)

During the investigation we have found some interesting stuff:

  • some ceph/proxmox logs are already ingested in ELK through journalbeat (osd, pve) [1]
  • when the VMs start to become unresponsive, an identifiable log line relating to a qmp timeout starts to appear in the hypervisor logs. Perhaps we could use this log to generate a critical icinga alert
  • we discovered an interesting command, pvereport, which generates a complete report on the ceph cluster. Not sure it would have worked during the incident, but there is a lot of interesting information in the output.

[1] http://kibana0.internal.softwareheritage.org:5601/goto/5248c5e6a2bde55df3124a4e7cdd72ab

vsellier changed the status of subtask T3462: Add proxmox / ceph monitoring from Open to Work in Progress.

One other improvement may be to modify the workers' profile a bit (to reduce the load on the ceph cluster):

  • lower the replication factor for the workers' volumes (or even use local storage, but that requires more storage on the hypervisors than we currently have),
  • (probably not very relevant, but) stop having swap on the workers (since this swap ends up on a ceph volume, and is thus replicated etc.) (oh, this has been done already, good)

(TBH my feeling is that workers should be disposable machines, not backed by ceph at all).

but that requires some more storage on hypervisors we currently don't have

Don't the hypervisors also serve as OSDs? We could just get a disk per hypervisor (partially?) out of the ceph cluster and use it for the workers' /tmp, or even their whole disk.

but that requires some more storage on hypervisors we currently don't have

Don't the hypervisors also serve as OSDs? We could just get a disk per hypervisor (partially?) out of the ceph cluster and use it for the workers' /tmp, or even their whole disk.

yes, there are 5 hypervisors, 3 of them forming the ceph cluster (serving OSDs and acting as monitors).

Among the possibilities, adding storage to those to make it available as local storage for VMs is one of them. Whether to host the whole disk vs just /tmp is another discussion. I would prefer the whole disk, since we don't really need workers to be extra reliable or hot-movable on the Proxmox cluster.

And we could also use zfs-backed thin provisioning for the / of the workers to save storage space (and possibly help ensure consistency of deployed workers... not extra convinced of this latter point)
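Thin provisioning here would mean creating the workers' root disks as sparse zvols, something like this (the pool and volume names are hypothetical):

```
# -s makes the zvol sparse: space is only allocated as the guest writes
zfs create -s -V 32G rpool/data/vm-120-disk-0
```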

Ceph status went back to OK with these actions:

  • Clean up the crash history
    • to check the status:
      ceph crash ls
      ceph crash info <id>
    • to clean up:
      ceph crash archive-all
      ceph config set mon mon_warn_on_insecure_global_id_reclaim false
      ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false