
26/07/2021: Unstuck infrastructure outage then post-mortem
Started, Work in Progress, High, Public

Description

Unstuck infrastructure.

What happened so far:

  • Icinga alerts (IRC notifications) around 25/07/2021 03:27 about a socket timeout [1]
  • Then escalation: most public-facing services went down
  • Analysis started the next morning (26/07)
  • First: status.softwareheritage.org manually updated to notify about the issue on our channels
  • Unable to get SSH access to any machine (sshd was hanging shortly after authentication)
  • Used the iDRAC connection and serial console to access the hypervisor(s) and analyze the trouble
  • Around noon, identified that ceph was spewing lots of logs, which filled the disk on /, which crashed all ceph monitors, which made the RBD disks (used for all/most VMs, including the firewalls) unavailable
  • Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation
  • Deleted /var/log/ceph.log from all ceph monitors to free disk space
  • Restarted the ceph monitors
  • Restarted hypervisor3 (around noon), which looked particularly in pain (hence the version discrepancy noted later)
  • Progressive restart of VMs
  • This unstuck most services
  • Updated status.softwareheritage.org with a partial service disruption notification
  • Logs were still dumping too much information, though, dangerously close to retriggering the initial issue
  • Stopping workers
  • for host in {branly,hypervisor3,beaubourg} [3]
    • Cleaning up voluminous logs
    • Noticed a version discrepancy: 14.2.16 for {branly,beaubourg} vs 14.2.22 for hypervisor3 [4]
    • Restart ceph-mon@<host>
    • Restart ceph-osd@*
    • Restart ceph-mgr@<host>
  • ... Investigation continues and service restarts are ongoing
  • VMs/services restarted progressively over the 26-27/07 period, with extra monitoring of hypervisor statuses through a grafana dashboard [5]
  • The investigation has not yet identified the source of the issue
  • The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
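The log-salvage step above (keep a compressed copy, then delete to free /) can be sketched as a small helper. The paths in the incident were /var/log/ceph.log and saam:/srv/storage/space/logs/ceph.log.gz; the function name and interface below are a hypothetical generalization, not the actual commands used:

```shell
# salvage_log SRC DST: keep a compressed copy of a runaway log file for
# later investigation, then delete the original to free disk space.
# A sketch of what was done by hand on the ceph monitors.
salvage_log() {
    src="$1"; dst="$2"
    gzip -c "$src" > "$dst" && rm -f "$src"
}
```

Something like `salvage_log /var/log/ceph.log /srv/storage/space/logs/ceph.log.gz` (run on or copied to saam) would match the steps listed above.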

[1] P1099

[2] https://branly.internal.softwareheritage.org:8006/

[3] The main hypervisors our infrastructure relies upon

[4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the
replaying of instructions that were dumping lots of log errors.

[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1

[6] https://status.softwareheritage.org/

Event Timeline

ardumont changed the task status from Open to Work in Progress.Jul 26 2021, 2:09 PM
ardumont triaged this task as High priority.
ardumont created this task.
vlorentz updated the task description.
vlorentz updated the task description.
vlorentz updated the task description.

Potential issues/weakness of our current infra:

  • firewalls are virtual machines whose root storage is on ceph, so when ceph is down, we have no easy way of connecting to the hypervisors/ceph machines
  • ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),
  • (some) sysadmin documentation should remain accessible when most of the infrastructure is down, especially adm (iDRAC) IP addresses
  • console access to any bare-metal machine should be possible using cli tooling (ipmi)
  • console access to virtual machines should be possible from the hypervisors (e.g. qm terminal <vmid>; just a matter of starting getty@ttyS0.service on the VMs)
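For reference, the console-access paths listed above map to commands roughly like these (the IP, credentials, and VM id are placeholders; the VM needs a serial device configured before `qm terminal` can attach to it):

```
# Bare metal: serial-over-LAN through the iDRAC, using ipmitool
ipmitool -I lanplus -H <idrac-adm-ip> -U <user> -P <password> sol activate

# VM from the hypervisor (Proxmox): attach to its serial console
qm set <vmid> -serial0 socket        # once, to give the VM a serial port
qm terminal <vmid>

# Inside the VM: provide a login prompt on that serial port
systemctl enable --now getty@ttyS0.service
```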

Thanks for looking into this.
What about sending logs to a separate dedicated logging machine instead of storing them locally?

firewalls are virtual machines whose root storage is on ceph, so when ceph is down, we
have no easy way of connecting to the hypervisors/ceph machines

Right. Discussing this with @vsellier, who helped a lot to untangle this mess, I
recalled we mentioned making those real hardware (and most likely machines not depending on ceph).

ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),

P1099 and earlier logs from around that time do not seem to warn about this... T3445
got created for this.

(some) sysadmin documentation should be accessible when most of the infrastructure is
down, especially adm (idrac) IP addresses

That bumps up the priority of the task to move the sysadmin docs to Sphinx [1] (which is
deployed to docs.softwareheritage.org) [2]. If that goes down, we should still have
access to a local git repository of swh-docs.

console access to any bare metal machine should be possible using cli tooling (ipmi)

Yes, and we have something for this in the (then unavailable) docs [3].
Although, for some reason, it would not work at the time...

console access to virtual machines should be possible from hypervisors (eg. qm
terminal <vmid>, just a matter of starting getty@ttyS0.service on VMs)

Thanks for the heads up on this.

What about sending logs to a separate dedicated logging machine instead of storing them locally?

That'd make sense. We do this for the rest of the infrastructure: machines send their
logs to a logstash0 VM that pushes them to elasticsearch.

Maybe we should extend that scheme to the dedicated machines as well.
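For a non-ES fallback, classic rsyslog forwarding is a one-line client config. A sketch, assuming TCP syslog on the conventional port and the logstash0 host named above (the FQDN and port are assumptions):

```
# /etc/rsyslog.d/remote.conf -- forward everything to a central collector
# @@ = TCP, single @ = UDP; 514 is the conventional syslog port
*.* @@logstash0.internal.softwareheritage.org:514
```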

[1] T3154

[2] docs.s.o remained accessible throughout this incident.

[3] https://intranet.softwareheritage.org/wiki/IDRAC#Connect_to_the_serial_console

ceph is not properly monitored (ENOSPC should not go unnoticed on these machines),

P1099 and earlier logs from around that time do not seem to warn about this... T3945
got created for this.

On this front, it's worth saying that when things started to go sideways, the ENOSPC occurred so quickly that no notification/alert system would have prevented it (short of a 24/7 on-duty sysadmin team).

What about sending logs to a separate dedicated logging machine instead of storing them locally?

That'd make sense. We do this for the rest of the infrastructure: machines send their
logs to a logstash0 VM that pushes them to elasticsearch.

Maybe we should extend that scheme to the dedicated machines as well.

I used to use a central log system (a simple rsyslog config, not ES-based) in my previous life, and it was pretty useful (I have a hard time using logstash TBH, but that system was not as complex as the SWH stack). However, you always want to keep a local copy of the logs on the machines (at least the last x hours of logs), so that does not really address the kind of situation we went through yesterday (unless there is an efficient way of configuring the logging system to only keep x amount of space locally).

Also, in this case, the log file was not generated by the system's logging service but directly by a ceph daemon (if I'm not mistaken).

unless there is an efficient way of configuring the logging system to only keep x amount of space locally

we could configure logrotate to also rotate based on file size (/var/log/ceph.log is already logrotated, but only based on time). It does not guarantee there will always be free space, but rotating on size would have delayed the issue by at least a few hours, and may have triggered monitoring alerts, as it would cause the disk space to spike from time to time.
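A sketch of such a logrotate stanza; `maxsize` rotates at the scheduled interval, or earlier if the file exceeds the size (the 500M threshold and retention count are arbitrary assumptions):

```
/var/log/ceph.log {
    daily
    maxsize 500M      # rotate early if the file grows past 500 MB
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```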

ardumont renamed this task from 26/07/2021: Unstuck infrastructure which went down then post-mortem to 26/07/2021: Unstuck infrastructure outage then post-mortem.Jul 28 2021, 9:08 AM

FTR, I've tried to investigate a bit to find clues about the origin of the outage, but I did not find any obvious culprit.

I have not been able to dig into the fw machines' logs, however (cannot access them).

my understanding of the logs is that three OSDs (osd.3, osd.11, and osd.14 IIRC) were suddenly all stuck "waiting for subops". Unfortunately, while this is easy to debug when the issue is happening (dump_ops_in_flight), I didn't think of it at the time, given the hurry to fix the issue...
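For the record, the in-flight ops dump mentioned above is queried through the OSD's admin socket while the issue is live, e.g.:

```
# Must be run on the host where the OSD daemon lives, while it is stuck
ceph daemon osd.3 dump_ops_in_flight
```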

So we are left to guess.

This can usually mean two things:

  1. disk IO is extremely slow on that OSD
  2. this OSD is the leader of a given PG and waiting for replication from the other OSDs

(note that option 2 on an OSD can be caused by option 1 on a replica)

However, I think we can rule out option 1, because, say, if osd.3 was stuck, it would be stuck for all PGs, so other OSDs than osd.11 and osd.14 that replicate its PGs would be erroring too.

As to what could cause issue 2... my guesses are:

  1. two stuck TCP connections between these OSDs (but other OSDs on the same machines were fine, so probably not a low-level issue??)
  2. bug in one or all three of these OSDs that caused one of their threads to be stuck
  3. (unlikely) bug in a monitor that somehow made these OSDs in inconsistent states?

I have not been able to dig in fw machines logs however

wasn't their issue entirely caused by Ceph being down?

We looked with @ardumont to see if we could find anything relevant to the incident.
The OSD logs were rotated and removed from the servers, so there is nothing left that can help diagnose the problem.
This shows it's important to send all the logs to a third-party system like ELK.

So after digging a little, some actions may be needed (dedicated tasks will be created):

  • deploy filebeat on the hypervisors and send the ceph and proxmox logs to the ELK cluster. They can be sent to dedicated indices if we want to manage data retention finely
  • now that the firewalls are the center of the infrastructure, they should be converted to bare-metal servers to reduce the third-party dependency. We may prepare a quote for 2 small servers.
  • add ceph monitoring, possibly by activating the prometheus module (https://docs.ceph.com/en/nautilus/mgr/prometheus/) and/or using the ceph -s output to create icinga alerts
  • check whether relevant proxmox monitoring can be created (e.g. try to detect stalled VMs)
  • add an external monitoring probe to check the availability of archive.s.o in case icinga is down too
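For the `ceph -s`-based icinga alert, the core of a check plugin is just a mapping from ceph's health keyword to standard nagios/icinga exit codes. A minimal sketch (the function name and the wiring at the bottom are assumptions about how we'd plug it in):

```shell
# Map ceph's health keyword (first word of `ceph health` output) to
# standard nagios/icinga plugin exit codes.
ceph_health_to_exit_code() {
    case "$1" in
        HEALTH_OK)   echo 0 ;;   # OK
        HEALTH_WARN) echo 1 ;;   # WARNING
        HEALTH_ERR)  echo 2 ;;   # CRITICAL
        *)           echo 3 ;;   # UNKNOWN
    esac
}

# In the real plugin, something like:
#   status=$(ceph health | awk '{print $1}')
#   exit "$(ceph_health_to_exit_code "$status")"
```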

Not directly related, but also important:

  • there is a problem on the firewall with an openvpn internal rule blocking access to the GUI of the standby firewall; it should be fixed.
  • tag the logs per environment to make searching easier in kibana (staging / production + location rocq / azure)

During the investigation we have found some interesting stuff:

  • some ceph/proxmox logs are already ingested in ELK through journalbeat (osd, pve) [1]
  • when the VMs start to become unresponsive, an identifiable log line relating to a qmp timeout starts to appear in the hypervisor logs. Perhaps we could use this log to generate a critical icinga alert
  • we discovered an interesting command, pvereport, which generates a complete report on the ceph cluster. Not sure it would have worked during the incident, but there is a lot of interesting information in the output.

[1] http://kibana0.internal.softwareheritage.org:5601/goto/5248c5e6a2bde55df3124a4e7cdd72ab

vsellier changed the status of subtask T3462: Add proxmox / ceph monitoring from Open to Work in Progress.

One other improvement may be to modify the workers' profile a bit (to reduce the load on the ceph cluster):

  • lower the replication factor for the workers' volumes (or even use local storage, but that requires more storage on the hypervisors than we currently have),
  • (probably not very relevant, but) stop having swap on the workers (since this swap ends up on a ceph volume, and is thus replicated etc.) (oh, this has been done already, good)

(TBH my feeling is that workers should be disposable machines, not backed by ceph at all).

but that requires some more storage on hypervisors we currently don't have

Don't the hypervisors also serve as OSDs? We could just get a disk per hypervisor (partially?) out of the ceph cluster and use it for the workers' /tmp, or even their whole disk.

but that requires some more storage on hypervisors we currently don't have

Don't the hypervisors also serve as OSDs? We could just get a disk per hypervisor (partially?) out of the ceph cluster and use it for the workers' /tmp, or even their whole disk.

yes, there are 5 hypervisors, 3 of them forming the ceph cluster (serving OSDs and acting as monitors).

Among the possibilities, adding storage to those to make it available as local storage for VMs is one of them. Whether to host the whole disk vs just /tmp is another discussion. I would prefer the whole disk, since we don't really need workers to be extra reliable or hot-movable on the Proxmox cluster.

And we could also use zfs-backed thin provisioning for the / of the workers to save storage space (and possibly help ensure consistency of deployed workers... not extra convinced of this latter point)
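Thin provisioning here would mean creating the workers' root disks as sparse zvols, something like this (the pool and volume names are hypothetical):

```
# -s makes the zvol sparse: space is only allocated as the guest writes
zfs create -s -V 32G rpool/data/vm-120-disk-0
```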

Ceph status went back to OK with these actions:

  • Clean up the crash history
    • to check the status:
      ceph crash ls
      ceph crash info <id>
    • to clean up:
      ceph crash archive-all
      ceph config set mon mon_warn_on_insecure_global_id_reclaim false
      ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false