
Huge slowdowns on louvre since 2018-08-20
Started, Work in Progress, High, Public

Description

The louvre hypervisor has seen tremendous slowdowns since 2018-08-20.
Some VMs completely froze for minutes at a time and had to be migrated to beaubourg.

Event Timeline

ftigeot created this task. Aug 30 2018, 11:06 AM
ftigeot triaged this task as Normal priority.

At least three important changes were made on 2018-08-20:

  • Uffizi has been converted from a QEMU VM to an LXC container
  • VM storage has been migrated to a Ceph backend
  • The Proxmox PVE Linux kernel has been updated

One of the impacted VMs is logstash0:

[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
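
A log like the excerpt above can be summarized mechanically. The following is a minimal sketch, not a tool used in this task: the function name `summarize_kernel_log` and both regular expressions are assumptions modeled only on the three message shapes quoted here, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the excerpt only (assumption): SCSI command aborts
# and the hung-task watchdog message.
ABORT_RE = re.compile(r"sd \S+ \[(?P<dev>\w+)\] tag#\d+ abort")
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_kernel_log(lines):
    """Count SCSI aborts per device and list tasks reported as blocked."""
    aborts = Counter()
    hung_tasks = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[match.group("dev")] += 1
            continue
        match = HUNG_RE.search(line)
        if match:
            hung_tasks.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return aborts, hung_tasks
```

Feeding it the output of `dmesg -T` from an affected VM would give a rough count of aborts per disk and the list of blocked tasks, which makes it easier to compare VMs before and after a migration.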

Some puppet manifests no longer behave as expected on this machine, possibly due to some form of disk corruption.

ftigeot added a comment. Edited Aug 31 2018, 10:45 AM

logstash0 migrated to beaubourg as well (complete shutdown and restart included).

The latest puppet agent --test run now successfully writes the password changes it had repeatedly tried and failed to apply [1].

[1] https://forge.softwareheritage.org/P293$19-28

ftigeot changed the task status from Open to Work in Progress. Sep 3 2018, 2:25 PM

Some of the slowdowns are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:

  • Most VMs suffer from I/O wait issues since August 20, 2018
  • Ceph nodes are not limited by network bandwidth: they only sustain about 120 Mb/s at peak
  • Ceph nodes suffer from I/O wait

The last point is not very surprising since their storage mostly consists of rotating disk drives.
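
For reference, I/O wait on a Linux guest or Ceph node can be estimated from two samples of the aggregate `cpu` line in /proc/stat. This is a minimal sketch and not the method used to produce the numbers in this task (those came from existing monitoring); the helper names `parse_cpu_line`, `iowait_percent` and `read_cpu_line` are hypothetical.

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `iowait_percent` gives a quick figure to compare a VM on Ceph storage against the same VM on local SSD.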

ftigeot added a comment. Edited Sep 3 2018, 2:26 PM

The munin0 disk image has been moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.

The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.

ftigeot changed the status of subtask T1166: Split up pergamon to smaller VMs from Open to Work in Progress. Sep 4 2018, 12:03 PM
zack raised the priority of this task from Normal to Unbreak Now!. Sep 4 2018, 3:02 PM

Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM still running on louvre that was experiencing visible I/O wait.

Moma storage migrated from Ceph to SSD storage on beaubourg.
Its CPU and memory allocations were way overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.

ftigeot changed the status of subtask T1176: Enable NUMA and PCID options on all VMs from Open to Work in Progress. Sep 5 2018, 5:02 PM

PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre with a more recent and reliable machine for the hypervisor functions.

zack lowered the priority of this task from Unbreak Now! to High. Oct 5 2018, 11:16 AM