Huge slowdowns on louvre since 2018-08-20
Closed, Migrated, Edits Locked

Assigned To

Authored By

	ftigeot
	Aug 30 2018, 11:06 AM

Description

The louvre hypervisor has seen tremendous slowdowns since 2018-08-20.
Some VMs completely froze for minutes at a time and had to be migrated to beaubourg.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T1173 Huge slowdowns on louvre since 2018-08-20
Migrated	gitlab-migration	T1069 fully host the web UI on Azure
Migrated	gitlab-migration	T1073 Remove temporary table usage for read-only queries
Migrated	gitlab-migration	T1079 Build a new azure node for the web ui replacement
Migrated	gitlab-migration	T1094 swh-indexer db replica on azure
Migrated	gitlab-migration	T1095 indexer: Remove temporary table usage for read-only queries
Migrated	gitlab-migration	T1113 Update streaming replication documentation
Migrated	gitlab-migration	T1116 Azure webapp performance tests
Migrated	gitlab-migration	T1127 dbreplica1 2018-06-30 event postmortem
Migrated	gitlab-migration	T1128 Remove the public IP address on dbreplica1
Migrated	gitlab-migration	T1166 Split up pergamon to smaller VMs
Migrated	gitlab-migration	T1168 Move away the Munin service from Pergamon
Migrated	gitlab-migration	T1176 Enable NUMA and PCID options on all VMs
Migrated	gitlab-migration	T1392 Add a new hypervisor
Migrated	gitlab-migration	T1503 Rename hypervisor3 to a museum name

Event Timeline

ftigeot triaged this task as Normal priority. Aug 30 2018, 11:06 AM

ftigeot created this task.

At least three important changes were made on 2018-08-20:

Uffizi has been converted from a QEMU VM to an LXC container
VM storage has been migrated to a Ceph backend
The Proxmox PVE Linux kernel has been updated

One of the impacted VMs is logstash0:

[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
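
The two symptoms in the excerpt above (SCSI command aborts and hung-task warnings) can be counted mechanically when comparing VMs. A minimal sketch in Python, assuming the kernel log is available as text lines in the format quoted above; the function names are hypothetical and not part of any tooling mentioned in this task:

```python
import re

# Patterns for the two kinds of kernel messages quoted in this task.
# They only assume the message wording seen in the excerpt.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
SCSI_ABORT = re.compile(
    r"sd (?P<addr>[\d:]+): \[(?P<dev>\w+)\] tag#(?P<tag>\d+) abort"
)


def classify(line):
    """Return (kind, details) for a matching kernel log line, else None."""
    m = HUNG_TASK.search(line)
    if m:
        return ("hung_task", {
            "task": m["name"],
            "pid": int(m["pid"]),
            "seconds": int(m["secs"]),
        })
    m = SCSI_ABORT.search(line)
    if m:
        return ("scsi_abort", {"device": m["dev"], "tag": int(m["tag"])})
    return None


def summarize(lines):
    """Count hung-task warnings and SCSI aborts in an iterable of lines."""
    counts = {"hung_task": 0, "scsi_abort": 0}
    for line in lines:
        hit = classify(line)
        if hit:
            counts[hit[0]] += 1
    return counts
```

Feeding it the saved output of dmesg -T from each VM gives a rough per-VM count of both symptoms, which makes it easier to see whether a migration actually helped.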

Some puppet manifests no longer behave as expected on this machine, possibly due to some form of disk corruption.

logstash0 migrated to beaubourg as well (complete shutdown and restart included).

The latest puppet agent --test run now successfully writes the password changes it had tried to apply time and again [1]

[1] https://forge.softwareheritage.org/P293$19-28

ardumont added subtasks: T1069: fully host the web UI on Azure, T1168: Move away the Munin service from Pergamon. Aug 31 2018, 4:21 PM

ftigeot added a subtask: T1166: Split up pergamon to smaller VMs. Aug 31 2018, 4:23 PM

ardumont removed a subtask: T1168: Move away the Munin service from Pergamon. Aug 31 2018, 4:26 PM

Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:

Most VMs have suffered from I/O wait issues since August 20, 2018
Ceph nodes are not limited by network bandwidth and only sustain ~= 120Mb/s of peak bandwidth
Ceph nodes suffer from I/O wait

The last point is not very surprising since their storage mostly consists of rotating disk drives.
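
For reference, the I/O wait share discussed above can be derived from two samples of /proc/stat on any Linux guest or Ceph node. The following Python sketch is an illustration under assumptions, not the method used to produce the graphs behind these observations; the function names are hypothetical. It relies only on the documented field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, ...):

```python
def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [e - s for s, e in zip(start, end)]
    # Only the first eight counters are summed: guest time is already
    # included in user/nice and would otherwise be counted twice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # iowait is the fifth counter (index 4) on the cpu line.
    return 100.0 * deltas[4] / total
```

In practice the two arguments would be the contents of /proc/stat read a few seconds apart on the machine under test.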

munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.

The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.

ftigeot changed the status of subtask T1166: Split up pergamon to smaller VMs from Open to Work in Progress. Sep 4 2018, 12:03 PM

zack raised the priority of this task from Normal to Unbreak Now!. Sep 4 2018, 3:02 PM

Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM still running on louvre that was experiencing visible I/O wait.

Moma storage migrated from Ceph to SSD storage on Beaubourg.
CPU and memory sizes were way overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.

ftigeot changed the status of subtask T1176: Enable NUMA and PCID options on all VMs from Open to Work in Progress. Sep 5 2018, 5:02 PM

ftigeot closed subtask T1176: Enable NUMA and PCID options on all VMs as Resolved. Sep 10 2018, 4:33 PM

PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre with a more recent and reliable machine for the hypervisor functions.

ardumont closed subtask T1069: fully host the web UI on Azure as Resolved. Oct 5 2018, 10:33 AM

zack lowered the priority of this task from Unbreak Now! to High. Oct 5 2018, 11:16 AM

ftigeot created subtask T1392: Add a new hypervisor. Nov 27 2018, 4:42 PM

ftigeot mentioned this in T1526: Install a new VPN endpoint at Rocquencourt. Feb 12 2019, 1:20 PM

ftigeot closed subtask T1392: Add a new hypervisor as Resolved. Apr 15 2019, 4:51 PM

"louvre" is not a hypervisor any longer.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T1176: Enable NUMA and PCID options on all VMs from Resolved to Migrated. Oct 19 2022, 5:54 PM

gitlab-migration changed the status of subtask T1392: Add a new hypervisor from Resolved to Migrated.
