All worker VMs on louvre restarted with NUMA and PCID flags.
They were resized from 16 GB to 12 GB of RAM and from 4 to 3 CPU cores in order to waste fewer hypervisor resources.

Sep 5 2018, 4:14 PM · System administration

ftigeot renamed T1176: Enable NUMA and PCID options on all VMs from Enable NUMA option on all VMs to Enable NUMA and PCID options on all VMs.

Sep 5 2018, 2:39 PM · System administration

ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on orsay, for reference:

                           node0           node1           node2           node3
numa_hit               154258622       106196783       173789251       218914560
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              6864            6821            6872            6826
local_node             154248017       106178817       173773345       218903410
other_node                 10605           17966           15906           11150

Sep 5 2018, 11:49 AM · System administration

ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on beaubourg, for reference:

                           node0           node1
numa_hit             24194141993     34805632693
numa_miss             6528825760       313114704
numa_foreign           313114704      6528825760
interleave_hit             44068           43188
local_node           24194499550     34805370119
other_node            6528468203       313377278

Sep 5 2018, 11:42 AM · System administration

ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on louvre, for reference:

                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
numa_foreign          4832232163      2059616197      4052656329      9664412014
interleave_hit             21033           20998           21022           20991
local_node           13497008321     14081146133     14852428849     17957202370
other_node            8599509246      7372114496      2126555320      2510976437
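The counters above are easier to compare as ratios. The following is a minimal sketch, not part of the original comment: it assumes the default `numastat` layout shown here, and the helper names `parse_numastat` and `miss_ratios` are hypothetical. For each node it computes the share of pages allocated there that were originally intended for another node, i.e. numa_miss / (numa_hit + numa_miss).

```python
import re

# Sample taken from the louvre output above (first two counters only).
SAMPLE = """\
                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
"""


def parse_numastat(text):
    """Return {counter_name: [value per node]} from default numastat output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (node0 node1 ...).
        if len(fields) < 2 or not re.fullmatch(r"\d+", fields[1]):
            continue
        counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def miss_ratios(counters):
    """Per node: fraction of local allocations that were meant for another node."""
    return [
        miss / (hit + miss) if hit + miss else 0.0
        for hit, miss in zip(counters["numa_hit"], counters["numa_miss"])
    ]


if __name__ == "__main__":
    for node, ratio in enumerate(miss_ratios(parse_numastat(SAMPLE))):
        print(f"node{node}: {ratio:.1%} of allocations were intended for another node")
```

Running the same functions on the orsay output would yield 0 for every node, since its numa_miss counters are all zero.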

Sep 5 2018, 11:38 AM · System administration

ftigeot triaged T1176: Enable NUMA and PCID options on all VMs as Unbreak Now! priority.

Sep 5 2018, 11:01 AM · System administration

Sep 4 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Moma storage migrated from Ceph to SSD storage on Beaubourg.
CPU and memory sizes were way overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.

Sep 4 2018, 4:10 PM · System administration

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM running on louvre that was still experiencing visible I/O wait.

Sep 4 2018, 3:39 PM · System administration

ftigeot committed rSPSITE82aa877ec064: dbreplica1.euwest.azure: Make Postgres listen on default port (authored by ftigeot).

dbreplica1.euwest.azure: Make Postgres listen on default port

Sep 4 2018, 2:11 PM

ftigeot changed the status of T1166: Split up pergamon to smaller VMs from Open to Work in Progress.

Sep 4 2018, 12:03 PM · System administration

ftigeot changed the status of T1166: Split up pergamon to smaller VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, from Open to Work in Progress.

Sep 4 2018, 12:03 PM · System administration

Sep 3 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.

Sep 3 2018, 2:29 PM · System administration

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.

Sep 3 2018, 2:26 PM · System administration

ftigeot changed the status of T1173: Huge slowdowns on louvre since 2018-08-20 from Open to Work in Progress.

Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:

Most VMs suffer from I/O wait issues since August 20, 2018
Ceph nodes are not network-bandwidth limited and only sustain ~= 120Mb/s of peak bandwidth
Ceph nodes suffer from I/O wait

The last point is not very surprising since their storage mostly consists of rotating disk drives.

Sep 3 2018, 2:25 PM · System administration

Aug 31 2018

ftigeot closed T1000: Reindex old data on banco to put it into swh_worker indexes as Resolved.

All documents reindexed.
Some legacy logstash-* indexes containing tens of thousands of invalid documents have been kept for further analysis.

Aug 31 2018, 5:37 PM · System administration

ftigeot closed T1000: Reindex old data on banco to put it into swh_worker indexes, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, as Resolved.

Aug 31 2018, 5:37 PM · System administration (Elasticsearch consolidation (W24/2018))

ftigeot added a parent task for T1166: Split up pergamon to smaller VMs: T1173: Huge slowdowns on louvre since 2018-08-20.

Aug 31 2018, 4:23 PM · System administration

ftigeot added a subtask for T1173: Huge slowdowns on louvre since 2018-08-20: T1166: Split up pergamon to smaller VMs.

Aug 31 2018, 4:23 PM · System administration

ftigeot added a comment to T1168: Move away the Munin service from Pergamon.

Some existing Puppet manifests were causing important Apache modules to be removed from former Munin machines no longer providing Munin services.
This briefly caused all http/https services on Pergamon to fail but has since been fixed in rSPSITEa1ecf58d5d157d94bcb019a432e221da9a798f34.

Aug 31 2018, 11:00 AM · System administration

ftigeot closed T1168: Move away the Munin service from Pergamon as Resolved.

munin0 VM created
munin service added to munin0, with munstrap alternative template
legacy pergamon data copied to munin0
Puppet manifests changed to use munin0 instead of pergamon for the Munin service
Many useless statistics removed in order to reduce I/O load on the new VM
All of that done in pairing with ardumont@

Aug 31 2018, 10:57 AM · System administration

ftigeot closed T1168: Move away the Munin service from Pergamon, a subtask of T1166: Split up pergamon to smaller VMs, as Resolved.

Aug 31 2018, 10:57 AM · System administration

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

logstash0 migrated to beaubourg as well (complete shutdown and restart included).

Aug 31 2018, 10:45 AM · System administration

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

One of the impacted VMs is logstash0:

[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
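Excerpts like the one above can be summarized mechanically when checking several VMs. The following is a rough sketch, not the procedure used in this task: the function name `summarize` and the two regular expressions are assumptions modelled only on the three message shapes quoted here, and other kernel versions may word these messages differently.

```python
import re

# Lines taken from the logstash0 excerpt above.
SAMPLE = """\
[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
"""

# SCSI command aborts, keyed by block device name.
ABORT = re.compile(r"\[(?P<dev>sd[a-z]+)\] tag#\d+ abort")
# Hung-task warnings emitted when a task stays blocked too long.
HUNG = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize(log_text):
    """Return ({device: abort_count}, [(task_name, pid), ...]) for a kernel log."""
    aborts = {}
    hung = []
    for line in log_text.splitlines():
        match = ABORT.search(line)
        if match:
            aborts[match["dev"]] = aborts.get(match["dev"], 0) + 1
            continue
        match = HUNG.search(line)
        if match:
            hung.append((match["task"], int(match["pid"])))
    return aborts, hung


if __name__ == "__main__":
    abort_counts, hung_tasks = summarize(SAMPLE)
    print("aborts per device:", abort_counts)
    print("hung tasks:", hung_tasks)
```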

Aug 31 2018, 10:39 AM · System administration

Aug 30 2018

ftigeot committed rSPSITEa1ecf58d5d15: Pergamon role: :apache::mod::rewrite MUST be included (authored by ftigeot).

Pergamon role: :apache::mod::rewrite MUST be included

Aug 30 2018, 4:37 PM

ftigeot committed rSPSITE467881fbfadd: Pergamon role: remove munin-master service (authored by ftigeot).

Pergamon role: remove munin-master service

Aug 30 2018, 3:59 PM

ftigeot committed rSPSITE2bfec1840036: Munin node: Stop allowing Pergamon connections (authored by ftigeot).

Munin node: Stop allowing Pergamon connections

Aug 30 2018, 3:59 PM

ftigeot committed rSPSITE6ff03f261a21: Munin master: switch to munin0.internal (authored by ftigeot).

Munin master: switch to munin0.internal

Aug 30 2018, 3:42 PM

ftigeot committed rSPSITEc0edc18fa018: Munin node: remove more borderline useless data (authored by ftigeot).

Munin node: remove more borderline useless data

Aug 30 2018, 2:07 PM

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

At least three important changes were made on 2018-08-20:

Uffizi has been morphed from a QEMU VM to an LXC container
VM storage has been migrated to a Ceph backend
The Proxmox PVE Linux kernel has been updated

Aug 30 2018, 11:08 AM · System administration

ftigeot triaged T1173: Huge slowdowns on louvre since 2018-08-20 as Normal priority.

Aug 30 2018, 11:06 AM · System administration

Aug 29 2018

ftigeot committed rSPSITE03d62de868ee: Munin node: remove more useless statistics by default (authored by ftigeot).

Munin node: remove more useless statistics by default

Aug 29 2018, 5:49 PM

ftigeot committed rSPSITEda7f7d8afd31: Azure euwest: some DNS forwarders do not support DNSSEC (authored by ftigeot).

Azure euwest: some DNS forwarders do not support DNSSEC

Aug 29 2018, 1:16 PM

Aug 28 2018

ftigeot committed rSPSITEecd79a590cdd: Git loader: decrease concurrency to 2 (authored by ftigeot).

Git loader: decrease concurrency to 2

Aug 28 2018, 3:49 PM

ftigeot committed rSPSITEcf2bd41d01c4: Deposit backend workers: increase concurrency to 8 (authored by ftigeot).

Deposit backend workers: increase concurrency to 8

Aug 28 2018, 3:44 PM

ftigeot committed rSPSITEbd51a2fecd05: Scheduler backend workers: increase concurrency to 16 (authored by ftigeot).

Scheduler backend workers: increase concurrency to 16

Aug 28 2018, 3:38 PM

ftigeot committed rSPSITEf78b9e1d1b64: Munin master: Stop auto-generating CNAME records (authored by ftigeot).

Munin master: Stop auto-generating CNAME records

Aug 28 2018, 2:56 PM

ftigeot committed rSPSITE782fabad1b99: Munin: Allow munin0 connections to nodes (authored by ftigeot).

Munin: Allow munin0 connections to nodes

Aug 28 2018, 2:38 PM

ftigeot committed rSPSITE3b5b70f73d6e: swh_munin_master role: No need for public IP address (authored by ftigeot).

swh_munin_master role: No need for public IP address

Aug 28 2018, 2:27 PM

ftigeot committed rSPSITE262f6e29e433: Fix munin_master role (authored by ftigeot).

Fix munin_master role

Aug 28 2018, 2:19 PM

ftigeot committed rSPSITE586486b4d057: Add a munin_master role (authored by ftigeot).

Add a munin_master role

Aug 28 2018, 2:15 PM

ftigeot committed rSPSITE680470cbc810: data/defaults.yaml: Add munin0.internal.softwareheritge.org (authored by ftigeot).

data/defaults.yaml: Add munin0.internal.softwareheritge.org

Aug 28 2018, 1:14 PM

Aug 27 2018

ftigeot triaged T1168: Move away the Munin service from Pergamon as Normal priority.

Aug 27 2018, 2:29 PM · System administration

Aug 24 2018

ftigeot renamed T1166: Split up pergamon to smaller VMs from Split up pergamon in smaller VMs to Split up pergamon to smaller VMs.

Aug 24 2018, 3:01 PM · System administration

ftigeot triaged T1166: Split up pergamon to smaller VMs as Normal priority.

Aug 24 2018, 2:48 PM · System administration

ftigeot closed T1165: Fix lack of disk space on louvre:/ as Resolved.

The current situation looks good enough for now, closing.

Aug 24 2018, 2:37 PM · System administration

ftigeot closed T1165: Fix lack of disk space on louvre:/, a subtask of T1164: Dar backups fill up disk space on client machines, as Resolved.

Aug 24 2018, 2:37 PM · System administration

Aug 23 2018

ftigeot changed the status of T1165: Fix lack of disk space on louvre:/ from Open to Work in Progress.

Root filesystem successfully resized:

lvextend -v -L+20G /dev/vg-louvre/louvre-root
resize2fs /dev/mapper/vg--louvre-louvre--root
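The two commands above name the same logical volume in two ways: `/dev/vg-louvre/louvre-root` and `/dev/mapper/vg--louvre-louvre--root`. Device-mapper joins the volume group and logical volume names with a single hyphen and doubles any hyphen already present in either name. A small sketch of that mapping follows; the function name `dm_mapper_path` is hypothetical and used only for illustration.

```python
def dm_mapper_path(vg, lv):
    """Build the /dev/mapper path for logical volume `lv` in volume group `vg`.

    Hyphens inside either name are doubled so that the single hyphen
    separating the two names stays unambiguous.
    """
    return "/dev/mapper/{}-{}".format(vg.replace("-", "--"), lv.replace("-", "--"))


if __name__ == "__main__":
    print(dm_mapper_path("vg-louvre", "louvre-root"))
```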

Aug 23 2018, 2:18 PM · System administration

ftigeot changed the status of T1165: Fix lack of disk space on louvre:/, a subtask of T1164: Dar backups fill up disk space on client machines, from Open to Work in Progress.

Aug 23 2018, 2:18 PM · System administration

ftigeot triaged T1165: Fix lack of disk space on louvre:/ as High priority.

Aug 23 2018, 1:49 PM · System administration

Aug 20 2018

ftigeot triaged T1164: Dar backups fill up disk space on client machines as High priority.

Aug 20 2018, 11:36 AM · System administration

ftigeot added a comment to T1000: Reindex old data on banco to put it into swh_worker indexes.

logstash appears to be a good substitute for the failing reindex Elasticsearch API.
It naturally skips documents whose contents appear to be invalid when sent to an Elasticsearch cluster.

Aug 20 2018, 11:36 AM · System administration

Aug 3 2018

ftigeot closed T1126: Move away non-gunicorn services from banco as Resolved.

A new Kibana-dedicated VM, kibana0.internal.softwareheritage.org, has also been created.
Closing.

Aug 3 2018, 5:42 PM · System administration

ftigeot added a comment to T1048: Clean striped object storages from objects they should not be containing.

Given Uffizi already uses 64GB of RAM (more than some physical machines), this should be a no-brainer.
I am not sure if this would really improve I/O performance, though.

Aug 3 2018, 3:03 PM · Object storage

Aug 2 2018

ftigeot committed rSPSITE154ffdabc829: kibana role: Add configuration file (authored by ftigeot).

kibana role: Add configuration file

Aug 2 2018, 4:06 PM

ftigeot committed rSPSITEf483b3c33c2b: groups: Add anlambert to swhwebapp (authored by ftigeot).

groups: Add anlambert to swhwebapp

Aug 2 2018, 2:20 PM

Aug 1 2018

ftigeot committed rSPSITE84fa04e727e0: kibana role: trivial bugfix (authored by ftigeot).

kibana role: trivial bugfix

Aug 1 2018, 2:32 PM

ftigeot committed rSPSITEcbfbe17d06ac: Add a kibana role (authored by ftigeot).

Add a kibana role

Aug 1 2018, 2:24 PM

ftigeot committed rSPSITEb09ce754d02e: dns: Add kibana0.internal.softwareheritage.org (authored by ftigeot).

dns: Add kibana0.internal.softwareheritage.org

Aug 1 2018, 1:30 PM

ftigeot changed the status of T1126: Move away non-gunicorn services from banco from Open to Work in Progress.

Logstash service moved to a new VM, logstash0.internal.softwareheritage.org.

Aug 1 2018, 11:13 AM · System administration

Jul 31 2018

ftigeot committed rSPSITEa8b60051ddc6: Apply the swh_lsi_storage_adapter role to all relevant hosts (authored by ftigeot).

Apply the swh_lsi_storage_adapter role to all relevant hosts

Jul 31 2018, 4:29 PM

ftigeot added a parent task for T1028: deposit: Push logs to elasticsearch: T791: Ship more logs to logstash/elasticsearch.

Jul 31 2018, 4:19 PM · SWORD deposit

ftigeot added a subtask for T791: Ship more logs to logstash/elasticsearch: T1028: deposit: Push logs to elasticsearch.

Jul 31 2018, 4:19 PM · System administration

ftigeot added a parent task for T1005: webapp: Push logs to elasticsearch cluster: T791: Ship more logs to logstash/elasticsearch.

Jul 31 2018, 4:18 PM · System administration, Web app

ftigeot added a subtask for T791: Ship more logs to logstash/elasticsearch: T1005: webapp: Push logs to elasticsearch cluster.

Jul 31 2018, 4:18 PM · System administration

ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T1028: deposit: Push logs to elasticsearch, from Open to Work in Progress.

Jul 31 2018, 4:13 PM · SWORD deposit

ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster from Open to Work in Progress.

Jul 31 2018, 4:13 PM · System administration (Elasticsearch consolidation (W24/2018))

ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T1005: webapp: Push logs to elasticsearch cluster, from Open to Work in Progress.

Jul 31 2018, 4:13 PM · System administration, Web app

ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T986: Scheduler: Automate completed oneshot or disabled recurring tasks archival, from Open to Work in Progress.

Jul 31 2018, 4:13 PM · Scheduling utilities

ftigeot closed T1127: dbreplica1 2018-06-30 event postmortem as Resolved.

No new problems noticed for about a month, closing.

Jul 31 2018, 4:03 PM · Web app, System administration

ftigeot closed T1127: dbreplica1 2018-06-30 event postmortem, a subtask of T1069: fully host the web UI on Azure, as Resolved.

Jul 31 2018, 4:03 PM · Web app, System administration

ftigeot added a comment to T791: Ship more logs to logstash/elasticsearch.

With the new VM and its logstash-6.3.2 service,

/var/log/apache2/archive.softwareheritage.org_non-ssl_access.log

contents are now successfully stored into Elasticsearch indexes.

Jul 31 2018, 4:02 PM · System administration

ftigeot closed T1160: Create a dedicated logstash VM as Resolved.

A 6.3.x version of logstash is now running on logstash0.internal.softwareheritage.org, a brand new VM created on Proxmox/louvre.

Jul 31 2018, 2:38 PM · System administration

ftigeot closed T1160: Create a dedicated logstash VM, a subtask of T791: Ship more logs to logstash/elasticsearch, as Resolved.

Jul 31 2018, 2:38 PM · System administration

ftigeot closed T1160: Create a dedicated logstash VM, a subtask of T1126: Move away non-gunicorn services from banco, as Resolved.

Jul 31 2018, 2:38 PM · System administration

ftigeot committed rSPSITE36ecf9cf4631: Elastic packages: Upgrade to version 6.3.2 (authored by ftigeot).

Elastic packages: Upgrade to version 6.3.2

Jul 31 2018, 2:05 PM

ftigeot committed rSPSITEdd33071544f5: dns: Make logstash0 the new default logstash instance (authored by ftigeot).

dns: Make logstash0 the new default logstash instance

Jul 31 2018, 11:55 AM

ftigeot committed rSPSITE4f76c0fb1d21: logstash: Add configuration files, enable service (authored by ftigeot).

logstash: Add configuration files, enable service

Jul 31 2018, 11:14 AM

Jul 30 2018

ftigeot committed rSPSITE61cdc0325944: Add a logstash role (authored by ftigeot).

Add a logstash role

Jul 30 2018, 11:10 AM

Jul 27 2018

ftigeot changed the status of T1160: Create a dedicated logstash VM, a subtask of T791: Ship more logs to logstash/elasticsearch, from Open to Work in Progress.

Jul 27 2018, 4:58 PM · System administration

ftigeot changed the status of T1160: Create a dedicated logstash VM from Open to Work in Progress.

Jul 27 2018, 4:58 PM · System administration
