Puppet configuration changed in rSPSITE62784f5462586adb44541b6382b41c1863f8938c.
Changes applied to Azure hosts.
Sep 20 2018
Sep 18 2018
Task finished by @olasd.
Sep 17 2018
Sep 13 2018
Sep 12 2018
An Apache instance on pergamon is providing HTTP and/or HTTPS services for the following hosts:
- annex.softwareheritage.org_non-ssl
- debian.softwareheritage.org
- docs.softwareheritage.org
- grafana.softwareheritage.org
- icinga.softwareheritage.org
- pergamon:8140 (puppet)
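For a quick cross-check of which virtual hosts this Apache instance actually serves (a generic sketch; the vhost names above come from the Puppet-managed configuration):
apache2ctl -S   # lists the loaded VirtualHost definitions and the config files they come from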
Sep 10 2018
All VMs restarted with PCID and NUMA flags.
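For reference, a minimal sketch of how such flags can be set on a Proxmox hypervisor before restarting a guest; the VM id and CPU type below are placeholders, not taken from our actual configuration:
# enable a NUMA topology and expose the PCID CPU flag to the guest
qm set 104 --numa 1 --cpu kvm64,flags=+pcid
# a full stop/start is needed for the new CPU flags to take effect
qm shutdown 104 && qm start 104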
Sep 6 2018
ns0 VM created on Azure.
Sep 5 2018
All worker VMs on beaubourg restarted with the same settings.
All worker VMs on louvre restarted with NUMA and PCID flags.
They were resized from 16 GB to 12 GB of RAM and from 4 to 3 CPU cores in order to waste fewer hypervisor resources.
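A sketch of the corresponding resize, again with a placeholder VM id (Proxmox expects the memory size in MB):
qm set 121 --memory 12288 --cores 3   # 12 GB of RAM, 3 cores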
numastat output on orsay, for reference:
                    node0        node1        node2        node3
numa_hit        154258622    106196783    173789251    218914560
numa_miss               0            0            0            0
numa_foreign            0            0            0            0
interleave_hit       6864         6821         6872         6826
local_node      154248017    106178817    173773345    218903410
other_node          10605        17966        15906        11150
numastat output on beaubourg, for reference:
                      node0          node1
numa_hit        24194141993    34805632693
numa_miss        6528825760      313114704
numa_foreign      313114704     6528825760
interleave_hit        44068          43188
local_node      24194499550    34805370119
other_node       6528468203      313377278
numastat output on louvre, for reference:
                      node0          node1          node2          node3
numa_hit        13497023257    14081211989    14852512306    17957276918
numa_miss        8599494310     7372048640     2126471863     2510901890
numa_foreign     4832232163     2059616197     4052656329     9664412014
interleave_hit        21033          20998          21022          20991
local_node      13497008321    14081146133    14852428849    17957202370
other_node       8599509246     7372114496     2126555320     2510976437
Sep 4 2018
Moma storage migrated from Ceph to SSD storage on Beaubourg.
CPU and memory sizing was way overkill and has been cut in half.
If this VM has to be migrated to louvre again, it will definitely require less hypervisor resources.
Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM running on louvre that was still experiencing visible I/O wait.
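Both of these migrations boil down to moving a VM disk image between Proxmox storages; a minimal sketch, with placeholder VM id, disk slot and storage name:
# copy the disk image from the Ceph pool to local SSD storage and delete the source afterwards
qm move_disk 106 virtio0 local-ssd --delete 1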
Sep 3 2018
The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experienced huge performance improvements when migrated to beaubourg while keeping the same storage backend.
munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.
Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:
- Most VMs have suffered from I/O wait issues since August 20, 2018
- Ceph nodes are not network-bandwidth limited, only sustaining a peak of ~120 Mb/s
- Ceph nodes suffer from I/O wait
The last point is not very surprising since their storage mostly consists of rotating disk drives.
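For reference, the kind of checks used to observe this (generic commands, assuming sysstat is installed on the Ceph nodes):
iostat -x 5     # %iowait in the CPU summary, per-device saturation in %util
ceph osd perf   # per-OSD commit/apply latency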
Aug 31 2018
All documents reindexed.
Some legacy logstash-* indexes containing tens of thousands of invalid documents have been kept for further analysis.
Some existing Puppet manifests were causing important Apache modules to be removed from former Munin machines no longer providing Munin services.
This briefly caused all http/https services on Pergamon to fail but has since been fixed in rSPSITEa1ecf58d5d157d94bcb019a432e221da9a798f34.
- munin0 VM created
- munin service added to munin0, with munstrap alternative template
- legacy pergamon data copied to munin0
- Puppet manifests changed to use munin0 instead of pergamon for the Munin service
- Many useless statistics removed in order to reduce I/O load on the new VM
- All of that done while pairing with ardumont@
logstash0 migrated to beaubourg as well (complete shutdown and restart included).
One of the impacted VMs is logstash0:
[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
Aug 30 2018
At least three important changes were made on 2018-08-20:
- Uffizi has been morphed from a Qemu VM to an LXC container
- VM storage has been migrated to a Ceph backend
- The Proxmox PVE Linux kernel has been updated
Aug 29 2018
Aug 28 2018
Aug 27 2018
Aug 24 2018
The current situation looks good enough for now, closing.
Aug 23 2018
Root filesystem successfully resized:
lvextend -v -L+20G /dev/vg-louvre/louvre-root
resize2fs /dev/mapper/vg--louvre-louvre--root
Aug 20 2018
logstash appears to be a good substitute for the failing Elasticsearch reindex API.
It naturally skips documents whose contents appear to be invalid when sent to an Elasticsearch cluster.
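A minimal sketch of such a logstash reindexing pipeline; the hosts and index patterns below are placeholders, not our actual configuration:
input {
  elasticsearch {
    hosts   => ["localhost:9200"]
    index   => "logstash-2018.08.*"   # source indexes to copy
    docinfo => true                   # keep the original _index/_id in @metadata
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "reindexed-%{[@metadata][_index]}"
    document_id => "%{[@metadata][_id]}"
  }
}
Documents rejected by the target cluster (e.g. because of mapping conflicts) are logged and dropped instead of aborting the whole run, which is what makes this approach skip the invalid documents.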
Aug 3 2018
A new Kibana-dedicated VM, kibana0.internal.softwareheritage.org, has also been created.
Closing.
Given Uffizi already uses 64 GB of RAM (more than some physical machines), this should be a no-brainer.
I am not sure if this would really improve I/O performance, though.
Aug 2 2018
Aug 1 2018
Logstash service moved to a new VM, logstash0.internal.softwareheritage.org.
Jul 31 2018
No new problems noticed for about a month, closing.
With the new VM and its logstash-6.3.2 service, the contents of /var/log/apache2/archive.softwareheritage.org_non-ssl_access.log are now successfully stored into Elasticsearch indexes.
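The actual pipeline is managed through Puppet and not reproduced here; a generic sketch of this kind of setup would look like:
input {
  file {
    path => "/var/log/apache2/archive.softwareheritage.org_non-ssl_access.log"
  }
}
filter {
  # parse the combined Apache log format into structured fields
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }   # placeholder host
}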
A 6.3.x version of logstash is now running on logstash0.internal.softwareheritage.org, a brand new VM created on Proxmox/louvre.