Fri, Dec 7
> Borgbackup is unable to pull data from remote hosts to a central location.
I do not understand this assertion.
Tue, Dec 4
- I/Os per device
- Disk usage in percent
- Utilization per device: is this real? It could be useful to see whether a storage subsystem is overloaded (see the iostat sketch after this list)
- Disk usage in absolute, human-readable values; percentages are meaningless if we resize filesystems
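For the utilization point, something like iostat from the sysstat package already reports this per device; a minimal example:

  # Extended per-device statistics every 5 seconds; the %util column
  # shows how saturated each device is
  iostat -dx 5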
There is a huge difference between Borgbackup and Rsnapshot + Backuppc: Borgbackup is unable to pull data from remote hosts to a central location.
Its working model is based on Borgbackup running locally on each host and pushing data to a repository, which may be a local filesystem or a remote host reached over ssh.
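A minimal sketch of the push-style workflow this implies, run from each host being backed up (hostnames and paths here are hypothetical):

  # Each client pushes to a repository on the central host over ssh;
  # there is no mode where the central host pulls from the clients.
  borg init --encryption=repokey ssh://backup@central-host/srv/borg/$(hostname)
  borg create ssh://backup@central-host/srv/borg/$(hostname)::{now} /etc /srv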
New hypervisor hardware has been racked in our bay at Rocquencourt.
The machine's iDRAC management interface is accessible on the management network, under the name swh7-adm.inria.fr (details on the wiki).
Service email@example.com has been restarted on somerset and database replication is once again operating normally.
Postgres WAL files are being removed as expected on the master, slowly freeing disk space.
Mon, Dec 3
Dump files that were no longer useful were removed by seirl@, freeing some space on somerset:/srv/softwareheritage/postgres.
somerset:softwareheritage-indexer is the master database for dbreplica1:softwareheritage-indexer.
The pvmove operation was completed this morning.
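For reference, pvmove migrates allocated extents between physical volumes while the logical volumes stay online; a sketch with hypothetical device names:

  # Move all allocated extents from the old PV to the new one, online
  pvmove /dev/sdb1 /dev/sdc1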
Tue, Nov 27
Fri, Nov 23
At least some of the batteries for PERC H800 adapters use part number KR174 and/or M164C.
Some information leads me to believe they could also be used with PERC H700 adapters.
I did some experiments with Letsencrypt, but other things were more urgent during the September-October 2018 period, and in the end a wildcard Digicert certificate was used again instead.
Thu, Nov 22
Tue, Nov 20
Fri, Nov 16
Batteries for PERC H700 adapters have the part number U8735.
Thu, Nov 15
Orsay contains two LSI SAS 2108-based RAID adapters:
  05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
  22:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
Since the SSDs we have are 2.5", we need a special adapter disk tray, which Dell refuses to sell us.
Wed, Nov 14
Tue, Nov 13
In summary, only orsay has a failed BBU.
Given that it contains two identical RAID adapters with similarly aged BBUs, it could be worthwhile to replace both at once.
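For reference, BBU health on these LSI/PERC adapters can be queried with MegaCli; a sketch (the binary name and casing vary by packaging):

  # Report battery status for all adapters; check the "Battery State"
  # and "Battery Replacement required" fields
  megacli -AdpBbuCmd -GetBbuStatus -aALL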
List of physical machines at Rocquencourt: louvre, beaubourg, orsay, banco.
Nov 7 2018
Oct 23 2018
Oct 22 2018
The existing dar(1)-based system is not reliable.
Oct 19 2018
Icinga2 service monitoring changes pushed in commit rSPSITE76d7d90c51e0, based on the initial script linked by olasd@.
Oct 18 2018
Elasticsearch, Logstash and Kibana are now released together, and matching versions are guaranteed to be compatible. It makes sense to have a global Puppet constant defining which ELK stack version to use for packages.
A quick analysis of the 6.4.x family shows it brings significant bug fixes to the table.
One particularly interesting aspect is the general cluster reliability improvements when nodes leave or come back to the cluster.
Oct 17 2018
Upgrading the Elasticsearch cluster is a somewhat delicate operation, since nodes running old Elasticsearch versions can no longer store new data, but it is not really difficult to handle properly.
The biggest issue could be with Kibana / Elasticsearch interactions: some old Kibana versions are known to stop displaying dashboards when talking to newer Elasticsearch servers.
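For reference, the usual per-node rolling-upgrade sequence looks roughly like this (a sketch assuming the stock REST API on localhost:9200; package and service names are assumptions):

  # Stop shard reallocation so the cluster does not rebalance while the node is down
  curl -s -X PUT -H 'Content-Type: application/json' localhost:9200/_cluster/settings \
       -d '{"transient": {"cluster.routing.allocation.enable": "primaries"}}'
  systemctl stop elasticsearch
  apt-get install elasticsearch    # pulls in the new version
  systemctl start elasticsearch
  # Once the node has rejoined, re-enable allocation and wait for green
  curl -s -X PUT -H 'Content-Type: application/json' localhost:9200/_cluster/settings \
       -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'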
Oct 9 2018
Oct 8 2018
Correct me if I am wrong, but I do not believe the current Puppet code has the ability to handle more than one NS record per zone.
At the very least, I couldn't find an obvious way to add such a record.
All known SSL services now use updated certificates. Closing.
Oct 3 2018
www and www-dev.softwareheritage.org now use auto-generated Gandi certificates.
Updated certificate uploaded to the Puppet repository and internal hosts updated.
Oct 2 2018
PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre with a more recent and more reliable machine for the hypervisor functions.
Sep 25 2018
Existing CSR data submitted again today to the secret INRIA/Digicert URL.
Sep 21 2018
Right now, the internal.softwareheritage.org zone contains only a single NS record. This is most likely also the case for the various reverse zones.
There is no explicit notify directive in the master server configuration either.
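For reference (assuming BIND), adding a second NS amounts to one more record in the zone data, and an explicit notify or also-notify stanza in the master's configuration would cover the second point; a sketch with assumed server names (an ns0 VM does exist on Azure):

  ; internal.softwareheritage.org zone, hypothetical NS set
  @  IN  NS  pergamon.internal.softwareheritage.org.
  @  IN  NS  ns0.internal.softwareheritage.org.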
Sep 20 2018
Puppet configuration changed in rSPSITE62784f5462586adb44541b6382b41c1863f8938c.
Changes applied to Azure hosts.
Sep 18 2018
Task finished by @olasd.
Sep 17 2018
Sep 13 2018
Sep 12 2018
An Apache instance on pergamon is providing http and/or https services for the following hosts:
- pergamon:8140 (puppet)
Sep 10 2018
All VMs restarted with PCID and NUMA flags.
Sep 6 2018
ns0 VM created on Azure.
Sep 5 2018
All worker VMs on beaubourg restarted with the same settings.
All worker VMs on louvre restarted with NUMA and PCID flags.
They were resized from 16 GB to 12 GB of RAM and from 4 to 3 CPU cores in order to waste fewer hypervisor resources.
numastat output on orsay, for reference:
                  node0      node1      node2      node3
  numa_hit        154258622  106196783  173789251  218914560
  numa_miss       0          0          0          0
  numa_foreign    0          0          0          0
  interleave_hit  6864       6821       6872       6826
  local_node      154248017  106178817  173773345  218903410
  other_node      10605      17966      15906      11150
numastat output on beaubourg, for reference:
                  node0        node1
  numa_hit        24194141993  34805632693
  numa_miss       6528825760   313114704
  numa_foreign    313114704    6528825760
  interleave_hit  44068        43188
  local_node      24194499550  34805370119
  other_node      6528468203   313377278
numastat output on louvre, for reference:
                  node0        node1        node2        node3
  numa_hit        13497023257  14081211989  14852512306  17957276918
  numa_miss       8599494310   7372048640   2126471863   2510901890
  numa_foreign    4832232163   2059616197   4052656329   9664412014
  interleave_hit  21033        20998        21022        20991
  local_node      13497008321  14081146133  14852428849  17957202370
  other_node      8599509246   7372114496   2126555320   2510976437
Sep 4 2018
Moma storage migrated from Ceph to SSD storage on beaubourg.
CPU and memory sizes were way overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require less hypervisor resources.
Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM running on louvre that was still experiencing visible I/O wait.
Sep 3 2018
The inability of Ceph storage to sustain random I/O workloads doesn't explain all the issues on louvre: many VMs immediately experienced huge performance improvements when migrated to beaubourg while keeping the same storage backend.
munin0's disk image was moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.
Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:
- Most VMs suffer from I/O wait issues since August 20, 2018
- Ceph nodes are not network-bandwidth limited and only sustain around 120 Mb/s of peak bandwidth
- Ceph nodes suffer from I/O wait
The last point is not very surprising since their storage mostly consists of rotating disk drives.
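For reference, the last point can be cross-checked from the Ceph side (assuming admin access on a monitor node):

  # Per-OSD commit/apply latency in milliseconds; rotating drives
  # under load show consistently high values here
  ceph osd perf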