Trying to manually disable the logical volume in question fails with the same error message
lvchange -a n /dev/ssd/vm-107-disk-0
  Logical volume ssd/vm-107-disk-0 is used by another device.
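One way to identify the holder is to look at the LV's device-mapper node in sysfs; a minimal sketch, assuming the LV resolves to a /dev/dm-N node:

  # show what sits on top of the LV (holders appear as children)
  lsblk /dev/ssd/vm-107-disk-0
  # or list the holders of its dm node directly in sysfs
  ls /sys/block/"$(basename "$(readlink -f /dev/ssd/vm-107-disk-0)")"/holders/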
Only keep 24 hours of logs, and keep rotating over the same file names:
There is no need to log all production queries on this server.
Reducing logged contents to queries taking more than one millisecond to execute:
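A minimal postgresql.conf sketch achieving both (hour-based file names are reused every day, so at most 24 hours of logs are kept; assumes the logging collector is enabled):

  log_filename = 'postgresql-%H.log'    # one file per hour, names reused daily
  log_rotation_age = 1h
  log_truncate_on_rotation = on         # overwrite the old file when its name is reused
  log_min_duration_statement = 1        # only log statements taking more than 1 ms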
It turns out hypervisor3 is not the culprit we thought it was.
Removing T1392 from parent task list.
After running some additional TCP iperf tests, it is obvious that beaubourg is the outlier.
Measured bandwidth:
Since all these machines are connected to the same pair of switches, and these switches are managed by INRIA DSI-SESI, I have asked for their assistance in this ticket:
https://support.inria.fr/Ticket/Display.html?id=127011
The /dev/md3 check completed successfully and did not report any error.
worker06.internal.softwareheritage.org is a VM running on louvre; its virtual disk is backed by /dev/dm-36 on the host.
For the previous TCP iperf test, without any tuning, we also have:
The network interfaces on both beaubourg and hypervisor3 have a 10 Gb/s link-layer connection.
Aggregated traffic from multiple iperf streams nevertheless never exceeds roughly 90% of a 1 Gb/s transfer speed.
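For reference, a sketch of this kind of multi-stream test (four parallel TCP streams for 30 seconds; roles are illustrative):

  # on beaubourg, the receiving side
  iperf -s
  # on the sending side
  iperf -c beaubourg -P 4 -t 30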
Another thing worth noting: the vmbr0 interface, on which the primary IP address is located, has an MTU of only 1500 bytes, while the network interfaces it is built on have a 9000-byte MTU.
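The mismatch can be confirmed, and temporarily fixed, with ip(8); a sketch (the change is not persistent and would still need to be made permanent in the network configuration):

  ip link show vmbr0              # reports mtu 1500
  ip link set dev vmbr0 mtu 9000  # align the bridge with its underlying interfaces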
iperf tests show
Corosync warnings also routinely appear in the logs:
Jan 14 11:56:13 hypervisor3 corosync[5622]: notice [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]: [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]: [TOTEM ] Retransmit List: 282eba
The network interface hardware on hypervisor3 is relatively new:
i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
Two 4TB SSDs added to banco yesterday, exported to Linux as JBODs.
Proxmox now installed on the machine, hypervisor3.softwareheritage.org.
Another PERC H700 battery replacement product: http://www.hardware-attitude.com/fiche-1114-batterie-raid-pour-perc5-i-perc6-i---nu209.html
IMHO, we should buy this one ASAP if possible.
> Borgbackup is unable to pull data from remote hosts to a central location.
I do not understand this assertion.
There is a huge difference between Borgbackup and rsnapshot + BackupPC: Borgbackup is unable to pull data from remote hosts to a central location.
Its working model is based on Borgbackup running locally and storing data to a local filesystem.
New hypervisor hardware has been racked in our bay at Rocquencourt.
The machine's iDRAC management interface is accessible on the management network, under the name swh7-adm.inria.fr (details on the wiki).
Service postgresql@10-indexer.service has been restarted on somerset and database replication is once again operating normally.
PostgreSQL WAL files are being removed as expected on the master, slowly freeing disk space.
Some obsolete dump files were removed by seirl@, freeing some space on somerset:/srv/softwareheritage/postgres.
somerset:softwareheritage-indexer is the master database for dbreplica1:softwareheritage-indexer.
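A quick way to confirm replication is healthy again is to query pg_stat_replication on the master; a sketch (connection options for the 10/indexer instance omitted):

  # run on somerset against the postgresql@10-indexer instance
  psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"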
The pvmove operation was completed this morning.
At least some of the batteries for PERC H800 adapters use part number KR174 and/or M164C.
Some information leads me to believe they could also be used with PERC H700 adapters.
I did some experiments with Let's Encrypt, but other things were more urgent during the September-October 2018 period, and in the end a wildcard Digicert certificate was used again instead.
Batteries for PERC H700 adapters have the part number U8735 and/or NU209.
Orsay contains two LSI SAS 2108-based RAID adapters:
05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
22:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
Since the SSDs we have are 2.5", we need a special adapter disk tray, which Dell refuses to sell us.
Related: T1323
In summary, only orsay has a failed BBU.
Given that it contains two identical RAID adapters with similarly aged BBUs, it could be useful to replace both at once.
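For reference, the BBU state of both adapters can be checked with the MegaCli utility, assuming it is installed:

  megacli -AdpBbuCmd -GetBbuStatus -aALL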
List of physical machines at Rocquencourt: louvre, beaubourg, orsay, banco.
The existing dar(1)-based system is not reliable.
Icinga2 service monitoring changes pushed in commit rSPSITE76d7d90c51e0, based on the initial script linked by olasd@.
Elasticsearch, Logstash and Kibana are now released together, and matching versions are guaranteed to be compatible. It makes sense to have a global Puppet constant defining which ELK stack version to use for packages.
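A minimal sketch of what that could look like (the version value and package names are illustrative, not the actual manifest):

  # somewhere global, e.g. site.pp
  $elk_version = '6.4.3'

  package { ['elasticsearch', 'logstash', 'kibana']:
    ensure => $elk_version,
  }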
A quick analysis of the 6.4.x versions shows they bring significant bug fixes to the table.
One particularly interesting aspect is the set of general cluster reliability improvements when nodes leave or rejoin the cluster.
Upgrading the Elasticsearch cluster is a somewhat delicate operation, since nodes running old Elasticsearch versions can no longer store new data, but it is not really difficult to handle properly.
The biggest issue could be with Kibana / Elasticsearch interactions: some old Kibana versions are known to stop displaying dashboards when talking to newer Elasticsearch servers.
Correct me if I am wrong, but I do not believe the current Puppet code has the ability to handle more than one NS record per zone.
At the very least, I couldn't find an obvious way to add such a record.
All known SSL services now use updated certificates. Closing.
www and www-dev.softwareheritage.org now use auto-generated Gandi certificates.
Updated certificate uploaded to the Puppet repository and internal hosts updated.
PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre with a more recent and more reliable machine for the hypervisor functions.
Existing CSR data submitted again today to the secret INRIA/Digicert URL.
Right now, the internal.softwareheritage.org zone contains only a single NS record. This is most likely also the case for the various reverse zones.
There is no explicit notification directive in the master server configuration either.
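For reference, a sketch of what both fixes could look like in BIND (names and addresses are illustrative, not the actual configuration):

  ; zone file: declare a second name server
  @  IN  NS  ns1.internal.softwareheritage.org.
  @  IN  NS  ns2.internal.softwareheritage.org.

  // named.conf on the master: explicit notification of the secondary
  zone "internal.softwareheritage.org" {
      type master;
      file "/etc/bind/db.internal.softwareheritage.org";
      notify yes;
      also-notify { 192.0.2.53; };
  };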