Thu, Apr 18
If we remove Munin before implementing the missing graph replacements, we will lack a comparison baseline and may fail to discover bogus data.
For example, the Prometheus disk throughput and IOPS values currently look suspiciously low.
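One way to cross-check those values, sketched under the assumption that node_exporter is the metrics source and that a Prometheus server is reachable at a hypothetical $PROM address; note that pre-0.16 node_exporter releases used different metric names (e.g. node_disk_bytes_read), which alone can produce bogus-looking numbers:

PROM=http://prometheus.internal:9090   # hypothetical address
# disk read throughput in bytes/s over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(node_disk_read_bytes_total[5m])'
# read IOPS over the same window
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(node_disk_reads_completed_total[5m])'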
Tue, Apr 16
Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.
Wasn't that what T1428 was about?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.
When I asked where to put such work-in-progress, you suggested the snippets repository.
Mon, Apr 15
Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.
The new hypervisor has been working without any particular issue since its installation.
Tue, Mar 26
BorgBackup added to the comparison.
Resolved on 2019-02-07.
Mar 20 2019
Already marked as done on 2018-12-19.
Mar 15 2019
Looks good to me.
Mar 5 2019
Attachment: comparison between Backuppc and Rsnapshot
Feb 26 2019
Comparison between Backuppc and Rsnapshot done; now adding Restic (https://restic.net/) to the mix.
Borgbackup not tested yet.
Feb 15 2019
Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc/network/interfaces configuration works as expected.
Feb 13 2019
Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:
auto vmbr0
iface vmbr0 inet static
    bridge_ports vlan440
    address 192.168.100.32
    netmask 255.255.255.0
    up ip route add 192.168.101.0/24 via 192.168.100.1
    up ip route add 192.168.200.0/21 via 192.168.100.1
    up ip rule add from 192.168.100.32 table private
    up ip route add default via 192.168.100.1 dev vmbr0 table private
    up ip route flush cache
    down ip route del default via 192.168.100.1 dev vmbr0 table private
    down ip rule del from 192.168.100.32 table private
    down ip route del 192.168.200.0/21 via 192.168.100.1
    down ip route del 192.168.101.0/24 via 192.168.100.1
    down ip route flush cache
Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre re-emits network packets and sends them to the destination host.
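Which path a given destination actually takes can be checked from beaubourg with ip route get (192.168.100.42 is an arbitrary example address):

ip route get 192.168.100.42 from 192.168.100.32
# with the "private" table rule above, this is expected to print
# "via 192.168.100.1", i.e. a detour through the gateway (presumably louvre)
# instead of the direct link-local path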
Feb 12 2019
Louvre had previously gone down more than once. Some of those events are documented in T1173.
Feb 7 2019
After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
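Mapping those dm-33/dm-35 kernel names back to device-mapper volume names (and hence to the affected VM disks) can be done with, for instance:

ls -l /dev/mapper/ | grep -E 'dm-3[35]$'   # the symlinks point at ../dm-NN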
The kind of error that was suddenly reported en masse when louvre stopped operating properly:
Buffer I/O error on device dm-41, logical block 10474329
A brand new virtual disk was created, skipping bad data blocks.
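The exact command used is not recorded in this log; a typical way to clone a failing disk while skipping unreadable blocks, with hypothetical device names, would be:

ddrescue -f /dev/mapper/old-vm-disk /dev/mapper/new-vm-disk /root/rescue.map
# or with plain dd, padding unreadable blocks with zeroes instead of aborting:
dd if=/dev/mapper/old-vm-disk of=/dev/mapper/new-vm-disk bs=64K conv=noerror,sync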
Feb 6 2019
The RAID volume was successfully rebuilt; closing even though the root cause of the initial error was not found.
dm-11 is a device sitting on top of dm-10, itself backed by /dev/sda:
A more complete list of I/O errors as reported by dmesg(1):
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb 5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb 5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb 5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb 5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb 5 09:39:51 2019] md: super_written gets error=10
[Tue Feb 5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb 5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
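For reference, the array and member state can be inspected with the standard tools:

cat /proc/mdstat
mdadm --detail /dev/md3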
Feb 5 2019
None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).
Forcing a rebuild by removing and re-adding the faulty device:
mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
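Rebuild progress can then be followed with something like:

watch -n 10 cat /proc/mdstat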
As in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.
A full read check of /dev/sda did not return any error:
# dd if=/dev/sda of=/dev/null bs=1M
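SMART data may reveal media errors that a plain read pass misses, assuming smartmontools is available:

smartctl -a /dev/sda
# if the SSD sits behind a RAID/HBA controller, a -d option such as
# "-d megaraid,N" may be needed (N being the controller-specific disk index)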
As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are themselves respectively handled by the physical /dev/sda and /dev/sdb SSDs:
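The lsblk output itself is not reproduced here; the device stack can be listed from one of the md members down to the physical disk with:

lsblk -s /dev/dm-11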
Feb 1 2019
Removing the previously used volume allowed the VM migration to complete.
Since the previous drive is no longer used, I decided to remove it:
# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
The previous drive is neither active nor opened:
The main VM disk is stored on a "vm-102-disk-1" volume (on Ceph).
There is an inactive LVM volume on "beaubourg-ssd" formerly associated with this VM; it was used as the virtual disk backend before the virtual disk was migrated to Ceph.
No mention of "beaubourg-ssd" is visible in the Proxmox virtual machine management interface.
All virtual disk backends are stored on Ceph.
Jan 31 2019
Trying to manually disable the logical volume in question fails with the same error message:
lvchange -a n /dev/ssd/vm-107-disk-0
  Logical volume ssd/vm-107-disk-0 is used by another device.
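The holder can usually be tracked down via sysfs; note that LVM doubles the dashes in mapper names, and dm-XX below is a placeholder for the kernel name found in the first step:

ls -l /dev/mapper/ssd-vm--107--disk--0   # note which dm-XX the symlink targets
ls /sys/block/dm-XX/holders/             # lists the device(s) keeping it open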
Jan 30 2019
Only keep 24 hours of logs, and keep rotating over the same file names.
There is no need to log all production queries on this server.
Reducing logged contents to queries taking more than one millisecond to execute:
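The exact values applied are not recorded here; assuming this is PostgreSQL's logging machinery, a sketch of the relevant postgresql.conf settings for both changes:

log_rotation_age = 60                  # minutes: rotate every hour
log_filename = 'postgresql-%H.log'     # 24 hourly files, names reused each day
log_truncate_on_rotation = on          # overwrite the previous same-named file
log_min_duration_statement = 1         # only log statements slower than 1 ms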
It turns out hypervisor3 is not the culprit we thought it was.
Removing T1392 from parent task list.
Jan 25 2019
After running some additional TCP iperf tests, it is obvious that beaubourg is the outlier.
Measured bandwidth (a sample iperf invocation is sketched after this list):
- from any 10G machine to any 10G machine (except beaubourg): > 9 Gb/s
- from any 10G machine to beaubourg: > 9 Gb/s
- from beaubourg to ceph-osd1, ceph-osd2 and hypervisor3: 600-800 Mb/s
- from beaubourg to ceph-mon1: 230 Kb/s
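A sample invocation of the kind used above, assuming plain iperf in TCP mode:

iperf -s                        # on the receiving host
iperf -c ceph-mon1 -P 4 -t 30   # on the sender: 4 parallel streams for 30 s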
Jan 22 2019
Since all these machines are connected to the same pair of switches, and these switches are managed by INRIA DSI-SESI, I have asked for their assistance in this ticket:
The /dev/md3 check completed successfully and did not report any error.
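For reference, such a check can be triggered and inspected through sysfs (this is essentially what Debian's monthly checkarray cron job does):

echo check > /sys/block/md3/md/sync_action
cat /sys/block/md3/md/mismatch_cnt   # expected to be 0 after a clean check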
Jan 21 2019
worker06.internal.softwareheritage.org is a VM running on louvre; its virtual disk is backed by /dev/dm-36 on the host.
Jan 16 2019
For the previous iperf TCP test and without tuning, we also have:
- an average transfer speed of 9,388 Mb/s between hypervisor3 and one of the 10G Ceph nodes, ceph-osd1.
- an average transfer speed of 8,364 Mb/s between beaubourg and ceph-osd1.
Jan 15 2019
Both beaubourg and hypervisor3 network interfaces have a 10Gb/s link layer connection.
Aggregated traffic from multiple iperf streams nevertheless never exceeds roughly 90% of a 1 Gb/s transfer speed.
Another thing worth noting: the vmbr0 interface, which carries the primary IP address, has an MTU of only 1500 bytes, while the network interfaces it is built on have a 9000-byte MTU.
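The MTUs can be compared, and the bridge raised to match its members if desired (member interface name hypothetical):

ip link show vmbr0 | head -n1
ip link show enp1s0f0 | head -n1   # one of the 9000-byte member interfaces
ip link set dev vmbr0 mtu 9000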
iperf tests show the following (a sample UDP invocation is sketched after this list):
- network speed never reaches 1 Gb/s, even between hosts which have 10 Gb/s network interfaces and are connected to the same switches
- 19% of UDP packets get lost at 1 Gb/s (less than 0.5% at 100 Mb/s)
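A sample UDP invocation for the loss measurement, assuming plain iperf and a placeholder target host:

iperf -s -u                                # on the receiving host
iperf -c <target-host> -u -b 1000M -t 30   # 1 Gb/s UDP stream; the report includes loss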
Jan 14 2019
Corosync warnings also routinely appear in the logs:
Jan 14 11:56:13 hypervisor3 corosync: notice [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync: [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync: [TOTEM ] Retransmit List: 282eba
The network interface hardware on hypervisor3 is relatively new:
i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
Dec 21 2018
Two 4TB SSDs added to banco yesterday, exported to Linux as JBODs.
Proxmox now installed on the machine, hypervisor3.softwareheritage.org.
Dec 11 2018
Another Perc H700 battery replacement product: http://www.hardware-attitude.com/fiche-1114-batterie-raid-pour-perc5-i-perc6-i---nu209.html
IMHO we should buy this one ASAP, if possible.
Dec 7 2018
Borgbackup is unable to pull data from remote hosts to a central location.
I do not understand this assertion.
Dec 4 2018
- I/Os per device
- Disk usage in percent
- Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded
- Disk usage in absolute human-readable values; percentages are meaningless if we resize filesystems (PromQL sketches for these below)
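Hedged PromQL sketches for the last two items, assuming node_exporter's standard metric names (PROM as in the earlier example):

# utilization per device: fraction of wall-clock time the device was busy
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(node_disk_io_time_seconds_total[5m])'
# disk usage in absolute bytes (used = size - available)
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=node_filesystem_size_bytes - node_filesystem_avail_bytes'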