Jun 28 2019
Jun 27 2019
Jun 25 2019
Jun 24 2019
The solution to this problem is to first identify the partition devices and then remove them:
dmsetup remove ssd-vm--100--disk--2p1
This behavior appears to be caused by partitions present on top of device-mapper devices.
These partitions are in turn used to create other dm devices, and these latter devices keep an open reference to the base one.
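To see this stacking and spot the partition devices holding the reference, something like the following can be used (a sketch, reusing the volume name from the command above):
dmsetup ls --tree
lsblk /dev/mapper/ssd-vm--100--disk--2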
Jun 13 2019
Jun 11 2019
Jun 6 2019
The reason for this behavior is that Debian uses dynamic UIDs for most of its system users.
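Debian allocates these UIDs dynamically from the system range declared in /etc/adduser.conf, which means the same system user can end up with a different UID on different hosts; a quick way to check (the user name here is purely illustrative):
grep SYSTEM_UID /etc/adduser.conf
id -u postgres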
Looks good for a first draft.
May 29 2019
May 28 2019
Looks good to me.
Always using the fqdn belvedere.internal.softwareheritage.org would be more consistent though ;-)
May 22 2019
May 16 2019
May 14 2019
We will use VMs running on the orsay.internal.softwareheritage.org hypervisor for now.
May 13 2019
Apr 30 2019
Grafanalib dashboards added to https://grafana.softwareheritage.org/ via the new provisioning mechanism of Grafana 5.x.
Fully automated provisioning is still a work-in-progress.
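For reference, provisioning in Grafana 5.x is driven by YAML files dropped under /etc/grafana/provisioning/; a minimal dashboard provider sketch (provider name and dashboard path are assumptions) looks like:
apiVersion: 1
providers:
  - name: 'generated-dashboards'
    type: file
    options:
      path: /var/lib/grafana/dashboards
Grafana then loads every JSON dashboard found under that path.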
Prometheus does not provide storage device statistics for Proxmox container-based hosts.
The data can be read from their parent machine dashboards though.
Some disk space usage statistics with ~= one month of snapshots
Apr 25 2019
Grafanalib-based dashboards do not require special handling; the NFS filesystem on orangerie:/srv/softwareheritage is shown by default, for example.
Apr 19 2019
Apr 18 2019
If we remove Munin before implementing missing graph replacements, we will lack a comparison base and possibly fail to discover bogus data.
Right now, the Prometheus disk throughput and IOPS values are suspiciously low, for example.
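One way to cross-check the raw counters behind those panels is to query the node_exporter metrics directly (host and port are placeholders; the metric names are those of node_exporter 0.16+):
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(node_disk_read_bytes_total[5m])'
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(node_disk_reads_completed_total[5m])'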
Apr 16 2019
Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.
Wasn't that what T1428 was about?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.
When I asked where to put such work-in-progress, you suggested the snippets repository.
Apr 15 2019
Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.
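For the record, each Grafanalib dashboard is a small Python module exposing a dashboard object, which is rendered to Grafana JSON with the generate-dashboard helper shipped with the library (file names here are hypothetical):
generate-dashboard -o node-overview.json node-overview.dashboard.py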
The new hypervisor has been working without any particular issue since its installation.
Apr 10 2019
Mar 26 2019
BorgBackup added to the comparison.
Resolved on 2019-02-07.
Resolved on 2019-02-07.
Mar 20 2019
Already marked as done on 2018-12-19.
Mar 15 2019
Looks good to me.
Mar 5 2019
Attachment: comparison between Backuppc and Rsnapshot
Mar 4 2019
Feb 27 2019
Feb 26 2019
Comparison between Backuppc and Rsnapshot done, now adding Restic - https://restic.net/ - to the mix.
Borgbackup not tested yet.
Feb 15 2019
Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc/network configuration works as expected.
Feb 13 2019
Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:
auto vmbr0
iface vmbr0 inet static
    bridge_ports vlan440
    address 192.168.100.32
    netmask 255.255.255.0
    up ip route add 192.168.101.0/24 via 192.168.100.1
    up ip route add 192.168.200.0/21 via 192.168.100.1
    up ip rule add from 192.168.100.32 table private
    up ip route add default via 192.168.100.1 dev vmbr0 table private
    up ip route flush cache
    down ip route del default via 192.168.100.1 dev vmbr0 table private
    down ip rule del from 192.168.100.32 table private
    down ip route del 192.168.200.0/21 via 192.168.100.1
    down ip route del 192.168.101.0/24 via 192.168.100.1
    down ip route flush cache
Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre re-emits network packets and sends them to the destination host.
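A quick way to confirm which path the kernel chooses for that traffic is to ask it directly (the source address is taken from the vmbr0 configuration above; 192.168.100.42 stands for any host on the private network):
ip route get 192.168.100.42 from 192.168.100.32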
Feb 12 2019
Louvre had previously gone down more than once. Some of these events are documented in T1173.
Feb 7 2019
After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
The kind of error that was suddenly reported in large numbers when louvre stopped operating properly:
Buffer I/O error on device dm-41, logical block 10474329
A brand new virtual disk was created, skipping bad data blocks:
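The exact command is not preserved in this log; a generic way to copy a failing volume while skipping unreadable blocks looks like this (device names are placeholders):
dd if=/dev/mapper/failing-disk of=/dev/mapper/new-disk bs=1M conv=noerror,sync
conv=noerror keeps reading after errors, and sync pads the unreadable blocks with zeroes so offsets stay aligned.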
Feb 6 2019
The RAID volume was successfully rebuilt; closing even though the root cause of the initial error was not found.
dm-11 is a device present on top of dm-10, itself backed by /dev/sda.
More complete list of I/O errors as reported by dmesg(1):
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb 5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb 5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb 5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb 5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb 5 09:39:51 2019] md: super_written gets error=10
[Tue Feb 5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb 5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
Feb 5 2019
None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).
Forcing a rebuild by removing and re-adding the faulty device:
mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
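The rebuild can then be followed with the usual md tooling:
cat /proc/mdstat
mdadm --detail /dev/md3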
Like in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.
A full read check of /dev/sda did not return any error:
# dd if=/dev/sda of=/dev/null bs=1M
As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are themselves respectively handled by the physical /dev/sda and /dev/sdb SSDs:
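For reference, this stacking can also be displayed from the physical device side (output omitted here):
lsblk /dev/sda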
Feb 1 2019
Removing the previously used volume allowed VM migration to complete.
Since the previous drive is no longer used, I decided to remove it:
# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
The previous drive is neither active nor opened:
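The check itself is not reproduced here; with standard LVM tooling it boils down to reading the attribute string, where the 'a' and 'o' flags mark an active and an open volume respectively:
lvs -o lv_name,lv_attr ssd/vm-102-disk-0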
The main VM disk is stored on a "vm-102-disk-1" volume (on Ceph).
There is an inactive LVM volume on "beaubourg-ssd" formerly associated with this VM; it was used as the virtual disk backend before the virtual disk device was migrated to Ceph.
No mention of "beaubourg-ssd" is visible in the Proxmox virtual machine management interface.
All virtual disk backends are stored on Ceph.