general system administration tasks, not specific to any product
Details
Yesterday
Grafana PostgreSQL overview: https://grafana.softwareheritage.org/d/PEKz-Ygiz/postgresql-server-overview
Since the current status of munin's pg monitoring is inconsistent (e.g. azure's dbreplica0 has no pg curves), let's get rid of all of munin's pg monitors so there is less confusion; it is also one step towards T1356.
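For reference, a minimal sketch of what that removal amounts to on a monitored host (the real change would normally go through configuration management rather than be done by hand):
# munin plugins are enabled as symlinks under /etc/munin/plugins;
# dropping the postgres_* ones and restarting munin-node stops the pg graphs.
rm /etc/munin/plugins/postgres_*
systemctl restart munin-node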
Wed, Feb 20
After some more stewing and discussion with @zack, we'll be going for the "upgrade to pg 11 and restart replication from scratch" route.
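With pglogical, "restarting replication from scratch" boils down to dropping and recreating the subscription on the replica once it runs pg 11. This is only a hedged sketch; the subscription name and provider DSN below are placeholders, not the actual values:
# Run on the subscriber (somerset) after the upgrade to PostgreSQL 11.
psql -d softwareheritage <<'SQL'
SELECT pglogical.drop_subscription('softwareheritage_subscription', true);
SELECT pglogical.create_subscription(
  subscription_name := 'softwareheritage_subscription',
  provider_dsn := 'host=prado dbname=softwareheritage',
  synchronize_data := true
);
SQL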
Tue, Feb 19
After reading some mailing list posts discussing the error message, and a discussion with @ftigeot:
Logs on the primary:
2019-02-19 14:27:44 UTC [15973]: [1-1] user=postgres,db=softwareheritage LOG: starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2019-02-19 14:27:44 UTC [15973]: [2-1] user=postgres,db=softwareheritage DETAIL: streaming transactions committing after 18607/9189A578, reading WAL from 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [3-1] user=postgres,db=softwareheritage ERROR: record with incorrect prev-link 5403A/2E2F1829 at 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [4-1] user=postgres,db=softwareheritage LOG: could not receive data from client: Connection reset by peer
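For context, the state of the logical replication slot on the primary can be inspected through the standard catalog view (not necessarily the exact query run here):
psql -d softwareheritage -c \
  "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"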
I'm slightly reordering what you wrote here, sorry!
Thanks @olasd for this piece of information.
Mon, Feb 18
Thanks for recording this task; I'll use the opportunity to document the reasoning behind the current internal networking setup, to try and make sure nothing is forgotten before migrating it.
Sat, Feb 16
Heads up.
Fri, Feb 15
Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc network configuration works as expected.
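As a rough sketch, the kind of check that downtime window would allow (interface and routing table names taken from the configuration below):
# Bounce the bridge so the up/down hooks in /etc/network/interfaces are exercised...
ifdown vmbr0 && ifup vmbr0
# ...then confirm the policy routing came back as expected.
ip rule show
ip route show table private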
Wed, Feb 13
Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:
auto vmbr0
iface vmbr0 inet static
    bridge_ports vlan440
    address 192.168.100.32
    netmask 255.255.255.0
    up ip route add 192.168.101.0/24 via 192.168.100.1
    up ip route add 192.168.200.0/21 via 192.168.100.1
    up ip rule add from 192.168.100.32 table private
    up ip route add default via 192.168.100.1 dev vmbr0 table private
    up ip route flush cache
    down ip route del default via 192.168.100.1 dev vmbr0 table private
    down ip rule del from 192.168.100.32 table private
    down ip route del 192.168.200.0/21 via 192.168.100.1
    down ip route del 192.168.101.0/24 via 192.168.100.1
    down ip route flush cache
Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre then re-emits the packets and forwards them to the destination host.
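One way to confirm that extra hop from beaubourg (the target address below is just an example):
ip route get 192.168.100.101 from 192.168.100.32   # next hop chosen for traffic sourced from vmbr0's address
traceroute -n 192.168.100.101                      # 192.168.100.1 (louvre) should show up as the first hop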
Tue, Feb 12
Louvre had previously gone down more than once. Some of these events are documented in T1173.
Mon, Feb 11
Thu, Feb 7
During the pvmove off /dev/md3, the root filesystem for uffizi ended up being remounted read-only. I've shut it down, fsck'd it, and booted it back up.
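For reference, the evacuation itself is a single LVM command (sketch; it assumes enough free extents on the other physical volumes of the volume group):
pvmove /dev/md3    # relocate all allocated extents off /dev/md3 onto other PVs in the VG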
After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
This is the kind of error that was reported massively and suddenly when louvre stopped operating properly:
Buffer I/O error on device dm-41, logical block 10474329
A brand new virtual disk was created, skipping bad data blocks.
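The exact command used is not recorded here; as an illustration only, with hypothetical volume names, GNU ddrescue copies whatever is readable and skips the bad blocks:
# Copy the damaged volume to a fresh one, keeping a map of the unreadable areas.
ddrescue -f /dev/ssd/old-disk /dev/ssd/new-disk /root/rescue.map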
Wed, Feb 6
The RAID volume was successfully rebuilt; closing even though the root cause of the initial error was not found.
dm-11 is a device sitting on top of dm-10, which is itself backed by /dev/sda.
A more complete list of I/O errors as reported by dmesg(1):
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb 5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb 5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb 5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb 5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb 5 09:39:51 2019] md: super_written gets error=10
[Tue Feb 5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb 5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
Tue, Feb 5
None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).
Forcing a rebuild by removing and re-adding the faulty device:
mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
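The resync can then be followed with the usual tools:
watch cat /proc/mdstat       # live rebuild progress
mdadm --detail /dev/md3      # per-device state of the array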
Like in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.
A full read check of /dev/sda did not return any error:
# dd if=/dev/sda of=/dev/null bs=1M
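A complementary check one could run on the physical disk (not necessarily done here; behind some RAID controllers a -d option is needed to reach the drive):
smartctl -a /dev/sda    # SMART health, error log and self-test results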
As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are in turn handled by the physical /dev/sda and /dev/sdb SSDs, respectively.
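The corresponding inspection commands, for reference (their output is omitted here):
lsblk -s /dev/dm-11 /dev/dm-13   # walk each dm device down to its physical disk
multipath -ll                    # list the multipath maps and the state of each path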
Fri, Feb 1
Removing the previously used volume allowed VM migration to complete.
Since the previously used drive is not used anymore, I decided to remove it:
# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
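A quick way to check the volume is really gone afterwards:
lvs ssd    # list the remaining logical volumes in the "ssd" volume group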