The /dev/md3 RAID volume on louvre failed this morning.
Volume composition [cat /proc/mdstat]:
md3 : active raid10 dm-11[0](F) dm-13[1]
      3750605824 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]
      bitmap: 22/28 pages [88KB], 65536KB chunk
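Beyond /proc/mdstat, the degraded state can be inspected in more detail with mdadm itself (a sketch; this output was not captured in the ticket):

# mdadm --detail /dev/md3      # array state, plus which member is marked faulty
# mdadm --examine /dev/dm-13   # per-device superblock and event count of the surviving leg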
Related tasks:

| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T1520 Numerous dm device failures on louvre |
| Migrated | gitlab-migration | T1486 I/O error on worker06.internal |
| Migrated | gitlab-migration | T1518 I/O error on louvre:/dev/md3 |
As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices, which are in turn backed by the physical SSDs /dev/sda and /dev/sdb respectively:
# lsblk
sda                    8:0    0  3.5T  0 disk
├─sda1                 8:1    0  3.5T  0 part
└─ssd-slot2          253:10   0  3.5T  0 mpath
  └─ssd-slot2-part1  253:11   0  3.5T  0 part
    └─md3              9:3    0  3.5T  0 raid10
sdb                    8:16   0  3.5T  0 disk
├─sdb1                 8:17   0  3.5T  0 part
└─ssd-slot3          253:12   0  3.5T  0 mpath
  └─ssd-slot3-part1  253:13   0  3.5T  0 part
    └─md3              9:3    0  3.5T  0 raid10
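Since both mirror legs sit behind dm-multipath, the path state of the suspect map can be checked as well (a sketch; output not captured from louvre):

# multipath -ll ssd-slot2    # lists the paths behind the map and whether 8:0 (sda) is active or failed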
A full read check of /dev/sda did not return any error:
# dd if=/dev/sda of=/dev/null bs=1M
As in T1486, we have a dm device reporting I/O errors while the underlying physical device shows no visible errors.
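Since the dd pass was clean, the drive's SMART data is another way to rule out media problems; a sketch using smartctl from smartmontools (not run as part of this investigation):

# smartctl -H /dev/sda          # overall health self-assessment
# smartctl -l error /dev/sda    # drive-side error log, if the device keeps one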
Forcing a rebuild by removing and re-adding the faulty device:
# mdadm --manage /dev/md3 -r /dev/dm-11
# mdadm --manage /dev/md3 -a /dev/dm-11
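Recovery progress can then be followed until the array reports [UU] again (a sketch):

# watch -n 5 cat /proc/mdstat    # shows recovery percentage and ETA
# mdadm --wait /dev/md3          # blocks until the resync/recovery finishes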
A more complete list of I/O errors as reported by dmesg(1):
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb 5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb 5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb 5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb 5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb 5 09:39:51 2019] md: super_written gets error=10
[Tue Feb 5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb 5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
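A filter along these lines isolates the relevant kernel messages (a sketch; the exact invocation used was not recorded in the ticket):

# dmesg -T | grep -E 'sda|dm-10|dm-11|md3'    # human-readable timestamps, messages for the affected devices only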
dm-11 sits on top of dm-10, which is itself backed by /dev/sda:
sda                    8:0    0  3.5T  0 disk
├─sda1                 8:1    0  3.5T  0 part
└─ssd-slot2          253:10   0  3.5T  0 mpath
  └─ssd-slot2-part1  253:11   0  3.5T  0 part
(ssd-slot2 is dm-10 and ssd-slot2-part1 is dm-11)
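This mapping can be confirmed directly from device-mapper (a sketch; map names taken from the lsblk output above):

# dmsetup ls --tree    # shows ssd-slot2-part1 (253:11) stacked on ssd-slot2 (253:10)
# dmsetup info -c      # columns view including the major:minor pair of each map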
The RAID volume was successfully rebuilt. Closing, even though the root cause of the initial error was not found.
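For future reference, a consistency check of the rebuilt mirror can be scheduled through sysfs (a sketch; this was not part of the original ticket):

# echo check > /sys/block/md3/md/sync_action    # start a background scrub of the array
# cat /sys/block/md3/md/mismatch_cnt            # expected to read 0 once the check completes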