
I/O error on louvre:/dev/md3
Closed, ResolvedPublic

Description

The /dev/md3 RAID volume on louvre failed this morning.

Volume composition (from cat /proc/mdstat):

md3 : active raid10 dm-11[0](F) dm-13[1]
      3750605824 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]
      bitmap: 22/28 pages [88KB], 65536KB chunk
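The (F) flag marks dm-11 as failed, and "[2/1] [_U]" means one of the two mirror members is down (slot 0 missing, slot 1 up). A minimal sketch of spotting a degraded array from that status bitmap; it parses a saved copy of the mdstat output above, whereas on louvre one would read /proc/mdstat directly:

```shell
# Sketch: detect a degraded md array from its mdstat status bitmap,
# where "_" marks a missing member. Parses a saved sample copy here.
mdstat='md3 : active raid10 dm-11[0](F) dm-13[1]
      3750605824 blocks super 1.2 512K chunks 2 far-copies [2/1] [_U]'

status=$(printf '%s\n' "$mdstat" | grep -Eo '\[[U_]+\]' | head -n1)
case "$status" in
  *_*) state=degraded ;;
  *)   state=healthy ;;
esac
echo "md3 is $state ($status)"
```

mdadm --detail /dev/md3 reports the same degraded state in a more verbose form.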

Event Timeline

ftigeot created this task.Feb 5 2019, 3:36 PM
ftigeot triaged this task as High priority.

As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are themselves respectively handled by the physical /dev/sda and /dev/sdb SSDs:

# lsblk
sda                                           8:0    0   3.5T  0 disk
├─sda1                                        8:1    0   3.5T  0 part
└─ssd-slot2                                 253:10   0   3.5T  0 mpath
  └─ssd-slot2-part1                         253:11   0   3.5T  0 part
    └─md3                                     9:3    0   3.5T  0 raid10
sdb                                           8:16   0   3.5T  0 disk
├─sdb1                                        8:17   0   3.5T  0 part
└─ssd-slot3                                 253:12   0   3.5T  0 mpath
  └─ssd-slot3-part1                         253:13   0   3.5T  0 part
    └─md3                                     9:3    0   3.5T  0 raid10

A full read check of /dev/sda did not return any error:

# dd if=/dev/sda of=/dev/null bs=1M
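For a multi-terabyte read this takes hours, so GNU dd's status=progress is useful; the same check, sketched here against a scratch file standing in for /dev/sda:

```shell
# Sketch: full sequential read check with progress reporting (GNU dd).
# On louvre the input would be /dev/sda; a scratch file stands in here.
img=/tmp/fake-ssd.img
dd if=/dev/zero of="$img" bs=1M count=4 2>/dev/null    # stand-in device
dd if="$img" of=/dev/null bs=1M status=progress
rc=$?
echo "read check exit status: $rc"
rm -f "$img"
```

A non-zero exit status (or kernel I/O errors in dmesg during the run) would point at the physical device; neither appeared here.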

Like in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.

ftigeot changed the task status from Open to Work in Progress.Feb 5 2019, 4:18 PM

Forcing a rebuild by removing and re-adding the faulty device:

mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
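While the array resyncs, /proc/mdstat grows a "recovery" line; a small sketch that pulls the completion percentage out of it. The sample line below is illustrative, not taken from louvre:

```shell
# Sketch: extract rebuild progress from an mdstat recovery line.
# Illustrative sample only; live, one would read /proc/mdstat.
line='      [=>...................]  recovery =  7.3% (273016448/3750605824) finish=310.2min speed=186512K/sec'
pct=$(printf '%s\n' "$line" | grep -Eo '[0-9]+\.[0-9]+%' | head -n1)
echo "rebuild at $pct"
```

watch cat /proc/mdstat gives the same information continuously.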

A more complete list of the I/O errors, as reported by dmesg(1):

[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb  5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb  5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb  5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb  5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb  5 09:39:51 2019] md: super_written gets error=10
[Tue Feb  5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
                           md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb  5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror

dm-11 is a partition device on top of dm-10, which is itself backed by /dev/sda:

sda                                           8:0    0   3.5T  0 disk
├─sda1                                        8:1    0   3.5T  0 part
└─ssd-slot2                                 253:10   0   3.5T  0 mpath
  └─ssd-slot2-part1                         253:11   0   3.5T  0 part

(ssd-slot2 is dm-10 and ssd-slot2-part1 is dm-11)
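The dm-N names can be read straight off the MAJ:MIN column of lsblk: device-mapper devices use major 253 here, and the minor number is the N in dm-N (253:10 → dm-10, 253:11 → dm-11). A sketch against a saved lsblk line:

```shell
# Sketch: derive the dm-N node name from an lsblk MAJ:MIN field.
# Parses a saved line here; live, `lsblk -o NAME,MAJ:MIN` shows the same columns.
line='└─ssd-slot2                                 253:10   0   3.5T  0 mpath'
minor=$(printf '%s\n' "$line" | grep -Eo '[0-9]+:[0-9]+' | cut -d: -f2)
echo "ssd-slot2 is dm-$minor"
```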

ftigeot closed this task as Resolved.Feb 6 2019, 10:43 AM

The RAID volume was successfully rebuilt; closing, even though the root cause of the initial error was not found.