
Numerous dm device failures on louvre
Started, Work in Progress, High, Public

Description

This ticket is intended as a post-mortem analysis of the 2019-02-06 incident in which many (if not all) dm devices on louvre failed at the same moment, reporting I/O errors.
Louvre had to be hard-rebooted more than once in order to make it operational again.

Event Timeline

ftigeot created this task. Thu, Feb 7, 3:45 PM
ftigeot changed the task status from Open to Work in Progress.
ftigeot triaged this task as High priority.

Related-to: T1486, T1518

ftigeot updated the task description. Thu, Feb 7, 3:48 PM

When louvre stopped operating properly, errors of the following kind were suddenly reported in large numbers:

Buffer I/O error on device dm-41, logical block 10474329

After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:

[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
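
For triage, messages like the ones above can be tallied per device to see how widespread the failure is. The following is a minimal Python sketch, not something that was run during the incident; the function name `count_buffer_io_errors` and the regular expression are assumptions based only on the two message formats quoted in this task.

```python
import re
from collections import Counter

# Matches both message variants quoted in this task:
#   "Buffer I/O error on device dm-41, logical block 10474329"
#   "[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read"
BUFFER_IO_ERROR = re.compile(
    r"Buffer I/O error on dev(?:ice)? (?P<dev>[\w-]+), logical block (?P<block>\d+)"
)


def count_buffer_io_errors(lines):
    """Return a Counter mapping each device name to its number of reported errors."""
    counts = Counter()
    for line in lines:
        match = BUFFER_IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

Feeding it the saved kernel log (for example, the lines of a file captured from `dmesg`) and printing `counts.most_common()` would show which dm devices reported the most errors.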

olasd added a subscriber: olasd. Thu, Feb 7, 9:38 PM

During the pvmove off of /dev/md3, the root filesystem for uffizi ended up being remounted r/o. I've shut it down, fsck'd it, and booted it back up.

Since the pvmove completed, the errors no longer seem to occur. I ran vgreduce to remove /dev/md3 from the volume group, so it is no longer used at all.
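
The evacuation described above can be sketched as the following pair of LVM commands. This is only an illustration: the exact invocations used on louvre are not recorded in this task, the helper `evacuate_pv_commands` is hypothetical, and the volume group name `vg-example` is a placeholder rather than the real name. The sketch builds and prints the commands without running them.

```python
import shlex


def evacuate_pv_commands(vg_name, pv_path="/dev/md3"):
    """Build (without running) the LVM commands that retire a physical volume."""
    return [
        # Move all allocated extents off the PV onto other PVs in the same VG.
        ["pvmove", pv_path],
        # Remove the now-empty PV from the volume group.
        ["vgreduce", vg_name, pv_path],
    ]


# "vg-example" is a placeholder; substitute the actual volume group name.
for cmd in evacuate_pv_commands("vg-example"):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; on a real host the commands would need root privileges and enough free extents on the remaining physical volumes.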