This ticket is intended as a post-mortem analysis of the 2019-02-06 incident in which many (if not all) dm devices on louvre failed at the same moment, reporting I/O errors.
Louvre had to be hard-rebooted more than once in order to make it operational again.
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T1520 Numerous dm device failures on louvre
Migrated | gitlab-migration | T1486 I/O error on worker06.internal
Migrated | gitlab-migration | T1518 I/O error on louvre:/dev/md3
Event Timeline
The kind of error that was reported suddenly and in large numbers when louvre stopped operating properly:
Buffer I/O error on device dm-41, logical block 10474329
After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
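To get a quick picture of which dm devices are affected, messages like the ones quoted in this ticket can be tallied per device. The following is only a minimal sketch: the function name `count_errors` and the `SAMPLE` text are illustrative assumptions, and the regular expression only covers the two message variants quoted here ("on device" and "on dev"), not every form the kernel may emit.

```python
import re
from collections import Counter

# Matches both variants quoted in this ticket:
#   "Buffer I/O error on device dm-41, logical block 10474329"
#   "Buffer I/O error on dev dm-33, logical block 6999130, async page read"
PATTERN = re.compile(
    r"Buffer I/O error on dev(?:ice)? ([\w-]+), logical block (\d+)"
)


def count_errors(log_text):
    """Return a Counter mapping device name to number of buffer I/O errors."""
    counts = Counter()
    # finditer also copes with several messages ending up on one line.
    for match in PATTERN.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


# Sample input: the three messages quoted in this ticket.
SAMPLE = """\
Buffer I/O error on device dm-41, logical block 10474329
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
"""

if __name__ == "__main__":
    for device, n in sorted(count_errors(SAMPLE).items()):
        print(device, n)
```

In practice the input would come from the kernel log (for example saved `dmesg` output) rather than from the hard-coded sample.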
During the pvmove off of /dev/md3, the root filesystem for uffizi ended up being remounted r/o. I've shut it down, fsck'd it, and booted it back up.
After the pvmove completed, the errors no longer seem to occur. I ran a vgreduce so that /dev/md3 is not used at all anymore.
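For reference, the recovery described above boils down to two LVM commands: `pvmove` to migrate the extents off the failing physical volume, then `vgreduce` to remove it from the volume group. The sketch below only builds and prints the command lines without running anything. The volume group name `vg-example` is a hypothetical placeholder (the ticket does not name the volume group), and `evacuation_commands` is an illustrative helper, not something that exists on louvre.

```python
import shlex


def evacuation_commands(vg_name, failing_pv):
    """Return the LVM commands, in order, to stop using a physical volume.

    Nothing is executed here; the caller decides whether and how to run them.
    """
    return [
        # Move all allocated extents off the failing PV onto other PVs in the VG.
        ["pvmove", failing_pv],
        # Remove the now-empty PV from the volume group.
        ["vgreduce", vg_name, failing_pv],
    ]


if __name__ == "__main__":
    # "vg-example" is a hypothetical name; /dev/md3 is the device from this ticket.
    for cmd in evacuation_commands("vg-example", "/dev/md3"):
        print(shlex.join(cmd))
```

Printing the commands first gives a chance to review them before running anything as root, which matters here because the pvmove itself put enough load on the array for a guest's root filesystem to be remounted r/o.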