Numerous dm device failures on louvre
Closed, Migrated

Description

This ticket is intended as a post-mortem analysis of the 2019-02-06 incident in which many (if not all) dm devices on louvre failed at the same moment, reporting I/O errors.
Louvre had to be hard-rebooted more than once in order to make it operational again.

Event Timeline

ftigeot changed the task status from Open to Work in Progress. Feb 7 2019, 3:45 PM
ftigeot triaged this task as High priority.
ftigeot created this task.

Related-to: T1486, T1518

This is the kind of error that was suddenly reported en masse when louvre stopped operating properly:

Buffer I/O error on device dm-41, logical block 10474329

After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:

[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
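
To get a quick view of which dm devices are affected, messages like the ones quoted above can be tallied per device. The following is only a minimal sketch: the function name `count_dm_errors` and the regular expression are assumptions made for illustration, and the pattern covers just the two message variants that appear in this ticket ("on device" and "on dev").

```python
import re
from collections import Counter

# Matches both message variants quoted in this ticket:
#   "Buffer I/O error on device dm-41, logical block 10474329"
#   "Buffer I/O error on dev dm-33, logical block 6999130, async page read"
BUFFER_IO_ERROR = re.compile(
    r"Buffer I/O error on dev(?:ice)? (?P<dev>dm-\d+), logical block (?P<block>\d+)"
)


def count_dm_errors(lines):
    """Return a Counter mapping each dm device name to its number of buffer I/O errors."""
    counts = Counter()
    for line in lines:
        match = BUFFER_IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# The three log lines quoted in this ticket, used as sample input.
sample = [
    "Buffer I/O error on device dm-41, logical block 10474329",
    "[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read",
    "[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read",
]

for dev, n in sorted(count_dm_errors(sample).items()):
    print(dev, n)
```

In practice the input would more likely be the lines of a saved kernel log than a hard-coded list.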

During the pvmove off of /dev/md3, the root filesystem for uffizi ended up being remounted r/o. I've shut it down, fsck'd it, and booted it back up.

Since the pvmove completed, the errors don't seem to have recurred. I did a vgreduce to remove /dev/md3 from the volume group and avoid it altogether.
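
For reference, the evacuation described above boils down to two LVM commands. The exact invocations used on louvre are not recorded in this ticket, so the sketch below is an assumption: the volume group name `vg_example` is a placeholder and the helper `evacuate_pv_commands` is hypothetical. It only builds and prints the command lines and does not execute anything.

```python
import shlex


def evacuate_pv_commands(pv, vg):
    """Build the LVM commands that move data off `pv` and then drop it from `vg`."""
    return [
        # Migrate all allocated extents from pv to the other PVs of its volume group.
        ["pvmove", pv],
        # Remove the now-empty PV from the volume group so it is no longer allocated from.
        ["vgreduce", vg, pv],
    ]


# "vg_example" is a placeholder; the real volume group name is not given in this ticket.
for cmd in evacuate_pv_commands("/dev/md3", "vg_example"):
    print(shlex.join(cmd))
```

Printing the commands rather than running them leaves room to review them first, which seems prudent on a host that is already reporting I/O errors.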

Resolved on 2019-02-07.

gitlab-migration changed the task status from Resolved to Migrated. Oct 19 2022, 5:55 PM
gitlab-migration claimed this task.
gitlab-migration changed the status of subtask T1486: I/O error on worker06.internal from Resolved to Migrated.
gitlab-migration added a subscriber: gitlab-migration.