From Tuesday 2020-05-05 to Tuesday 2020-05-12, belvedere, our main database server, has seen some pretty severe outages.
The main and secondary postgresql databases have had their underlying storage fill up or lock up at several points.
To the best of our knowledge, some ingestion work has been lost (as some transactions have been rolled back), but this should not have affected the integrity of our data.
This task will serve as a timeline of events and corrective steps taken.
=== Background ===
Before 2020-05-05, the logical volume holding the main database cluster on belvedere had already been grown to the full capacity of its volume group; disk usage was around 80% and growing by a few percentage points per week, which meant a storage backend migration was needed in short order.
The storage on belvedere was split between two LVM volume groups:
- a main volume group, inherited from louvre/prado, backed by a set of raid10s on SAS SSDs
- 4 * 1.60 TB, manufactured 2015-W25
- 3 * 3.84 TB, manufactured 2016-W07
- 1 * 3.84 TB, manufactured 2016-W29
- 2 * 3.84 TB, manufactured 2017-W14
- a secondary volume group, created when belvedere was purchased in early 2019, backed by a raid10 on top of 8 * 1.92 TB SATA SSDs (bought with belvedere).
None of the disks showed any SMART errors before or throughout this outage.
=== Timeline of incidents ===
==== Initial out of disk space condition ====
- On 2020-05-05 at around 10:00 UTC, we started the `147.sql` swh.storage migration script to create the new origin_visit_status table. The script was tested on a snapshot of production data in the staging environment, and passed flawlessly. The first implementation of this migration script did a single bulk query to create entries in the new table.
- On 2020-05-05 at 14:00 UTC, we noticed that the main postgresql cluster was close to running out of disk space.
- it had around 600 GB of pending WALs
- it had around 100 GB of pending decoded transactions in the logical replication work directory
- it had around 1.3 TB of free space
PostgreSQL logical replication works by serializing the actions of all transactions on the primary database before sending them to the replica for application. This means that a long-running transaction makes all concurrent transactions pile up on the primary database until it completes; only then can it, and all the transactions that ran concurrently with it, be decoded and flushed to the replica.
In our workload, decoding takes up disk space approximately equivalent to the volume of WALs to be replicated. It was unclear whether we would have enough space to run the replication to completion.
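For future reference, the WAL backlog retained for replication can be checked directly on the primary; a minimal sketch (the database name is an assumption):

```shell
# How much WAL the primary is retaining for each logical replication
# slot (i.e. how far behind the replica is).
psql -d softwareheritage -c "
  SELECT slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
    FROM pg_replication_slots;"
```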
- Around the same time (2020-05-05 14:00 UTC), the workers were shut down to stop WAL traffic on the main database
- Around 2020-05-05 16:00 UTC, seeing the migration query wasn't close to completion, we canceled it
- 1.1 TB of WALs queued up
- 300 GB of decoded transactions
- ~600 GB of free space
At that point the amount of WALs queued up stayed constant; only the decoded transactions increased, until the replication could catch up.
- At 2020-05-05 20:00 UTC, we added some disk space to the main db cluster volume group, out of the secondary cluster's volume group. At around the same time, the replication caught up.
==== Migration for long-term disk space re-organization ====
The organization of disk space on belvedere could have used some improvements: the main volume group had a single large logical volume which contained the historical //disk image// for the original prado virtual machine. This meant that logical volume resizes involved several steps (resize the LV, update the GPT, use kpartx to pick up the resize, then resize the filesystem); it's also unclear whether the partition was properly aligned to 4k sectors. Changing this scheme had been close to the top of the sysadmin todo list for a while, all the more so considering the disk space was close to maxed out.
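For illustration, a resize under that legacy layout looked roughly like this (volume and device names are made up, not the actual belvedere ones):

```shell
# Multi-step resize forced by the disk-image-inside-an-LV layout:
lvextend -L +500G /dev/vg-main/prado-disk     # 1. grow the logical volume
sgdisk -e /dev/vg-main/prado-disk             # 2. move the backup GPT header to the new end
sgdisk -d 1 -n 1:0:0 /dev/vg-main/prado-disk  # 3. recreate partition 1 spanning the new space
kpartx -u /dev/vg-main/prado-disk             # 4. have the kernel notice the resized partition
resize2fs /dev/mapper/vg--main-prado--disk1   # 5. finally, grow the ext4 filesystem
```

With a plain filesystem-on-LV (or zfs) layout, this collapses into a single step.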
ZFS has been very stable for us on the main archive storage, and its flexibility for disk management made it a potential choice for the reorganization of this storage. After reading testimonials of others running PostgreSQL on ZFS, notably on the benefits of compression and snapshotting, I decided to try it out.
The migration plan was the following:
- free up the secondary logical volume completely, to make its 8 underlying disks available
- make a temporary zfs pool on top of the 8 SATA disks
- copy data of the secondary database to the temporary zfs pool
- restart the secondary database on zfs; do some tuning of zfs and postgres to match
- copy data of the main database to the temporary zfs pool
- restart the main database on zfs; improve the tuning
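The first steps of this plan can be sketched as follows (disk paths and dataset names are placeholders; the 8k recordsize matches the PostgreSQL page size):

```shell
# Temporary pool: all 8 SATA SSDs in a single raidz1 vdev.
zpool create -o ashift=12 tmppool raidz1 \
    /dev/disk/by-id/ata-SSD1 /dev/disk/by-id/ata-SSD2 \
    /dev/disk/by-id/ata-SSD3 /dev/disk/by-id/ata-SSD4 \
    /dev/disk/by-id/ata-SSD5 /dev/disk/by-id/ata-SSD6 \
    /dev/disk/by-id/ata-SSD7 /dev/disk/by-id/ata-SSD8
zfs create -o compression=lz4 -o recordsize=8k tmppool/pgsql-secondary
# file-level copy of the (stopped) cluster, preserving all metadata
rsync -aHAX /srv/postgresql/secondary/ /tmppool/pgsql-secondary/
```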
==== Migration to temporary zfs ====
- On 2020-05-06 at 07:30 UTC, we started moving files of the secondary database cluster to the filesystem of the main database cluster
- On 2020-05-06 at 08:30 UTC, the copy was complete. We shrank the main filesystem to remove the overflow logical volume and free up the secondary volume group. A temporary zfs pool was created, using all 8 SATA SSDs in a RAIDZ1 configuration (to get the maximum disk space within the current redundancy constraints); data from the secondary database was copied to that ZFS pool
- On 2020-05-06 at 11:00 UTC, the secondary cluster data copy to ZFS was complete, and the database was restarted on that filesystem; we started copying the data of the main database cluster
- On 2020-05-06 at 12:00 UTC, data from the secondary cluster was removed from the main fs; ingestion was restarted (with main database still on legacy filesystem); data was still being copied from the legacy main fs to ZFS
- On 2020-05-06 at 20:00 UTC, the initial copy of the main cluster data to ZFS was complete; the database kept running on the legacy fs.
==== main database (legacy) filesystem blocked on fdatasync; main database migration to temporary zfs ====
- On 2020-05-07 at 23:00 UTC, the main database (still on its legacy filesystem) locked up.
- On 2020-05-08 at 09:00 UTC, we noticed this issue; all PostgreSQL processes were stuck waiting on some form of fdatasync.
- forcefully stopping PostgreSQL did not work
- mdadm and smart did not report any issue
- the decision to finish the migration to the temporary ZFS pool was taken
- a final copy was run
- On 2020-05-08 at 14:55 UTC, the final copy to ZFS got stuck too, and the server was forcefully rebooted
- On 2020-05-08 at 15:45 UTC, the server had rebooted, the raid10s had resynced, and the copy to zfs was resumed
- On 2020-05-08 at 19:21 UTC, the main database server was recovered and restarted on the temporary zfs pool
==== main database on temporary ZFS ramp up ====
- On 2020-05-08 at 20:00 UTC, a single worker was restarted.
- On 2020-05-09 at 00:31 UTC, the weekly barman backup of the main database cluster started
- On 2020-05-09 at 09:00 UTC, the settings for the zfs ARC were tuned on the database server; Around that time, all workers were restarted.
At that point, both databases were running out of the temporary zfs pool, containing all 8 SATA SSDs in a RAIDZ1 configuration.
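For reference, the main knob for ARC sizing is the `zfs_arc_max` module parameter; an illustrative example (the 16 GiB cap is not the value used on belvedere):

```shell
# Cap the ARC so most of the RAM stays available for PostgreSQL's
# own caches (shared_buffers and the OS page cache it relies on).
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# persist the setting across reboots:
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
```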
==== ZFS/raidz1 on SATA lockup ====
- On 2020-05-10 at 00:24 UTC (second Sunday of the month), the monthly automatic zfs scrub started on the temporary pool
At that point, the ingestion was running at full load (high load on both main and secondary database clusters), and the main database backup was still running.
After 5 minutes, the zfs pool locked up.
dmesg filled up with the following message: `zed[16561]: eid=50 class=deadman pool_guid=0x36D0928681AA2ADB vdev_path=/dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0K201185-part1`
The zfs "deadman" is a watchdog timer that triggers when a read or write operation takes "longer than expected" on one of the underlying devices.
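On OpenZFS, the deadman behavior is controlled via module parameters (parameter names as in current OpenZFS releases):

```shell
# How long an I/O may be outstanding before the deadman considers it hung:
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms
# What to do when it triggers: wait (default), continue, or panic:
cat /sys/module/zfs/parameters/zfs_deadman_failmode
# switching the policy at runtime:
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode
```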
- On 2020-05-10 at 11:30 UTC, the lockup was noticed by a sysadmin.
At that point, we tried to set the "deadman policy" to "continue" (instead of the default of "wait"), in the hope that zfs would time out the write and fail the disk. That did not work.
- On 2020-05-10 at 12:30 UTC, both postgresql services were prevented from starting, and the server was forcefully rebooted.
==== Final recovery ====
Considering the downtime so far, the decision was taken to finalize the migration to the definitive zfs layout, with the databases stopped.
- On 2020-05-10 at 15:00 UTC, a BIOS upgrade and SSD firmware upgrades for the SAS SSDs were queued up
- On 2020-05-10 at 15:15 UTC, the scrub operation on the temporary raidz1 completed with no errors
- On 2020-05-10 at 15:40 UTC, the RAID10s on the SAS drives were cleared to make space for zfs. Devices were `TRIM`med.
- On 2020-05-10 at 16:00 UTC, the `TRIM` operation locked up (bad interplay with the pending firmware update?)
- the server was rebooted
- all queued firmware and BIOS updates were performed
- after rebooting, a new `TRIM` operation was started (and succeeded) on all disks
- On 2020-05-10 at 16:30 UTC, some of the 3.84 TB SSDs were added to the temporary pool. It turns out that it's impossible to remove a top-level RAIDz from a pool, so these SSDs ended up "blocked" in the temporary pool; the databases were restarted temporarily so they could replay their WALs and end up with a consistent state on disk.
- On 2020-05-10 at 16:50 UTC, a new (definitive) pool was created with only top-level VDEVs (no mirroring). Data was copied from the temporary pool to this definitive pool (via a zfs snapshot).
- On 2020-05-10 at 20:00 UTC, both databases were being manually backed up to other servers, while the copy to the new pool was running.
- On 2020-05-11 at 16:50 UTC, the database backups had completed. The temporary zfs pool was destroyed, and the SAS disks were reorganized as mirror pairs in the definitive zfs pool.
- On 2020-05-11 at 23:20 UTC, all mirror pairs were finished resilvering. During the resilver, a lot of checksum errors happened on the disks that had been removed from top-level VDEVs before being re-added as halves of mirror pairs.
- On 2020-05-12 at 07:00 UTC, the final pool was scrubbed to check that all data was in order.
- On 2020-05-12 at 08:30 UTC, the scrub completed with no errors. The checksum errors on the moved disks were cleared.
- On 2020-05-12 at 08:30 UTC, the writing of this post-mortem started.
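For reference, the snapshot-based copy between pools mentioned above boils down to (pool and dataset names are assumptions):

```shell
# Atomic, recursive snapshot of the datasets on the temporary pool,
# then a full replication stream into the definitive pool.
zfs snapshot -r tmppool/pgsql@migration
zfs send -R tmppool/pgsql@migration | zfs receive -F newpool/pgsql
```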
=== Summary of incidents, remedial steps taken ===
- Out of disk space condition during data migration due to replication system
- Diagnosis:
- legacy disk layout carried over with tight margin
- long-running transactions make our replication system use lots of disk space temporarily
- Remedial:
- better understanding of the interplay of long running transactions and replication and disk space in the team
- reorganization of disks to make storage management easier
- migration to a more advanced filesystem with
- compression (lower disk space pressure)
- (writable) snapshots (better ability to run tests for data migrations)
- live disk replacement (to some extent)
- Lockup of main database on legacy filesystem
- Diagnosis:
- Unclear for now.
- Interplay between several complex storage layers (ext4 on top of GPT on top of LVM on top of RAID10s on top of SAS disks in JBOD mode on a "smart" controller)
- filesystem carried over from several generations of legacy systems (basically, since Software Heritage had an infra)
- plausible block alignment mismatch due to complex legacy layers
- plausible lack of trimming on the SSDs
- Remedial:
- disk firmwares upgraded
- all disks fully trimmed
- simplification of storage layers to a single system
- data copied to new filesystem at the file level
- use of a filesystem with low-level checksumming
- enabled automatic trim on the zfs pool
- Lockup of temporary zfs cluster (raidz1 on SATA disks)
- Diagnosis:
- Unclear;
- High load due to concurrent
- backup
- scrub
- full ingestion load
- We've had other issues with SATA on this (?) server, under high load
- Possible lack of trimming on the SATA disks
- Remedial:
- Data moved away from the SATA disks until further analysis of the disks
- No more raidz1
- Monthly scrub crontab moved from Sunday (concurrent with the weekly backups) to Wednesday
- Possible further remedial:
- less intensive backup strategy (use zfs snapshots instead of rsyncing the live data?)
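Such a snapshot-based backup could look roughly like this (host, pool and dataset names are hypothetical). A ZFS snapshot is crash-consistent, so PostgreSQL can replay its WALs from it on restore:

```shell
# Take an atomic snapshot of the live database dataset, then stream
# it to the backup host instead of rsyncing the live files.
snap="tank/pgsql-main@backup-$(date +%F)"
zfs snapshot "$snap"
zfs send "$snap" | ssh backup-host "zfs receive -F backup/pgsql-main"
```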
=== Next steps ===
Now that the storage on belvedere is (hopefully) back in production-ready shape, here are the next steps to restart production:
[x] set mountpoints on the zfs filesystems in the current pool
[x] re-enable postgresql
[x] reboot the server one last time to make sure everything comes back up properly
[ ] make a barman backup once the main postgres server has restarted
[x] re-enable one or two workers to slowly ramp up the load, and clear the backlog of user-facing services
[ ] once the full barman backup is complete, enable other workers
[ ] monitor the load after restarting all workers
There are some background tasks to run on the zfs pools too:
[ ] Make sure the automatic scrub works
[ ] Add a cronjob to run zpool trim regularly (to complement the autotrim feature)
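The cronjob could be as simple as the following fragment (pool name assumed):

```shell
# e.g. in /etc/cron.d/zpool-trim:
# start a full TRIM pass every Saturday at 04:00
0 4 * * 6  root  /usr/sbin/zpool trim tank
```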
We also have some decisions to make about the SATA SSDs, which apparently can't take the full load of the archive.
[ ] Run full SMART tests on all disks
[ ] `TRIM` the SATA SSDs
[ ] Decide what to do with them.
They've been running the secondary database cluster for a year, so they work well under low to medium loads; once we've done basic checks, we can probably add them to the main zfs pool, two at a time, monitoring whether they hold up under the load properly.
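Adding a checked pair back would look like this (pool name and device paths are placeholders):

```shell
# Add two SATA SSDs as a new mirror vdev in the main pool.
zpool add tank mirror \
    /dev/disk/by-id/ata-SATA1 /dev/disk/by-id/ata-SATA2
# Then watch per-vdev latency and queue depths under load:
zpool iostat -v tank 5
```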
=== Conclusion ===
It's been quite an eventful week, but we used the "opportunity" of the initial disk space issue and extended downtime to do some long-needed storage reorganization.
There's some more care needed before we can be confident running our infra at full load again, but this should be doable over the next few days.
The mismatch between the SAS and SATA SSDs is a little concerning. Maybe we should consider moving these SATA SSDs to other, lower-load servers, and get more SAS SSDs for this server (to replace the aging ones). For now, we should stick to running diagnostic tools on these SATA SSDs before we consider adding them back to the storage pool.