
post-mortem on the 2020-05 outages of belvedere

Description

From Tuesday 2020-05-05 to Tuesday 2020-05-12, belvedere, our main database server, experienced some pretty severe outages.

The main and secondary postgresql databases have had their underlying storage fill up or lock up at several points.

To the best of our knowledge, some ingestion work has been lost (as some transactions have been rolled back), but this should not have affected the integrity of our data.

This task will serve as a timeline of events and corrective steps taken.

Background

Before 2020-05-05, the logical volume on which the main database cluster was stored on belvedere was at the maximum of its capacity; disk usage was around 80%, growing by a few percentage points per week (which meant that a storage backend migration would be needed soon).

The storage on belvedere was split between two LVM volume groups:

  • a main volume group, inherited from louvre/prado, backed by a set of raid10s on SAS SSDs
    • 4 * 1.60 TB, manufactured 2015-W25
    • 3 * 3.84 TB, manufactured 2016-W07
    • 1 * 3.84 TB, manufactured 2016-W29
    • 2 * 3.84 TB, manufactured 2017-W14
  • a secondary volume group, created when belvedere was purchased, backed by a raid10 on top of 8 * 1.92 TB SATA SSDs (bought with belvedere, in early 2019).

None of the disks showed any SMART errors before or throughout this outage.

Timeline of incidents

Initial out of disk space condition
  • On 2020-05-05 at around 10:00 UTC, we started the 147.sql swh.storage migration script to create the new origin_visit_status table. The script was tested on a snapshot of production data in the staging environment, and passed flawlessly. The first implementation of this migration script did a single bulk query to create entries in the new table.
  • On 2020-05-05 at 14:00 UTC, we noticed that the main postgresql cluster was close to running out of disk space.
    • it had around 600 GB of pending WALs
    • it had around 100 GB of pending decoded transactions in the logical replication work directory
    • it had around 1.3 TB of free space

PostgreSQL logical replication works by serializing the actions of all transactions on the primary database before sending them to the replica for application. This means that a long-running transaction makes all concurrent transactions pile up on the primary database until it completes; only then can it, along with all the transactions that ran concurrently with it, be flushed to the replica.

In our workflow, decoding takes up space approximately equivalent to the volume of WALs to be replicated. It was unclear whether we would have enough space to run the replication to completion.
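
To keep an eye on this kind of buildup, the amount of WAL retained for each replication slot and the size of the decoded-transaction spill directory can be checked directly. A rough sketch (the data directory path is an assumption, not the actual production layout):

    # size of queued WALs and of decoded transactions spilled to disk
    # (paths are relative to the PostgreSQL data directory; adjust as needed)
    du -sh /var/lib/postgresql/11/main/pg_wal /var/lib/postgresql/11/main/pg_replslot

    # WAL still retained for each replication slot, i.e. the replication backlog
    psql -c "SELECT slot_name,
                    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
                      AS retained_wal
             FROM pg_replication_slots;"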

  • Around the same time (2020-05-05 14:00 UTC), the workers were shut down to stop WAL traffic on the main database
  • Around 2020-05-05 16:00 UTC, seeing the migration query wasn't close to completion, we canceled it
    • 1.1 TB of WALs queued up
    • 300 GB of decoded transactions
    • ~600 GB of free space

At that point the amount of queued WALs stayed constant and only the volume of decoded transactions kept increasing, until the replication could catch up.

  • At 2020-05-05 20:00 UTC, we added some disk space to the main db cluster volume group, out of the secondary cluster's volume group. At around the same time, the replication caught up.
Migration for long-term disk space re-organization

The organization of disk space on belvedere could have used some improvements: the main volume group had a single large logical volume which contained the historical disk image for the original prado virtual machine. This meant that logical volume resizes involved several steps (resize the LV, update the GPT, use kpartx to notice the resize, then resize the filesystem); it's also unclear whether the partition was properly aligned to 4k sectors. Changing this scheme had been close to the top of the sysadmin todo list for a while, all the more so considering the disk space was close to maxed out.
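
For illustration, growing the filesystem under that scheme looked roughly like the following sequence (volume group, logical volume and partition names are hypothetical):

    # 1. grow the logical volume holding the legacy disk image
    lvextend -L +500G /dev/ssd/prado-disk
    # 2. grow the GPT partition inside that image to use the new space
    parted /dev/ssd/prado-disk resizepart 1 100%
    # 3. make the kernel pick up the resized partition mapping
    kpartx -u /dev/ssd/prado-disk
    # 4. finally grow the ext4 filesystem itself
    resize2fs /dev/mapper/ssd-prado--disk1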

ZFS has been very stable for us on the main archive storage, and its flexibility for disk management made it a potential choice for the reorganization of this storage. After reading testimonials of others running PostgreSQL on ZFS, notably on the benefits of compression and snapshotting, I decided to try it out.

The migration plan was the following:

  • free up the secondary logical volume completely, to make its 8 underlying disks available
  • make a temporary zfs pool on top of the 8 SATA disks (see the sketch after this list)
  • copy data of the secondary database to the temporary zfs pool
  • restart the secondary database on zfs; do some tuning of zfs and postgres to match
  • copy data of the main database to the temporary zfs pool
  • restart the main database on zfs; improve the tuning
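
A sketch of the temporary pool creation step (pool, dataset and device names are placeholders; the actual options used may have differed):

    # temporary pool: the 8 SATA SSDs in a single raidz1 vdev
    # (sdX names are placeholders; stable /dev/disk/by-id paths are preferable)
    zpool create -o ashift=12 -O compression=lz4 -O atime=off \
        data-tmp raidz1 sda sdb sdc sdd sde sdf sdg sdh

    # one dataset per database cluster, with a record size matching
    # PostgreSQL's 8 kB pages
    zfs create -o recordsize=8k data-tmp/postgres-main
    zfs create -o recordsize=8k data-tmp/postgres-secondary
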
Migration to temporary zfs
  • On 2020-05-06 at 07:30 UTC, we started moving files of the secondary database cluster to the filesystem of the main database cluster
  • On 2020-05-06 at 08:30 UTC, the copy was complete. We shrunk the main filesystem to remove the overflow logical volume and free up the secondary volume group. A temporary zfs pool was created, using all 8 SATA SSDs in a RAIDZ1 configuration (to get the maximum disk space with the current redundancy constraints); data from the secondary database was copied to that ZFS pool
  • On 2020-05-06 at 11:00 UTC, the secondary cluster data copy to ZFS was complete, and the database was restarted on that filesystem; we started copying the data of the main database cluster
  • On 2020-05-06 at 12:00 UTC, data from the secondary cluster was removed from the main fs; ingestion was restarted (with main database still on legacy filesystem); data was still being copied from the legacy main fs to ZFS
  • On 2020-05-06 at 20:00 UTC, the initial copy of the main cluster data to ZFS was complete; the database kept running on the legacy fs.
main database (legacy) filesystem blocked on fdatasync; main database migration to temporary zfs
  • On 2020-05-07 at 23:00 UTC, the main database (still on its legacy filesystem) locked up.
  • On 2020-05-08 at 09:00 UTC, we noticed this issue; all postgresql processes were stuck waiting on some form of fdatasync.
    • forcefully stopping PostgreSQL did not work
    • mdadm and smart did not report any issue
    • the decision to finish the migration to the temporary ZFS pool was taken
    • a final copy was run
  • On 2020-05-08 at 14:55 UTC, the final copy to ZFS got stuck too, and the server was forcefully rebooted
  • On 2020-05-08 at 15:45 UTC, the server had rebooted, the raid10s had resynced, and the copy to zfs was resumed
  • On 2020-05-08 at 19:21 UTC, the main database server was recovered and restarted on the temporary zfs cluster
main database on temporary ZFS ramp up
  • On 2020-05-08 at 20:00 UTC, a single worker was restarted.
  • On 2020-05-09 at 00:31 UTC, the weekly barman backup of the main database cluster started
  • On 2020-05-09 at 09:00 UTC, the settings for the zfs ARC were tuned on the database server (sketched below); around that time, all workers were restarted.

At that point, both databases were running out of the temporary zfs pool, containing all 8 SATA SSDs in a RAIDZ1 configuration.
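
The exact ARC values used are not recorded in this task; as an illustration, capping the ARC so that it does not compete with PostgreSQL's shared buffers and the page cache looks like this (the 64 GiB figure is an assumption):

    # runtime change
    echo $((64 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
    # persistent across reboots
    echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf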

ZFS/raidz1 on SATA lockup
  • On 2020-05-10 at 00:24 UTC (second Sunday of the month), the monthly automatic zfs scrub started on the temporary pool

At that point, the ingestion was running at full load (high load on both main and secondary database clusters), and the main database backup was still running.

After 5 minutes, the zfs pool locked up.

dmesg filled up with the following message: zed[16561]: eid=50 class=deadman pool_guid=0x36D0928681AA2ADB vdev_path=/dev/disk/by-id/ata-MZ7KM1T9HMJP0D3_S3BRNX0K201185-part1

The zfs "deadman" is a watchdog timer that triggers when a read or write operation takes "longer than expected" on one of the underlying devices.

  • On 2020-05-10 at 11:30 UTC, the lockup is noticed by a sysadmin.

At that point, we tried to set the "deadman policy" to "continue" (instead of the default of "wait"), in the hope that zfs would time out the write and fail the disk. That did not work.
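
For reference, the deadman timeouts and the policy applied when the watchdog fires are plain ZFS module parameters, which is presumably how the change was attempted; a sketch:

    # how long an individual I/O may be outstanding before the deadman fires (ms)
    cat /sys/module/zfs/parameters/zfs_deadman_ziotime_ms
    # what to do when it fires: wait (default), continue, or panic
    echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode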

  • On 2020-05-10 at 12:30 UTC, both postgresql services are prevented from starting, and the server is forcefully rebooted.
Final recovery

Considering the downtime so far, the decision is taken to finalize the migration to the definitive zfs layout, with the databases stopped.

  • On 2020-05-10 at 15:00 UTC, a bios upgrade and SSD firmware upgrades for the SAS SSDs are queued up
  • On 2020-05-10 at 15:15 UTC, the scrub operation on the temporary raidz1 completes with no errors
  • On 2020-05-10 at 15:40 UTC, the RAID10s on the SAS drives are cleared to make space for zfs. Devices are TRIMmed.
  • On 2020-05-10 at 16:00 UTC, the TRIM operation locks up (bad interplay with the firmware update?)
    • the server is rebooted
    • all queued firmware and bios updates are performed
    • after rebooting, a new TRIM operation is started (and succeeds) on all disks
  • On 2020-05-10 at 16:30 UTC, some of the 3.84 TB SSDs are added to the temporary pool. It turns out that it's impossible to remove a top-level raidz vdev from a pool, so these SSDs end up "blocked" in the temporary pool; the databases are restarted temporarily so they can replay their WALs and reach a consistent state on disk.
  • On 2020-05-10 at 16:50 UTC, a new (definitive) pool is created with only top-level VDEVs (no mirroring). Data is copied from the temporary pool to this definitive pool (via a zfs snapshot).
  • On 2020-05-10 at 20:00 UTC, both databases were being manually backed up to other servers while the copy to the new pool was running.
  • On 2020-05-11 at 16:50 UTC, the database backups had completed. The temporary zfs pool was destroyed, and the SAS disks were reorganized as mirror pairs in the definitive zfs pool.
  • On 2020-05-11 at 23:20 UTC, all mirror pairs were finished resilvering. During the resilver, a lot of checksum errors appeared on the disks that had been removed from top-level VDEVs before being re-added as halves of mirror pairs (see the sketch after this list).
  • On 2020-05-12 at 07:00 UTC, the final pool was scrubbed to check that all data is in order.
  • On 2020-05-12 at 08:30 UTC, the scrub completed with no errors. The checksum errors on the moved disks were cleared.
  • On 2020-05-12 at 08:30 UTC, the writing of this post-mortem started.
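
A sketch of the mirror reshaping described above (pool and device names are placeholders): attaching a second device to a single-disk top-level vdev turns it into a mirror, and the error counters can be reset once a scrub has validated the data.

    # turn a single-disk top-level vdev into a mirror pair, then wait for resilver
    zpool attach data sas-disk-1 sas-disk-2
    zpool status data
    # verify the whole pool, then reset the checksum error counters
    zpool scrub data
    zpool clear data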

Summary of incidents, remedial steps taken

  • Out of disk space condition during data migration due to replication system
    • Diagnosis:
      • legacy disk layout carried over with tight margin
      • long-running transactions make our replication system use lots of disk space temporarily
    • Remedial:
      • better understanding within the team of the interplay between long-running transactions, replication, and disk space
      • reorganization of disks to make storage management easier
      • migration to a more advanced filesystem with
        • compression (lower disk space pressure)
        • (writable) snapshots (better ability to run tests for data migrations)
        • live disk replacement (to some extent)
  • Lockup of main database on legacy filesystem
    • Diagnosis:
      • Unclear for now.
      • Interplay between several complex storage layers (ext4 on top of GPT on top of LVM on top of RAID10s on top of SAS disks in JBOD mode on a "smart" controller)
      • filesystem carried over from several generations of legacy systems (basically, since Software Heritage had an infra)
      • plausible block alignment mismatch due to complex legacy layers
      • plausible lack of trimming on the SSDs
    • Remedial:
      • disk firmware upgraded
      • all disks fully trimmed
      • simplification of storage layers to a single system
      • data copied to new filesystem at the file level
      • use of a filesystem with low-level checksumming
      • enabled automatic trim on the zfs pool
  • Lockup of temporary zfs cluster (raidz1 on SATA disks)
    • Diagnosis:
      • Unclear;
      • High load due to concurrent
        • backup
        • scrub
        • full ingestion load
      • We've had other issues with SATA on this (?) server, under high load
      • Possible lack of trimming on the SATA disks
    • Remedial:
      • Data moved away from the SATA disks until further analysis of the disks
      • No more raidz1
      • Monthly scrub crontab moved away from Sunday (concurrent with backups) to Wednesday
    • Possible further remedial:
      • less intensive backup strategy (use zfs snapshots instead of rsyncing the live data?)
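
If we go down the snapshot-based backup route mentioned in the last item above, the idea would be roughly the following (host and dataset names are hypothetical). Note that a snapshot of a live cluster is only crash-consistent, so proper WAL archiving or a pg_start_backup/pg_stop_backup bracket would still be needed around it:

    # take an atomic snapshot of the dataset holding the database cluster
    zfs snapshot data/postgres-main@backup-2020-05-12
    # stream it to the backup host instead of rsyncing the live files
    zfs send data/postgres-main@backup-2020-05-12 \
        | ssh backup.internal zfs receive -F backup/postgres-main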

Next steps

Now that the storage on belvedere is (hopefully) back in production-ready shape, here are the next steps to restart production:

  • set mountpoints on the zfs filesystems in the current pool (sketched after this list)
  • re-enable postgresql
  • reboot the server one last time to make sure everything comes back up properly
  • make a barman backup once the main postgres server has restarted
  • re-enable one or two workers to slowly ramp up the load, and clear the backlog of user-facing services
  • once the full barman backup is complete, enable other workers
  • monitor the load after restarting all workers
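
The first two items boil down to something like this (dataset, mountpoint and service instance names are assumptions):

    # mount the database datasets where PostgreSQL expects them
    zfs set mountpoint=/srv/softwareheritage/postgres/main data/postgres-main
    zfs set mountpoint=/srv/softwareheritage/postgres/secondary data/postgres-secondary
    # re-enable the clusters (Debian-style service instances)
    systemctl enable --now postgresql@11-main postgresql@11-secondary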

There are some background tasks to keep running on the zfs pools too:

  • Make sure the automatic scrub works
  • Add a cronjob to run zpool trim regularly (to complement the autotrim feature)
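
Roughly (pool name and schedule are placeholders):

    # continuously trim blocks as they are freed
    zpool set autotrim=on data
    # plus a periodic full trim, complementing autotrim
    echo '0 4 * * 1  root  /sbin/zpool trim data' > /etc/cron.d/zpool-trim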

We also have some decisions to make about the SATA SSDs, which apparently can't take the full load of the archive.

  • Run full SMART tests on all disks
  • TRIM the SATA SSDs
  • Decide what to do with them.

They've been running the secondary database cluster for a year, so they work well under low to medium loads; once we've done basic checks, we can probably add them to the main zfs pool, two at a time, monitoring whether they hold up under the load properly.
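
The checks and the gradual re-addition could look like this (device names are placeholders, and each pair would only be added after the previous one has proven stable):

    # extended SMART self-test on each SATA SSD (results checked later with -a)
    smartctl -t long /dev/sda
    smartctl -a /dev/sda
    # wipe and fully trim the device before reuse
    blkdiscard /dev/sda
    # add the disks back two at a time, as a new mirror vdev
    zpool add data mirror sda sdb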

Conclusion

It's been quite an eventful week, but we used the "opportunity" of the initial disk space issue and extended downtime to do some long-needed storage reorganization.

There's some more care needed before we can be confident running our infra at full load again, but this should be doable over the next few days.

The mismatch between SAS and SATA SSDs is a little concerning. Maybe we should consider moving these SATA SSDs to other, lower-load servers, and get more SAS SSDs for this server (to replace the aging ones). For now, we should keep running diagnostic tools on these SATA SSDs before considering adding them back to the storage pool.

Event Timeline

olasd changed the task status from Open to Work in Progress. May 12 2020, 1:02 PM
olasd triaged this task as Unbreak Now! priority.
olasd created this task.

These have happened now:

  • set mountpoints on the zfs filesystems in the current pool
  • re-enable postgresql
  • reboot the server one last time to make sure everything comes back up properly
  • re-enable one or two workers to slowly ramp up the load, and clear the backlog of user-facing services

The worker has been stopped again because, after a reboot of uffizi, the zfs mounts didn't come up, and swh-storage filled up the root partition instead... This has been mitigated now (pending another uffizi reboot once the objects have been moved over).

And this is now in progress (started today at 08:37 UTC):

  • make a barman backup once the main postgres server has restarted
In T2403#44381, @olasd wrote:
  • re-enable one or two workers to slowly ramp up the load, and clear the backlog of user-facing services

The worker has been stopped again because, after a reboot of uffizi, the zfs mounts didn't come up, and swh-storage filled up the root partition instead... This has been mitigated now (pending another uffizi reboot once the objects have been moved over).

The uffizi issues have been solved. worker01 has been restarted and uffizi / belvedere are being monitored

And this is now in progress (started today at 08:37 UTC):

  • make a barman backup once the main postgres server has restarted

This is still running, and it's maxing out the 1Gbps link to the backup server. The amount of data to transfer is supposedly 4.5 TiB, which means around 10 hours.

  • make a barman backup once the main postgres server has restarted

The backup has now completed (in 12 hours).

  • once the full barman backup is complete, enable other workers

I had enabled worker01 and worker02 before the backup completed. That had taken care of the backlog of Save Code Now requests. I'll now enable worker03 and worker04, as well as worker09-12 (on the other hypervisor).

olasd lowered the priority of this task from Unbreak Now! to Normal. May 15 2020, 12:33 PM
olasd updated the task description.
  • once the full barman backup is complete, enable other workers

The full barman backup has completed now. 6 more workers (total 8) have been enabled right after that. The postgres load has been a bit higher than before, but not unreasonably so.

The 8 remaining workers are being enabled again now, which means the infra is back in nominal conditions.

olasd updated the task description.
  • monitor the load after restarting all workers

Everything has been running stably for a month now.

  • Make sure the automatic scrub works

The monthly automatic scrub ran on June 3rd, with no issues whatsoever.

I think this task can now be closed. I'll add followups for the rest of the items to do.