
test postgres db restore
Closed, Migrated · Edits Locked

Description

We now have live postgres backups on banco, with a retention policy of 4 weeks.

We need to do a test restore on a separate cluster, to:

  • ensure the backups work properly (see the barman sketch after this list)
  • benchmark how long it would take to do a restore in case of a crash
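
As a rough sketch (assuming the backup server entry is named swh, as in the restore output further down), the verification part boils down to the standard barman commands:

# sanity-check the barman configuration and backups for the server
barman check swh

# list the available base backups and their IDs
barman list-backup swh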

Event Timeline

zack raised the priority of this task to High.
zack updated the task description.
zack added a project: System administrators.
zack added a subscriber: zack.

This is now ongoing on banco, in a screen session of user barman.

zack changed the task status from Open to Work in Progress. Dec 10 2015, 12:12 PM
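
The exact invocation is not captured below; reconstructed from the server name, backup ID and destination directory reported in the output (and timed with time, per the real/user/sys figures further down), it was presumably along these lines:

# timed local restore of the 2015-12-04 base backup into a scratch directory
time barman recover swh 20151204T074046 /srv/storage/0/barman-backup-restore-test-T237/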
Starting local restore for server swh using backup 20151204T074046
Destination directory: /srv/storage/0/barman-backup-restore-test-T237/
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 221: archive_command = false

WARNING
You are required to review the following options as potentially dangerous

postgresql.conf line 40: data_directory = '/srv/softwareheritage/postgres/9.4/main'             # use data in another directory
postgresql.conf line 42: hba_file = '/etc/postgresql/9.4/main/pg_hba.conf'      # host-based authentication file
postgresql.conf line 44: ident_file = '/etc/postgresql/9.4/main/pg_ident.conf'  # ident configuration file
postgresql.conf line 48: external_pid_file = '/var/run/postgresql/9.4-main.pid'                 # write an extra PID file
postgresql.conf line 88: ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'         # (change requires restart)
postgresql.conf line 89: ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'                # (change requires restart)

Your PostgreSQL server has been successfully prepared for recovery!

real    3042m17.094s
user    3284m31.816s
sys     227m2.004s

This is now done. It worked well, but it's really slow.

  • it took about 2.5 days to restore locally on banco, using our storage array. This part seems to be CPU-bound by the local rsync, so there is not much we can do to speed things up short of patching barman to use something other than rsync to copy. Also, in a real-life scenario we would need to restore remotely on prado, making rsync even harder to avoid. OTOH we did copy 1.8 TB of DB + 4 TB of WALs (see below).
  • it took about 2.5 days for postgres to restart, replaying the WALs accumulated during the base backup. This is probably a large overestimate w.r.t. a real-life scenario, because we restored a base backup that came with as much WAL as the base backup itself (2.5 TB × 2), due to a massive autovacuum ongoing during the backup. For comparison, the subsequent backup, which we haven't tried restoring, has only 160 GB of WALs, so restoring that should be much faster (a sketch of how to compare backups this way follows this list).
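
To pick a saner backup to restore, the WAL volume attached to each base backup can be compared with the standard barman commands; a rough sketch (the backup ID shown is the one restored above, any other listed ID would do):

# one summary line per base backup, including base and WAL sizes
barman list-backup swh

# detailed information for a given backup, including the WAL it needs for recovery
barman show-backup swh 20151204T074046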

Bottom line: we should try restoring another, saner backup to assess restore time more properly, but the restore did work.

Also, if we want to have sub-day restore times after a potential crash, we should really consider other options, such as streaming replication to a hot spare.
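
For reference, a minimal hot-standby setup for PostgreSQL 9.4 streaming replication would look roughly like this (hostnames, replication user and data directory below are hypothetical, not our actual machines):

# primary: postgresql.conf
wal_level = hot_standby          # emit enough WAL for a hot standby
max_wal_senders = 3              # allow streaming replication connections
wal_keep_segments = 256          # keep some WAL around for a lagging standby

# primary: pg_hba.conf
host  replication  replicator  192.0.2.10/32  md5

# standby: seed the data directory from the primary, then keep streaming
pg_basebackup -h primary.example.org -U replicator -D /srv/postgres/9.4/main -X stream -P

# standby: postgresql.conf
hot_standby = on

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example.org port=5432 user=replicator'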

This comment was removed by zack.
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:07 PM