
Admin database backup
Closed, Migrated · Edits Locked

Description

Now that a server is dedicated to the admin databases (T3833), we should back up the postgresql instance to ensure the data of all the admin/support tools is properly secured.

Event Timeline

ardumont triaged this task as Normal priority. Jan 25 2022, 6:06 PM
ardumont moved this task from Backlog to Weekly backlog on the System administration board.

Unfortunately, I didn't find any tool that natively supports backing up postgresql through ZFS snapshots.
We have a few alternatives:

  • Use borg to implement a clean backup of the postgresql data directory based on hooks [1]
  • Use a tool like pgbackrest [2]; the backup type and storage target would need to be chosen (S3, storage server, ...)
  • Implement a backup based on ZFS snapshots, which should not be too complicated, but we would need to manage all the plumbing ourselves to deal with full/incremental backups and history cleaning (see the sketch after the references below)

[1] https://torsion.org/borgmatic/#it's-your-data.-keep-it-that-way.
[2] https://pgbackrest.org/
[3] P1269
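
For the last option, a minimal sketch of the history-cleaning part (the retention count and dataset name are illustrative, not a decided policy):

# keep only the 7 most recent snapshots of data/postgresql, destroy the older ones
zfs list -H -t snapshot -o name -s creation -d 1 data/postgresql \
  | head -n -7 \
  | xargs -r -n 1 zfs destroy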

The ZFS snapshots can either be dumped to plain files or sent directly to another server.
As a first step, we could send the backups to banco; a rough sketch of the direct send follows.
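This assumes a data/backup/dali dataset already exists on banco and that root ssh access from dali is allowed (both hypothetical):

zfs send -v data/postgresql@001 | ssh root@banco zfs receive -u data/backup/dali/postgresql
zfs send -v -i data/postgresql@001 data/postgresql@002 | ssh root@banco zfs receive -u data/backup/dali/postgresql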

Examples of the plain-file approach:

  • Creating an initial full backup
zfs send -v data/postgresql@001 | pigz -9 > /tmp/pg_001-full.gz
  • Creating an incremental backup based on the previous backup
zfs send -v -i data/postgresql@001 data/postgresql@002 | pigz -9 > /tmp/pg_002-inc.gz
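
Restoring from these dumps would roughly be the reverse (the target dataset name is illustrative):

# restore the full backup into a fresh dataset
pigz -d -c /tmp/pg_001-full.gz | zfs receive data/postgresql-restore
# then replay the incremental on top of it
pigz -d -c /tmp/pg_002-inc.gz | zfs receive data/postgresql-restore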

On dali, a compressed full snapshot takes 45 GB and ~15 min to produce.
This could probably be improved by compressing on a server with more CPUs (see the sketch after the transcript below).

root@dali:~# zfs snapshot data/postgresql@test
root@dali:~# zfs list -t all
NAME                   USED  AVAIL     REFER  MOUNTPOINT
data                  61.9G   131G       24K  /data
data/postgresql       61.8G   131G     61.8G  /srv/postgresql/14/main
data/postgresql@test   374K      -     61.8G  -
root@dali:~# time zfs send -v data/postgresql@test | pigz -9 > /data/backup.gz
full send of data/postgresql@test estimated size is 66.9G
total estimated size is 66.9G
TIME        SENT   SNAPSHOT data/postgresql@test
09:39:43   37.9M   data/postgresql@test
09:39:44   73.7M   data/postgresql@test
...
09:55:38   66.9G   data/postgresql@test

real	15m57.296s
user	55m32.894s
sys	1m58.312s

root@dali:~# ls -alh /data
total 45G
drwxr-xr-x  2 root root    3 Jan 27 09:39 .
drwxr-xr-x 19 root root 4.0K Jan 26 06:03 ..
-rw-r--r--  1 root root  45G Jan 27 09:55 backup.gz
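
As mentioned above, the compression could be offloaded by piping the raw stream to a host with more CPUs, at the cost of sending the uncompressed data over the network (hostname and path are illustrative):

zfs send -v data/postgresql@test | ssh somehost "pigz -9 > /srv/backups/dali-pg_test-full.gz"
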
vsellier changed the task status from Open to Work in Progress. Jan 27 2022, 10:50 AM

The dali database directory tree was prepared to have a dedicated dataset mounted for the WALs:

root@dali:~# date
Tue Feb  8 18:48:57 UTC 2022
root@dali:~# systemctl stop postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 5ms ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# systemctl status postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 10s ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# mv /srv/postgresql/14/main/pg_wal /srv/postgresql/14/main/pg_wal-save
root@dali:~# zfs create -o atime=off -o compression=zstd -o mountpoint=/srv/postgresql/14/main/pg_wal data/postgresql/wal
root@dali:~# zfs list
NAME                  USED  AVAIL     REFER  MOUNTPOINT
data                 66.8G   126G       24K  /data
data/postgresql      66.7G   126G     66.7G  /srv/postgresql/14/main
data/postgresql/wal    24K   126G       24K  /srv/postgresql/14/main/pg_wal
root@dali:~# cp -r /srv/postgresql/14/main/pg_wal-save/* /srv/postgresql/14/main/pg_wal
root@dali:~# chown -R postgres: /srv/postgresql/14/main/pg_wal
root@dali:~# systemctl start postgresql@14-main
root@dali:~# date
Tue Feb  8 18:51:23 UTC 2022

The first local snapshots worked:

root@dali:~# zfs list -t all
NAME                                                       USED  AVAIL     REFER  MOUNTPOINT
data                                                      66.7G   126G       24K  /data
data/postgresql                                           66.6G   126G     66.6G  /srv/postgresql/14/main
data/postgresql@autosnap_2022-02-08_19:04:44_monthly      1.47M      -     66.6G  -
data/postgresql@autosnap_2022-02-08_19:04:44_daily         194K      -     66.6G  -
data/postgresql/wal                                       31.8M   126G     14.9M  /srv/postgresql/14/main/pg_wal
data/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  16.3M      -     31.3M  -
data/postgresql/wal@autosnap_2022-02-08_19:04:44_daily      13K      -     15.0M  -

The remote sync will be implemented later.
The snapshots will probably be sent to an Azure VM with some storage (it will depend on the tests), as we don't have servers with free ZFS storage (except saam, but I'm not comfortable using the archive's main ZFS storage for that).

  • backup01 VM created on Azure
  • zfs installed (will be reported in puppet):
    • add the contrib repository
    • install zfs
# apt install linux-headers-cloud-amd64 zfs-dkms
  • configure the zfs pool
root@backup01:~# fdisk /dev/sdc -l
Disk /dev/sdc: 200 GiB, 214748364800 bytes, 419430400 sectors
Disk model: Virtual Disk    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D0FB08C6-F046-F340-AC8B-D6C9372015D5

Device         Start       End   Sectors  Size Type

root@backup01:~# zpool create data -o ashift=12   wwn-0x6002248004cd2d289007c07940782f60
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   360K   193G       96K  /data
root@backup01:~# zfs set canmount=noauto data
root@backup01:~# zfs set mountpoint=none data
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   480K   193G       96K  none
root@backup01:~# zfs create data/sync
root@backup01:~# zfs create data/sync/dali

root@backup01:~# zfs list 
NAME             USED  AVAIL     REFER  MOUNTPOINT
data             732K   193G       96K  none
data/sync        192K   193G       96K  none
data/sync/dali    96K   193G       96K  none
  • the opnsense gateway on Azure needed to be updated to add the admin IP range (192.168.50.0/24)
  • New rule added on opnsense to allow ssh connections from the backup server
  • puppet configuration to replicate the admin database snapshots
    • done in the following diffs
vsellier moved this task from in-progress to done on the System administration board.

The replication is in place:

root@backup01:~# zfs list -t all
NAME                                                                 USED  AVAIL     REFER  MOUNTPOINT
data                                                                 120G  72.5G       96K  none
data/sync                                                            120G  72.5G       96K  none
data/sync/dali                                                       120G  72.5G       96K  none
data/sync/dali/postgresql                                            120G  72.5G     73.2G  none
data/sync/dali/postgresql@autosnap_2022-02-08_19:04:44_monthly      22.9G      -     72.3G  -
data/sync/dali/postgresql@autosnap_2022-02-18_00:00:01_daily        3.11G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-19_00:00:01_daily        2.43G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-20_00:00:01_daily        2.39G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-21_00:00:01_daily        2.44G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-22_00:00:00_daily        2.47G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-23_00:00:02_daily        2.56G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily           0B      -     73.2G  -
data/sync/dali/postgresql/wal                                        600M  72.5G     88.4M  none
data/sync/dali/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  61.9M      -     61.9M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-18_00:00:01_daily    90.9M      -      107M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-19_00:00:01_daily    94.7M      -      111M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-20_00:00:01_daily    55.4M      -     87.5M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-21_00:00:01_daily    50.1M      -     98.2M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-22_00:00:00_daily    57.7M      -      106M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-23_00:00:02_daily    52.8M      -     68.8M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-24_00:00:00_daily       0B      -     88.4M  -

The retention will be 2 monthly snapshots and 30 daily snapshots.
The space usage should be around 200 GB, so we will probably have to extend the data disk a little.
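
Assuming the snapshot management is handled by sanoid (which the autosnap_* snapshot names and the syncoid units suggest), the retention on backup01 would translate into something like the following sanoid.conf stanza (a sketch, not the deployed configuration):

[data/sync/dali/postgresql]
        use_template = backup
        recursive = yes

[template_backup]
        autosnap = no
        autoprune = yes
        hourly = 0
        daily = 30
        monthly = 2
        yearly = 0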

The initial transfer took ~40 min:

Feb 24 09:55:15 backup01 systemd[1]: Starting ZFS dataset synchronization of...
Feb 24 09:55:16 backup01 syncoid[93491]: NEWEST SNAPSHOT: autosnap_2022-02-24_00:00:00_daily
Feb 24 09:55:16 backup01 syncoid[93491]: INFO: Sending oldest full snapshot data/postgresql@autosnap_2022-02-08_19:04:44_monthly (~ 72.4 GB) to new target filesystem:
Feb 24 10:52:12 backup01 syncoid[93491]: INFO: Updating new target filesystem with incremental data/postgresql@autosnap_2022-02-08_19:04:44_monthly ... autosnap_2022-02-24_00:00:00_daily (~ 47.4 GB):
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Succeeded.
Feb 24 11:29:17 backup01 systemd[1]: Finished ZFS dataset synchronization of.
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Consumed 19min 51.324s CPU time.
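
The syncoid unit presumably wraps an invocation along these lines (the exact options are an assumption, not taken from the puppet manifests):

syncoid --recursive --no-sync-snap root@dali:data/postgresql data/sync/dali/postgresql

--no-sync-snap keeps syncoid from creating its own snapshots, so it only replicates the existing sanoid ones.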

The read speed on the Azure disk is the same as the write speed, so if a full recovery is ever needed, we can expect a recovery time of less than 1 hour.
We can probably reduce this time by changing the data disk configuration on Azure, but at some point we will be limited by the VPN bandwidth.
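
Such a recovery would roughly amount to sending the latest snapshot back and restarting the cluster, along these lines (a sketch only; the snapshot name is illustrative and the procedure would have to be validated beforehand):

# on backup01: push the latest snapshot (including the wal child dataset) back to dali
# (postgresql must be stopped on dali during the receive)
zfs send -R data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily \
  | ssh root@dali zfs receive -F data/postgresql
# on dali: restart the cluster once the datasets are back in place
systemctl start postgresql@14-main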