
Admin database backup
Closed, Migrated · Edits Locked

Description

Now that a server is dedicated to the admin databases (T3833), we should back up the postgresql instance to ensure the data of all the admin/support tools is properly secured.

Event Timeline

ardumont triaged this task as Normal priority. Jan 25 2022, 6:06 PM
ardumont moved this task from Backlog to Weekly backlog on the System administration board.

Unfortunately, I didn't find any tool that natively supports backing up postgresql through ZFS snapshots.
We have a few alternatives:

  • Use borg to implement a clean backup of the postgresql data directory based on hooks [1]
  • Use a tool like pgbackrest [2]; the backup type and storage target would need to be chosen (S3, storage server, ...)
  • Implement a backup based on ZFS snapshots, which should not be too complicated, but we would need to manage all the plumbing ourselves to deal with full/incremental backups and history cleaning (see the sketch after the references below)

[1] https://torsion.org/borgmatic/#it's-your-data.-keep-it-that-way.
[2] https://pgbackrest.org/
[3] P1269
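
For the last option, a minimal sketch of the history-cleaning part (the retention count and dataset name are illustrative, not a decided policy):

# keep only the 7 most recent snapshots of data/postgresql, destroy the older ones
zfs list -H -t snapshot -o name -s creation -d 1 data/postgresql \
  | head -n -7 \
  | xargs -r -n 1 zfs destroy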

The ZFS snapshots can either be dumped to plain files or sent directly to another server.
As a first step, we could send the backups to banco; a rough sketch of the direct send follows.
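This assumes a data/backup/dali dataset already exists on banco and that root ssh access from dali is allowed (both hypothetical):

zfs send -v data/postgresql@001 | ssh root@banco zfs receive -u data/backup/dali/postgresql
zfs send -v -i data/postgresql@001 data/postgresql@002 | ssh root@banco zfs receive -u data/backup/dali/postgresql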

Examples of the plain-file approach:

  • Creating an initial full backup
zfs send -v data/postgresql@001 | pigz -9 > /tmp/pg_001-full.gz
  • Creating an incremental backup based on the previous backup
zfs send -v -i data/postgresql@001 data/postgresql@002 | pigz -9 > /tmp/pg_002-inc.gz
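
Restoring from these dumps would roughly be the reverse (the target dataset name is illustrative):

# restore the full backup into a fresh dataset
pigz -d -c /tmp/pg_001-full.gz | zfs receive data/postgresql-restore
# then replay the incremental on top of it
pigz -d -c /tmp/pg_002-inc.gz | zfs receive data/postgresql-restore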

On dali, a compressed full snapshot takes 45 GB and ~15 min to produce.
This could probably be improved by compressing on a server with more CPUs (see the sketch after the transcript below).

root@dali:~# zfs snapshot data/postgresql@test
root@dali:~# zfs list -t all
NAME                   USED  AVAIL     REFER  MOUNTPOINT
data                  61.9G   131G       24K  /data
data/postgresql       61.8G   131G     61.8G  /srv/postgresql/14/main
data/postgresql@test   374K      -     61.8G  -
root@dali:~# time zfs send -v data/postgresql@test | pigz -9 > /data/backup.gz
full send of data/postgresql@test estimated size is 66.9G
total estimated size is 66.9G
TIME        SENT   SNAPSHOT data/postgresql@test
09:39:43   37.9M   data/postgresql@test
09:39:44   73.7M   data/postgresql@test
...
09:55:38   66.9G   data/postgresql@test

real	15m57.296s
user	55m32.894s
sys	1m58.312s

root@dali:~# ls -alh /data
total 45G
drwxr-xr-x  2 root root    3 Jan 27 09:39 .
drwxr-xr-x 19 root root 4.0K Jan 26 06:03 ..
-rw-r--r--  1 root root  45G Jan 27 09:55 backup.gz
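
As mentioned above, the compression could be offloaded by piping the raw stream to a host with more CPUs, at the cost of sending the uncompressed data over the network (hostname and path are illustrative):

zfs send -v data/postgresql@test | ssh somehost "pigz -9 > /srv/backups/dali-pg_test-full.gz"
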
vsellier changed the task status from Open to Work in Progress. Jan 27 2022, 10:50 AM

The dali database directory tree was prepared to have a dedicated dataset mounted for the WALs:

root@dali:~# date
Tue Feb  8 18:48:57 UTC 2022
root@dali:~# systemctl stop postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 5ms ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# systemctl status postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 10s ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# mv /srv/postgresql/14/main/pg_wal /srv/postgresql/14/main/pg_wal-save
root@dali:~# zfs create -o atime=off -o compression=zstd -o mountpoint=/srv/postgresql/14/main/pg_wal data/postgresql/wal
root@dali:~# zfs list
NAME                  USED  AVAIL     REFER  MOUNTPOINT
data                 66.8G   126G       24K  /data
data/postgresql      66.7G   126G     66.7G  /srv/postgresql/14/main
data/postgresql/wal    24K   126G       24K  /srv/postgresql/14/main/pg_wal
root@dali:~# cp -r /srv/postgresql/14/main/pg_wal-save/* /srv/postgresql/14/main/pg_wal
root@dali:~# chown -R postgres: /srv/postgresql/14/main/pg_wal
root@dali:~# systemctl start postgresql@14-main
root@dali:~# date
Tue Feb  8 18:51:23 UTC 2022

The first local snapshots worked:

root@dali:~# zfs list -t all
NAME                                                       USED  AVAIL     REFER  MOUNTPOINT
data                                                      66.7G   126G       24K  /data
data/postgresql                                           66.6G   126G     66.6G  /srv/postgresql/14/main
data/postgresql@autosnap_2022-02-08_19:04:44_monthly      1.47M      -     66.6G  -
data/postgresql@autosnap_2022-02-08_19:04:44_daily         194K      -     66.6G  -
data/postgresql/wal                                       31.8M   126G     14.9M  /srv/postgresql/14/main/pg_wal
data/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  16.3M      -     31.3M  -
data/postgresql/wal@autosnap_2022-02-08_19:04:44_daily      13K      -     15.0M  -

The remote sync will be implemented later.
The snapshots will probably be sent to an Azure VM with some storage (it will depend on the tests), as we don't have servers with free ZFS storage (except saam, but I'm not comfortable using the archive's main ZFS storage for that).

  • backup01 VM created on Azure
  • zfs installed (will be reported in puppet):
    • add the contrib repository
    • install zfs
# apt install linux-headers-cloud-amd64 zfs-dkms
  • configure the zfs pool
root@backup01:~# fdisk /dev/sdc -l
Disk /dev/sdc: 200 GiB, 214748364800 bytes, 419430400 sectors
Disk model: Virtual Disk    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D0FB08C6-F046-F340-AC8B-D6C9372015D5

Device         Start       End   Sectors  Size Type

root@backup01:~# zpool create data -o ashift=12   wwn-0x6002248004cd2d289007c07940782f60
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   360K   193G       96K  /data
root@backup01:~# zfs set canmount=noauto data
root@backup01:~# zfs set mountpoint=none data
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   480K   193G       96K  none
root@backup01:~# zfs create data/sync
root@backup01:~# zfs create data/sync/dali

root@backup01:~# zfs list 
NAME             USED  AVAIL     REFER  MOUNTPOINT
data             732K   193G       96K  none
data/sync        192K   193G       96K  none
data/sync/dali    96K   193G       96K  none
  • the opnsense gateway on Azure needed to be updated to add the admin IP range (192.168.50.0/24)
  • New rule added on opnsense to allow ssh connections from the backup server
  • puppet configuration to replicate the admin database snapshots
    • done in the following diffs
vsellier moved this task from in-progress to done on the System administration board.

The replication is in place:

root@backup01:~# zfs list -t all
NAME                                                                 USED  AVAIL     REFER  MOUNTPOINT
data                                                                 120G  72.5G       96K  none
data/sync                                                            120G  72.5G       96K  none
data/sync/dali                                                       120G  72.5G       96K  none
data/sync/dali/postgresql                                            120G  72.5G     73.2G  none
data/sync/dali/postgresql@autosnap_2022-02-08_19:04:44_monthly      22.9G      -     72.3G  -
data/sync/dali/postgresql@autosnap_2022-02-18_00:00:01_daily        3.11G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-19_00:00:01_daily        2.43G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-20_00:00:01_daily        2.39G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-21_00:00:01_daily        2.44G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-22_00:00:00_daily        2.47G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-23_00:00:02_daily        2.56G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily           0B      -     73.2G  -
data/sync/dali/postgresql/wal                                        600M  72.5G     88.4M  none
data/sync/dali/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  61.9M      -     61.9M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-18_00:00:01_daily    90.9M      -      107M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-19_00:00:01_daily    94.7M      -      111M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-20_00:00:01_daily    55.4M      -     87.5M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-21_00:00:01_daily    50.1M      -     98.2M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-22_00:00:00_daily    57.7M      -      106M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-23_00:00:02_daily    52.8M      -     68.8M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-24_00:00:00_daily       0B      -     88.4M  -

The retention will be 2 monthly snapshots and 30 daily snapshots.
The space usage should be around 200 GB, so we will probably have to extend the data disk a little.
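
Assuming the snapshot management is handled by sanoid (which the autosnap_* snapshot names and the syncoid units suggest), the retention on backup01 would translate into something like the following sanoid.conf stanza (a sketch, not the deployed configuration):

[data/sync/dali/postgresql]
        use_template = backup
        recursive = yes

[template_backup]
        autosnap = no
        autoprune = yes
        hourly = 0
        daily = 30
        monthly = 2
        yearly = 0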

The initial transfer took ~40 min:

Feb 24 09:55:15 backup01 systemd[1]: Starting ZFS dataset synchronization of...
Feb 24 09:55:16 backup01 syncoid[93491]: NEWEST SNAPSHOT: autosnap_2022-02-24_00:00:00_daily
Feb 24 09:55:16 backup01 syncoid[93491]: INFO: Sending oldest full snapshot data/postgresql@autosnap_2022-02-08_19:04:44_monthly (~ 72.4 GB) to new target filesystem:
Feb 24 10:52:12 backup01 syncoid[93491]: INFO: Updating new target filesystem with incremental data/postgresql@autosnap_2022-02-08_19:04:44_monthly ... autosnap_2022-02-24_00:00:00_daily (~ 47.4 GB):
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Succeeded.
Feb 24 11:29:17 backup01 systemd[1]: Finished ZFS dataset synchronization of.
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Consumed 19min 51.324s CPU time.
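
The syncoid unit presumably wraps an invocation along these lines (the exact options are an assumption, not taken from the puppet manifests):

syncoid --recursive --no-sync-snap root@dali:data/postgresql data/sync/dali/postgresql

--no-sync-snap keeps syncoid from creating its own snapshots, so it only replicates the existing sanoid ones.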

The read speed on the Azure disk is the same as the write speed, so if a full recovery is ever needed, we can expect a recovery time of less than 1 hour.
We can probably reduce this time by changing the data disk configuration on Azure, but at some point we will be limited by the VPN bandwidth.
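
Such a recovery would roughly amount to sending the latest snapshot back and restarting the cluster, along these lines (a sketch only; the snapshot name is illustrative and the procedure would have to be validated beforehand):

# on backup01: push the latest snapshot (including the wal child dataset) back to dali
# (postgresql must be stopped on dali during the receive)
zfs send -R data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily \
  | ssh root@dali zfs receive -F data/postgresql
# on dali: restart the cluster once the datasets are back in place
systemctl start postgresql@14-main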