
Cross replicate the staging storage between db1 and storage1
Closed, Migrated

Description

The out-of-warranty servers used for db1 and storage1 are sufficient for staging.
It would be a waste to replace them with new servers, so we will keep them for a while.

As they are out of warranty, we still need to secure the data on them in case of a hardware issue.

The data is stored on ZFS datasets and can be cross-replicated between db1 and storage1.
The servers can handle the volume to replicate:

  • db1 -> storage1: <20 MB/min in normal usage, ~1 GB/min during a peak of activity such as a VACUUM
  • storage1 -> db1: ~300 MB/min
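
To put these rates in perspective, here is a rough sketch that converts them into the amount of data each sync would have to ship. It assumes the 5-minute sync interval reported later in this task, takes the "<20 MB/min" figure as an upper bound, and reads "~1 GB/min" as 1000 MB/min. The function names are only illustrative.

```python
# Rates taken from the task description, in MB per minute.
RATES_MB_PER_MIN = {
    "db1 -> storage1 (normal usage)": 20,     # "<20 MB/min", used as an upper bound
    "db1 -> storage1 (VACUUM peak)": 1000,    # "~1 GB/min"
    "storage1 -> db1": 300,                   # "~300 MB/min"
}

# Assumption: one sync every 5 minutes, as observed later in this task.
SYNC_INTERVAL_MIN = 5


def per_sync_mb(rate_mb_per_min, interval_min=SYNC_INTERVAL_MIN):
    """Data accumulated between two syncs, in MB."""
    return rate_mb_per_min * interval_min


def throughput_mb_per_s(rate_mb_per_min):
    """Sustained throughput needed to keep up with the write rate, in MB/s."""
    return rate_mb_per_min / 60


for name, rate in RATES_MB_PER_MIN.items():
    print(f"{name}: {per_sync_mb(rate)} MB per sync, "
          f"{throughput_mb_per_s(rate):.1f} MB/s sustained")
```

Even the VACUUM peak works out to roughly 17 MB/s sustained, which is modest compared to what the initial full replications achieve below.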

Event Timeline

vsellier renamed this task from Replicate the staging storage between db1 and storage1 to Cross replicate the staging storage between db1 and storage1. Feb 7 2022, 10:29 AM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Feb 10 2022, 2:35 PM
vsellier claimed this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

D7173 landed. It initially focuses on the db1 -> storage1 replication, to avoid running several initial replications at the same time. The storage1 -> db1 replication will be configured once the initial db1 replication is done.
The replication is set up as follows (initiated by storage1):
db1 dataset data/postgres-main-12 replicated on storage1 /data/sync/db1/postgresql-main-12

When syncoid runs, it creates a snapshot on both servers marking the last sync point.
For example, from my local VMs:

  • on db1:
root@db1:/srv/softwareheritage/postgres/12# zfs list -t snapshot
NAME                                                                                      USED  AVAIL     REFER  MOUNTPOINT
data/postgres-main-12@syncoid_storage1.internal.staging.swh.network_2022-02-15:10:47:39   288K      -      113M  -
  • on storage1
root@storage1:/etc/systemd/system# zfs list -t snapshot
NAME                                                                                                 USED  AVAIL     REFER  MOUNTPOINT
data/sync/db1/postgresql-main-12@syncoid_storage1.internal.staging.swh.network_2022-02-15:10:47:39     0B      -      113M  -
  • dataset preparations on storage1:
root@storage1:~# zfs list -t all
NAME           USED  AVAIL     REFER  MOUNTPOINT
data          13.0T  13.4T       96K  /data
data/kafka     712G  13.4T      712G  /srv/kafka
data/objects  12.3T  13.4T     12.3T  /srv/softwareheritage/objects

root@storage1:~# zfs create data/sync
root@storage1:~# zfs create data/sync/db1


root@storage1:~# zfs get all data/sync/db1 | grep compress
data/sync  compressratio         1.00x                  -
data/sync  compression           lz4                    inherited from data
data/sync  refcompressratio      1.00x                  -

syncoid will create the last level of the dataset tree.

  • run puppet on db1 and storage1
  • the sync started correctly:
root@storage1:~# journalctl -u syncoid-db1-postgresql-main-12.service
-- Journal begins at Tue 2022-02-08 09:32:24 UTC, ends at Tue 2022-02-15 11:04:40 UTC. --
Feb 15 11:03:24 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 11:03:25 storage1 syncoid[490545]: INFO: Sending oldest full snapshot data/postgres-main-12@syncoid_storage1_2022-02-15:11:03:25 (~ 1723.7 GB) to new target filesystem:

root@storage1:~# zfs list -t all
NAME                               USED  AVAIL     REFER  MOUNTPOINT
data                              13.0T  13.4T       96K  /data
data/kafka                         712G  13.4T      712G  /srv/kafka
data/objects                      12.3T  13.4T     12.3T  /srv/softwareheritage/objects
data/sync                         11.0G  13.4T       96K  /data/sync
data/sync/db1                     11.0G  13.4T       96K  /data/sync/db1
data/sync/db1/postgresql-main-12  11.0G  13.4T     11.0G  /data/sync/db1/postgresql-main-12
vsellier@db1 ~ % /usr/sbin/zfs list -t all
NAME                                                         USED  AVAIL     REFER  MOUNTPOINT
data                                                         731G  25.7T       96K  /data
data/postgres-indexer-12                                      96K  25.7T       96K  /srv/softwareheritage/postgres/12/indexer
data/postgres-main-12                                        728G  25.7T      726G  /srv/softwareheritage/postgres/12/main
data/postgres-main-12@syncoid_storage1_2022-02-15:11:03:25  1.79G      -      726G  -
data/postgres-misc                                           112K  25.7T      112K  /srv/softwareheritage/postgres
data/postgres-secondary-12                                    96K  25.7T       96K  /srv/softwareheritage/postgres/12/secondary

Let's monitor the initial replication and the behavior of the regular one

  • The initial synchronization took 2h20.
  • After a stabilization period, the synchronization runs every 5 min and takes ~1 min (the sizes are logged uncompressed and must be divided by ~2.5 to get the real size).
root@storage1:~# zfs list -t all
NAME                                                                    USED  AVAIL     REFER  MOUNTPOINT
data                                                                   13.7T  12.7T       96K  /data
data/kafka                                                              712G  12.7T      712G  /srv/kafka
data/objects                                                           12.3T  12.7T     12.3T  /srv/softwareheritage/objects
data/sync                                                               726G  12.7T       96K  /data/sync
data/sync/db1                                                           726G  12.7T       96K  /data/sync/db1
data/sync/db1/postgresql-main-12                                        726G  12.7T      726G  /data/sync/db1/postgresql-main-12
data/sync/db1/postgresql-main-12@syncoid_storage1_2022-02-15:13:50:18   208M      -      726G  -
Feb 15 11:03:24 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 11:03:25 storage1 syncoid[490545]: INFO: Sending oldest full snapshot data/postgres-main-12@syncoid_storage1_2022-02-15:11:03:25 (~ 1723.7 GB) to new target filesystem:
Feb 15 13:23:27 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:23:27 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:23:27 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 3h 43min 24.744s CPU time.
Feb 15 13:23:27 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:23:29 storage1 syncoid[2896448]: Sending incremental data/postgres-main-12@syncoid_storage1_2022-02-15:11:03:25 ... syncoid_storage1_2022-02-15:13:23:27 (~ 99.8 GB):
Feb 15 13:32:51 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:32:51 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:32:51 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 15min 5.414s CPU time.
Feb 15 13:32:51 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:32:54 storage1 syncoid[3090904]: Sending incremental data/postgres-main-12@syncoid_storage1_2022-02-15:13:23:27 ... syncoid_storage1_2022-02-15:13:32:52 (~ 13.9 GB):
Feb 15 13:34:22 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:34:22 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:34:22 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 2min 8.872s CPU time.
Feb 15 13:38:41 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:38:43 storage1 syncoid[3157306]: Sending incremental data/postgres-main-12@syncoid_storage1_2022-02-15:13:32:52 ... syncoid_storage1_2022-02-15:13:38:41 (~ 7.4 GB):
Feb 15 13:39:33 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:39:33 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:39:33 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 1min 8.053s CPU time.
Feb 15 13:44:29 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:44:31 storage1 syncoid[3210102]: Sending incremental data/postgres-main-12@syncoid_storage1_2022-02-15:13:38:41 ... syncoid_storage1_2022-02-15:13:44:30 (~ 8.0 GB):
Feb 15 13:45:28 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:45:28 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:45:28 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 1min 11.010s CPU time.
Feb 15 13:50:18 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:50:20 storage1 syncoid[3271955]: Sending incremental data/postgres-main-12@syncoid_storage1_2022-02-15:13:44:30 ... syncoid_storage1_2022-02-15:13:50:18 (~ 6.5 GB):
Feb 15 13:51:07 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Succeeded.
Feb 15 13:51:07 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:51:07 storage1 systemd[1]: syncoid-db1-postgresql-main-12.service: Consumed 1min 255ms CPU time.
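
The wall-clock duration of each run can be read off the journal by pairing every "Starting" line with the next "Finished" line. Below is a minimal sketch of that pairing. The sample lines are a subset of the log above; the year is passed in explicitly because journal lines in this format do not carry one, and the function name is only illustrative.

```python
import re
from datetime import datetime

# Matches systemd's own messages, e.g.
# "Feb 15 11:03:24 storage1 systemd[1]: Starting ZFS dataset synchronization of..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: (?P<msg>.*)$"
)


def run_durations(lines, year=2022):
    """Return (start, duration) for each Starting/Finished pair found in lines."""
    durations = []
    started = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # syncoid's own output or unrelated lines
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        msg = m["msg"]
        if msg.startswith("Starting ZFS dataset synchronization"):
            started = ts
        elif msg.startswith("Finished ZFS dataset synchronization") and started:
            durations.append((started, ts - started))
            started = None
    return durations


# Subset of the journal shown above.
JOURNAL = """\
Feb 15 11:03:24 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:23:27 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:23:27 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:32:51 storage1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 15 13:50:18 storage1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 15 13:51:07 storage1 systemd[1]: Finished ZFS dataset synchronization of.
""".splitlines()

for start, took in run_durations(JOURNAL):
    print(start.time(), took)
```

On this subset the initial run comes out at 2h20, the first catch-up run at a bit over 9 minutes, and the last run at 49 seconds, which matches the figures quoted above. Note that the "Consumed ... CPU time" lines report CPU time, not elapsed time, so they are deliberately ignored here.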

kafka data replication:

  • prepare the dataset (ensure there are no mounts this time)
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync/storage1
root@db1:~# zfs list
NAME                         USED  AVAIL     REFER  MOUNTPOINT
data                         736G  25.7T       96K  /data
data/postgres-indexer-12      96K  25.7T       96K  /srv/softwareheritage/postgres/12/indexer
data/postgres-main-12        733G  25.7T      729G  /srv/softwareheritage/postgres/12/main
data/postgres-misc           112K  25.7T      112K  /srv/softwareheritage/postgres
data/postgres-secondary-12    96K  25.7T       96K  /srv/softwareheritage/postgres/12/secondary
data/sync                    192K  25.7T       96K  none
data/sync/storage1            96K  25.7T       96K  none
  • land D7179
  • run puppet on db1 and storage1
  • initial synchronization started:
Feb 17 13:05:09 db1 syncoid[999999]: INFO: Sending oldest full snapshot data/kafka@syncoid_db1_2022-02-17:13:05:09 (~ 1686.6 GB) to new target filesystem:
db1 ~ % /usr/sbin/zfs list -t all 
NAME                                                         USED  AVAIL     REFER  MOUNTPOINT
data                                                         749G  25.7T       96K  /data
data/postgres-indexer-12                                      96K  25.7T       96K  /srv/softwareheritage/postgres/12/indexer
data/postgres-main-12                                        730G  25.7T      729G  /srv/softwareheritage/postgres/12/main
data/postgres-main-12@syncoid_storage1_2022-02-17:13:05:53  1.39G      -      729G  -
data/postgres-misc                                           112K  25.7T      112K  /srv/softwareheritage/postgres
data/postgres-secondary-12                                    96K  25.7T       96K  /srv/softwareheritage/postgres/12/secondary
data/sync                                                   16.5G  25.7T       96K  none
data/sync/storage1                                          16.5G  25.7T       96K  none
data/sync/storage1/kafka                                    16.5G  25.7T     16.5G  none

The first replication took about 2h20 of wall-clock time (3h18 of CPU time). Once stabilized, each sync should be a matter of seconds:

Feb 17 13:05:08 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 13:05:09 db1 syncoid[999999]: INFO: Sending oldest full snapshot data/kafka@syncoid_db1_2022-02-17:13:05:09 (~ 1686.6 GB) to new target filesystem:
Feb 17 15:24:19 db1 systemd[1]: syncoid-storage1-kafka.service: Succeeded.
Feb 17 15:24:19 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 17 15:24:19 db1 systemd[1]: syncoid-storage1-kafka.service: Consumed 3h 18min 46.955s CPU time.
Feb 17 15:24:19 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:24:21 db1 syncoid[170829]: Sending incremental data/kafka@syncoid_db1_2022-02-17:13:05:09 ... syncoid_db1_2022-02-17:15:24:20 (~ 6.9 GB):
Feb 17 15:24:58 db1 systemd[1]: syncoid-storage1-kafka.service: Succeeded.
Feb 17 15:24:58 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 17 15:24:58 db1 systemd[1]: syncoid-storage1-kafka.service: Consumed 54.214s CPU time.
Feb 17 15:30:09 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:30:10 db1 syncoid[216183]: Sending incremental data/kafka@syncoid_db1_2022-02-17:15:24:20 ... syncoid_db1_2022-02-17:15:30:10 (~ 222.0 MB):
Feb 17 15:30:12 db1 systemd[1]: syncoid-storage1-kafka.service: Succeeded.
Feb 17 15:30:12 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 17 15:30:12 db1 systemd[1]: syncoid-storage1-kafka.service: Consumed 1.919s CPU time.
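
The earlier remark that syncoid logs uncompressed sizes can be cross-checked with figures already present in this task: the size announced for each initial send versus the size reported by zfs list for the same dataset. The sketch below does that arithmetic. It treats the "GB" of the syncoid log and the "G" of zfs list as the same unit, which is an approximation, and the helper names are only illustrative.

```python
# (size announced by syncoid for the full send, size shown by `zfs list`)
# Both values are copied from the transcripts in this task.
SAMPLES = {
    "data/postgres-main-12": (1723.7, 726),
    "data/kafka": (1686.6, 712),
}


def ratio(logged_gb, on_disk_g):
    """How many times larger the logged size is than the on-disk size."""
    return logged_gb / on_disk_g


def on_disk_estimate(logged_gb, r):
    """Rough on-disk size of a send, given its logged size and a ratio."""
    return logged_gb / r


for dataset, (logged, on_disk) in SAMPLES.items():
    print(f"{dataset}: logged/on-disk = {ratio(logged, on_disk):.2f}")

# Example: an incremental logged as ~6.5 GB, using the postgres ratio.
r = ratio(*SAMPLES["data/postgres-main-12"])
print(f"6.5 GB logged is roughly {on_disk_estimate(6.5, r):.1f} GB on disk")
```

Both datasets land close to 2.4, in line with the "divide by ~2.5" rule of thumb. The ratio depends on how compressible the data is, so it should not be assumed for other datasets.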

Objects replication:

  • land D7180
  • run puppet on db1 and storage1
  • the sync automatically starts:
Feb 17 15:41:22 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:41:23 db1 syncoid[283583]: INFO: Sending oldest full snapshot data/objects@syncoid_db1_2022-02-17:15:41:23 (~ 11811.3 GB) to new target filesystem:

It will take some time to complete.

The replication of the object storage is now running correctly:

-- Journal begins at Thu 2022-02-17 04:52:45 UTC, ends at Mon 2022-02-21 07:44:15 UTC. --
Feb 17 15:41:22 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:41:23 db1 syncoid[283583]: INFO: Sending oldest full snapshot data/objects@syncoid_db1_2022-02-17:15:41:23 (~ 11811.3 GB) to new target filesystem:
Feb 19 13:41:09 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 13:41:09 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 13:41:09 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 1d 10h 59min 6.865s CPU time.
Feb 19 13:41:09 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 13:41:11 db1 syncoid[3716482]: Sending incremental data/objects@syncoid_db1_2022-02-17:15:41:23 ... syncoid_db1_2022-02-19:13:41:09 (~ 130.3 GB):
Feb 19 14:29:18 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:29:18 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:29:18 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 25min 43.311s CPU time.
Feb 19 14:29:18 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:29:25 db1 syncoid[1084137]: Sending incremental data/objects@syncoid_db1_2022-02-19:13:41:09 ... syncoid_db1_2022-02-19:14:29:18 (~ 5.3 GB):
Feb 19 14:31:12 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:31:12 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:31:12 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 1min 7.439s CPU time.
Feb 19 14:35:03 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:35:07 db1 syncoid[1174209]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:29:18 ... syncoid_db1_2022-02-19:14:35:04 (~ 710.1 MB):
Feb 19 14:35:35 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:35:35 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:35:35 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 10.015s CPU time.
Feb 19 14:40:48 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:40:52 db1 syncoid[1223955]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:35:04 ... syncoid_db1_2022-02-19:14:40:49 (~ 271.6 MB):
Feb 19 14:41:14 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:41:14 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:41:14 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 5.701s CPU time.
Feb 19 14:46:32 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:46:37 db1 syncoid[1267267]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:40:49 ... syncoid_db1_2022-02-19:14:46:33 (~ 461.8 MB):
Feb 19 14:47:05 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:47:05 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:47:05 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 8.945s CPU time.
Feb 19 14:52:18 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:52:22 db1 syncoid[1312265]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:46:33 ... syncoid_db1_2022-02-19:14:52:19 (~ 263.2 MB):
Feb 19 14:52:42 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:52:42 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:52:42 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 6.021s CPU time.
Feb 19 14:58:04 db1 systemd[1]: Starting ZFS dataset synchronization of...
...

All the syncs are up to date, with a maximum lag of 5 min:

root@db1:~# zfs list -t snapshot
NAME                                                         USED  AVAIL     REFER  MOUNTPOINT
data/postgres-main-12@syncoid_storage1_2022-02-21:07:44:04   690M      -      734G  -
data/sync/storage1/kafka@syncoid_db1_2022-02-21:07:42:13       0B      -      721G  -
data/sync/storage1/objects@syncoid_db1_2022-02-21:07:42:01     0B      -     12.6T  -
vsellier@storage1 ~ % /usr/sbin/zfs list -t snapshot 
NAME                                                                    USED  AVAIL     REFER  MOUNTPOINT
data/kafka@syncoid_db1_2022-02-21:07:42:13                             59.8M      -      721G  -
data/objects@syncoid_db1_2022-02-21:07:42:01                            907M      -     12.6T  -
data/sync/db1/postgresql-main-12@syncoid_storage1_2022-02-21:07:44:04     0B      -      734G  -
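
The lag can also be derived from the snapshot names themselves, since they embed the sync timestamp. The sketch below parses names of the form seen in the listings (dataset, then `@syncoid_`, the host that ran syncoid, and a timestamp) and computes the worst lag against a reference time. The name pattern is inferred from the listings in this task, and the reference time is an assumption: the end of the journal shown earlier (2022-02-21 07:44:15 UTC), not the exact time the listing was taken.

```python
import re
from datetime import datetime

# <dataset>@syncoid_<host>_<YYYY-MM-DD:HH:MM:SS>, as seen in the listings above.
SNAP_RE = re.compile(
    r"^(?P<dataset>[^@]+)@syncoid_(?P<host>.+)_"
    r"(?P<ts>\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2})$"
)


def parse_snapshot(name):
    """Split a syncoid snapshot name into (dataset, host, timestamp)."""
    m = SNAP_RE.match(name)
    if m is None:
        raise ValueError(f"not a syncoid snapshot name: {name!r}")
    return m["dataset"], m["host"], datetime.strptime(m["ts"], "%Y-%m-%d:%H:%M:%S")


def max_lag_seconds(names, now):
    """Age in seconds of the oldest 'last sync' snapshot among names."""
    return max((now - parse_snapshot(n)[2]).total_seconds() for n in names)


# Snapshot names copied from the db1 listing above.
SNAPSHOTS = [
    "data/postgres-main-12@syncoid_storage1_2022-02-21:07:44:04",
    "data/sync/storage1/kafka@syncoid_db1_2022-02-21:07:42:13",
    "data/sync/storage1/objects@syncoid_db1_2022-02-21:07:42:01",
]

# Assumption: reference time taken from the end of the journal shown earlier.
NOW = datetime(2022, 2, 21, 7, 44, 15)

lag = max_lag_seconds(SNAPSHOTS, NOW)
print(f"maximum lag: {lag:.0f}s (threshold: 300s)")
```

With that reference time the oldest snapshot is a little over two minutes old, comfortably within the 5-minute interval.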

Regarding monitoring: disk space is covered by the usual disk probes, and the replication is monitored through the systemd status of the syncoid services.
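
As an illustration of what a check on the systemd status could look like, here is a minimal sketch built on `systemctl show`, which prints `Key=Value` lines for the requested properties. This is not the probe actually deployed: the health rule, the helper names, and the two sample outputs are assumptions for illustration and were not captured from db1 or storage1. Only the unit names come from the journals above.

```python
import subprocess

# Unit names as they appear in the journals above.
UNITS = [
    "syncoid-db1-postgresql-main-12.service",
    "syncoid-storage1-kafka.service",
    "syncoid-storage1-objects.service",
]


def parse_show(output):
    """Turn `systemctl show` Key=Value output into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def is_healthy(props):
    """Assumed rule: the unit is not failed and its last run succeeded."""
    return (props.get("ActiveState") != "failed"
            and props.get("Result", "success") == "success")


def query_unit(unit):
    """Query a live unit; only meaningful on a host where the unit exists."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,Result"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_show(out)


# Illustrative outputs (hypothetical, not captured from the servers).
OK_SAMPLE = "ActiveState=inactive\nResult=success\n"
FAILED_SAMPLE = "ActiveState=failed\nResult=exit-code\n"

print(is_healthy(parse_show(OK_SAMPLE)))
print(is_healthy(parse_show(FAILED_SAMPLE)))
```

On one of the servers, `is_healthy(query_unit(unit))` for each entry in `UNITS` would give a per-service status. A unit that sits inactive between two timer runs counts as healthy under this rule as long as its last result was a success.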