update commit message
Feb 24 2022
avoid an unnecessary update if the no-sync-snap option is not specified
Feb 23 2022
- backup01 VM created on Azure
- zfs installed (will be reported in puppet):
- add contrib repository
- install zfs
# apt install linux-headers-cloud-amd64 zfs-dkms
- configure the zfs pool (a sketch follows after this list)
root@backup01:~# fdisk /dev/sdc -l
Disk /dev/sdc: 200 GiB, 214748364800 bytes, 419430400 sectors
Disk model: Virtual Disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D0FB08C6-F046-F340-AC8B-D6C9372015D5
- Assign a static IP so as not to use an address in the middle of the workers' range
- Ensure the data disk is not deleted in case of accidental removal of the VM
- Use a supported RSA key
- fix the ssh-key provisioning
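As flagged above, a minimal sketch of what the pool configuration step could look like on the 200 GiB data disk reported by fdisk, assuming a pool named data built directly on /dev/sdc; the actual pool and dataset layout on backup01 is not captured in this entry:
# hypothetical commands, names and properties are assumptions
zpool create -o ashift=12 data /dev/sdc
zfs set compression=lz4 atime=off data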
Feb 22 2022
Update facts:
- Remove the location entry
- Add the deployment variable
- Add the subnet variable
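For reference, a minimal sketch of how such host-level facts can be exposed, assuming facter's external facts directory is used; the actual mechanism (terraform provisioning vs. puppet) and the values below are placeholders, not what is deployed:
# hypothetical external fact file with placeholder values
cat > /etc/facter/facts.d/provisioning.yaml <<'EOF'
deployment: admin
subnet: default
EOF
facter deployment subnet   # check that the facts are picked up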
After the Elasticsearch restart, there are no more GC overhead messages in the logs, but there were a couple of timeouts during the night.
Further investigation is needed.
A workaround is deployed to restart the sync if it was interrupted by a race condition.
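The workaround itself is puppet-managed; conceptually it boils down to re-launching the sync unit when the previous run did not finish, along these lines (sketch only, unit name taken from the journal excerpts in the Feb 21 entry):
systemctl is-active --quiet syncoid-storage1-objects.service \
  || systemctl start syncoid-storage1-objects.service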
Feb 21 2022
Elasticsearch was restarted and the Sentry issues closed.
Let's monitor whether the GCs come back again.
First, clean up the unused resources, even if that will not free up much:
- aliases cleanup
vsellier@search-esnode0 ~ % export ES_SERVER=192.168.130.80:9200
vsellier@search-esnode0 ~ % curl -XGET http://$ES_SERVER/_cat/aliases
origin-read         origin-v0.11  - - - -
origin-write        origin-v0.11  - - - -
origin-v0.9.0-read  origin-v0.9.0 - - - -
origin-v0.9.0-write origin-v0.9.0 - - - -
vsellier@search-esnode0 ~ % curl -XDELETE http://$ES_SERVER/origin-v0.9.0/_alias/origin-v0.9.0-read
{"acknowledged":true}%
vsellier@search-esnode0 ~ % curl -XDELETE -H "Content-Type: application/json" http://$ES_SERVER/origin-v0.9.0/_alias/origin-v0.9.0-write
{"acknowledged":true}%
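If the old origin-v0.9.0 index itself is confirmed to be unused once its aliases are gone, it could be dropped as well to actually reclaim disk space; this was not done in this entry, it is only the obvious follow-up:
# hypothetical follow-up, not executed here
curl -XDELETE http://$ES_SERVER/origin-v0.9.0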
The replication of object storage is now running correctly:
-- Journal begins at Thu 2022-02-17 04:52:45 UTC, ends at Mon 2022-02-21 07:44:15 UTC. --
Feb 17 15:41:22 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:41:23 db1 syncoid[283583]: INFO: Sending oldest full snapshot data/objects@syncoid_db1_2022-02-17:15:41:23 (~ 11811.3 GB) to new target filesystem:
Feb 19 13:41:09 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 13:41:09 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 13:41:09 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 1d 10h 59min 6.865s CPU time.
Feb 19 13:41:09 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 13:41:11 db1 syncoid[3716482]: Sending incremental data/objects@syncoid_db1_2022-02-17:15:41:23 ... syncoid_db1_2022-02-19:13:41:09 (~ 130.3 GB):
Feb 19 14:29:18 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:29:18 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:29:18 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 25min 43.311s CPU time.
Feb 19 14:29:18 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:29:25 db1 syncoid[1084137]: Sending incremental data/objects@syncoid_db1_2022-02-19:13:41:09 ... syncoid_db1_2022-02-19:14:29:18 (~ 5.3 GB):
Feb 19 14:31:12 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:31:12 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:31:12 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 1min 7.439s CPU time.
Feb 19 14:35:03 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:35:07 db1 syncoid[1174209]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:29:18 ... syncoid_db1_2022-02-19:14:35:04 (~ 710.1 MB):
Feb 19 14:35:35 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:35:35 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:35:35 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 10.015s CPU time.
Feb 19 14:40:48 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:40:52 db1 syncoid[1223955]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:35:04 ... syncoid_db1_2022-02-19:14:40:49 (~ 271.6 MB):
Feb 19 14:41:14 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:41:14 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:41:14 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 5.701s CPU time.
Feb 19 14:46:32 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:46:37 db1 syncoid[1267267]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:40:49 ... syncoid_db1_2022-02-19:14:46:33 (~ 461.8 MB):
Feb 19 14:47:05 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:47:05 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:47:05 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 8.945s CPU time.
Feb 19 14:52:18 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 19 14:52:22 db1 syncoid[1312265]: Sending incremental data/objects@syncoid_db1_2022-02-19:14:46:33 ... syncoid_db1_2022-02-19:14:52:19 (~ 263.2 MB):
Feb 19 14:52:42 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 19 14:52:42 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 19 14:52:42 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 6.021s CPU time.
Feb 19 14:58:04 db1 systemd[1]: Starting ZFS dataset synchronization of...
...
Feb 18 2022
Feb 17 2022
It looks like the server is short on heap:
[2022-02-17T15:26:30,847][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965188] overhead, spent [408ms] collecting in the last [1s]
[2022-02-17T15:27:08,154][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965225] overhead, spent [296ms] collecting in the last [1s]
[2022-02-17T15:29:31,383][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][young][5965368][3283] duration [1s], collections [1]/[1.1s], total [1s]/[5.8m], memory [8.2gb]->[5.4gb]/[16gb], all_pools {[young] [2.8gb]->[0b]/[0b]}{[old] [4.7gb]->[5.3gb]/[16gb]}{[survivor] [652mb]->[184mb]/[0b]}
[2022-02-17T15:29:31,384][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965368] overhead, spent [1s] collecting in the last [1.1s]
[2022-02-17T15:31:49,449][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965506] overhead, spent [260ms] collecting in the last [1s]
[2022-02-17T15:33:46,505][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965623] overhead, spent [256ms] collecting in the last [1s]
[2022-02-17T15:37:11,728][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965828] overhead, spent [372ms] collecting in the last [1s]
[2022-02-17T15:47:19,087][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966435] overhead, spent [289ms] collecting in the last [1s]
[2022-02-17T15:49:56,439][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966592] overhead, spent [315ms] collecting in the last [1.1s]
[2022-02-17T15:55:40,579][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966936] overhead, spent [274ms] collecting in the last [1s]
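To keep an eye on the heap pressure, the _cat/nodes API gives a quick view (with ES_SERVER pointing at the node, as in the Feb 21 entry above):
curl "http://$ES_SERVER/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"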
Objects replication:
- land D7180
- run puppet on db1 and storage1
- the sync automatically starts:
Feb 17 15:41:22 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:41:23 db1 syncoid[283583]: INFO: Sending oldest full snapshot data/objects@syncoid_db1_2022-02-17:15:41:23 (~ 11811.3 GB) to new target filesystem:
It will take some time to complete.
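Progress can be followed on db1 through the syncoid unit's journal:
journalctl -f -u syncoid-storage1-objects.service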
Kafka data replication:
- prepare the dataset (ensure there are no mounts this time; see the check after this block)
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync/storage1
root@db1:~# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
data                         736G  25.7T    96K  /data
data/postgres-indexer-12      96K  25.7T    96K  /srv/softwareheritage/postgres/12/indexer
data/postgres-main-12        733G  25.7T   729G  /srv/softwareheritage/postgres/12/main
data/postgres-misc           112K  25.7T   112K  /srv/softwareheritage/postgres
data/postgres-secondary-12    96K  25.7T    96K  /srv/softwareheritage/postgres/12/secondary
data/sync                    192K  25.7T    96K  none
data/sync/storage1            96K  25.7T    96K  none
- land D7179
- run puppet on db1 and storage1
- initial synchronization started:
Feb 17 13:05:09 db1 syncoid[999999]: INFO: Sending oldest full snapshot data/kafka@syncoid_db1_2022-02-17:13:05:09 (~ 1686.6 GB) to new target filesystem:
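As referenced above, a quick check that the new parent datasets are indeed not mounted (the point of the canmount=noauto / mountpoint=none options):
zfs get -r canmount,mounted,mountpoint data/sync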
Yes, my bad, it's due to T3911.
Feb 15 2022
- The initial synchronization took 2h20
- After a stabilization period, the synchronization runs every 5 minutes and takes ~1 minute (the sizes are logged uncompressed and must be divided by ~2.5 to get the real size)
D7173 landed. It initially focuses on the db1 -> storage1 replication to avoid running several initial replications at the same time. The storage1 -> db1 replication will be configured once the initial db1 replication is done.
The replication will be done in this way (initiated by storage1):
The db1 dataset data/postgres-main-12 is replicated to /data/sync/db1/postgresql-main-12 on storage1.
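For reference, the underlying syncoid invocation for that layout, run from storage1, would look roughly like the following; the real unit and options are puppet-managed and may differ:
# pull replication from db1, per the layout described above (sketch only)
syncoid --no-sync-snap root@db1:data/postgres-main-12 data/sync/db1/postgresql-main-12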
fix the doc of the key name computation
Feb 14 2022
Feb 11 2022
Feb 10 2022
Feb 9 2022
Feb 8 2022
The first local snapshots worked:
root@dali:~# zfs list -t all
NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
data                                                      66.7G   126G    24K  /data
data/postgresql                                           66.6G   126G  66.6G  /srv/postgresql/14/main
data/postgresql@autosnap_2022-02-08_19:04:44_monthly      1.47M      -  66.6G  -
data/postgresql@autosnap_2022-02-08_19:04:44_daily         194K      -  66.6G  -
data/postgresql/wal                                       31.8M   126G  14.9M  /srv/postgresql/14/main/pg_wal
data/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  16.3M      -  31.3M  -
data/postgresql/wal@autosnap_2022-02-08_19:04:44_daily      13K      -  15.0M  -
rebase
The dali database directory tree was prepared to give the WALs a dedicated dataset and mountpoint:
root@dali:~# date
Tue Feb 8 18:48:57 UTC 2022
root@dali:~# systemctl stop postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 5ms ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s
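The remaining preparation steps are not captured in this entry; roughly, they amount to creating the dedicated WAL dataset at the pg_wal location visible in the zfs list output above and moving the existing WAL files into it before restarting the cluster, along these lines (sketch only, the exact commands run on dali may have differed):
mv /srv/postgresql/14/main/pg_wal /srv/postgresql/14/main/pg_wal.old
zfs create -o mountpoint=/srv/postgresql/14/main/pg_wal data/postgresql/wal
mv /srv/postgresql/14/main/pg_wal.old/* /srv/postgresql/14/main/pg_wal/
chown -R postgres:postgres /srv/postgresql/14/main/pg_wal
chmod 700 /srv/postgresql/14/main/pg_wal
rmdir /srv/postgresql/14/main/pg_wal.old
systemctl start postgresql@14-main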
Use a template instead of the stdlib::to_toml function, which is not compatible with puppet 5.
thanks, I will fix that.
update commit message
- add the postgresql backup management script
- ensure the snapshot of the wal is done after the postgresql snapshot
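The ordering constraint in the second point boils down to something like this (hypothetical snapshot names; the actual logic lives in the puppet-managed backup management script mentioned above):
TS=$(date -u +%Y-%m-%dT%H:%M:%S)
zfs snapshot data/postgresql@backup_$TS       # database dataset first
zfs snapshot data/postgresql/wal@backup_$TS   # WAL dataset second, after the data snapshot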
Update to only keep the local snapshot section.
The sync deployment will be implemented in another diff.
Feb 7 2022
The exporter is deployed.
The Varnish stats are available on this dashboard: https://grafana.softwareheritage.org/d/pE2xMZank/varnish
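A quick sanity check of the exporter output, assuming it listens on the usual prometheus varnish exporter port (9131) on the Varnish host:
curl -s http://localhost:9131/metrics | grep -c '^varnish_'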