Feb 18 2022
Feb 17 2022
looks like the server is short on heap:
[2022-02-17T15:26:30,847][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965188] overhead, spent [408ms] collecting in the last [1s]
[2022-02-17T15:27:08,154][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965225] overhead, spent [296ms] collecting in the last [1s]
[2022-02-17T15:29:31,383][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][young][5965368][3283] duration [1s], collections [1]/[1.1s], total [1s]/[5.8m], memory [8.2gb]->[5.4gb]/[16gb], all_pools {[young] [2.8gb]->[0b]/[0b]}{[old] [4.7gb]->[5.3gb]/[16gb]}{[survivor] [652mb]->[184mb]/[0b]}
[2022-02-17T15:29:31,384][WARN ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965368] overhead, spent [1s] collecting in the last [1.1s]
[2022-02-17T15:31:49,449][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965506] overhead, spent [260ms] collecting in the last [1s]
[2022-02-17T15:33:46,505][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965623] overhead, spent [256ms] collecting in the last [1s]
[2022-02-17T15:37:11,728][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5965828] overhead, spent [372ms] collecting in the last [1s]
[2022-02-17T15:47:19,087][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966435] overhead, spent [289ms] collecting in the last [1s]
[2022-02-17T15:49:56,439][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966592] overhead, spent [315ms] collecting in the last [1.1s]
[2022-02-17T15:55:40,579][INFO ][o.e.m.j.JvmGcMonitorService] [search-esnode0] [gc][5966936] overhead, spent [274ms] collecting in the last [1s]
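If needed, the live heap pressure can be confirmed with the _cat API (a quick check, assuming the node answers HTTP on the usual port 9200):

curl -s 'http://search-esnode0:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'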
Objects replication:
- land D7180
- run puppet on db1 and storage1
- the sync automatically starts:
Feb 17 15:41:22 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 17 15:41:23 db1 syncoid[283583]: INFO: Sending oldest full snapshot data/objects@syncoid_db1_2022-02-17:15:41:23 (~ 11811.3 GB) to new target filesystem:
It will take some time to complete.
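A rough way to follow the initial transfer is to compare the space used on each side; the target dataset name on storage1 below is an assumption, adjust to the real one:

# on db1, the source
zfs get -Hp used data/objects
# on storage1, the receiving side (dataset name assumed)
zfs get -Hp used data/sync/db1/objects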
kafka data replication:
- prepare the dataset (ensuring there are no mounts this time; see the check sketched after this list)
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync
root@db1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync/storage1
root@db1:~# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
data                         736G  25.7T    96K  /data
data/postgres-indexer-12      96K  25.7T    96K  /srv/softwareheritage/postgres/12/indexer
data/postgres-main-12        733G  25.7T   729G  /srv/softwareheritage/postgres/12/main
data/postgres-misc           112K  25.7T   112K  /srv/softwareheritage/postgres
data/postgres-secondary-12    96K  25.7T    96K  /srv/softwareheritage/postgres/12/secondary
data/sync                    192K  25.7T    96K  none
data/sync/storage1            96K  25.7T    96K  none
- land D7179
- run puppet on db1 and storage1
- initial synchronization started:
Feb 17 13:05:09 db1 syncoid[999999]: INFO: Sending oldest full snapshot data/kafka@syncoid_db1_2022-02-17:13:05:09 (~ 1686.6 GB) to new target filesystem:
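To double check that the new intermediate datasets can never be mounted by accident (the point of canmount=noauto / mountpoint=none above), something like:

root@db1:~# zfs get -o name,property,value canmount,mountpoint data/sync data/sync/storage1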
Yes, my bad, it's due to T3911.
Feb 15 2022
- The initial synchronization took 2h20
- After a stabilization period, the synchronization runs every 5 minutes and takes ~1 minute (the logged sizes are uncompressed and must be divided by ~2.5 to get the real on-disk size)
D7173 landed. It initially focuses on the db1 -> storage1 replication to avoid having several initial replications running at the same time. The storage1 -> db1 replication will be configured once the initial db1 replication is done.
The replication will be done as follows (initiated by storage1):
the db1 dataset data/postgres-main-12 is replicated to storage1 under /data/sync/db1/postgresql-main-12
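Since the job is initiated by storage1, it boils down to a periodic syncoid pull of the db1 dataset over ssh; a minimal sketch only, the actual unit and options are managed by puppet (D7173):

# run on storage1
syncoid --no-sync-snap root@db1:data/postgres-main-12 data/sync/db1/postgresql-main-12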
fix the doc of the key name computation
Feb 14 2022
Feb 11 2022
Feb 10 2022
Feb 9 2022
Feb 8 2022
the first local snapshots worked:
root@dali:~# zfs list -t all
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
data                                                     66.7G   126G    24K  /data
data/postgresql                                          66.6G   126G  66.6G  /srv/postgresql/14/main
data/postgresql@autosnap_2022-02-08_19:04:44_monthly     1.47M      -  66.6G  -
data/postgresql@autosnap_2022-02-08_19:04:44_daily        194K      -  66.6G  -
data/postgresql/wal                                      31.8M   126G  14.9M  /srv/postgresql/14/main/pg_wal
data/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly 16.3M      -  31.3M  -
data/postgresql/wal@autosnap_2022-02-08_19:04:44_daily     13K      -  15.0M  -
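The snapshots and their retention can also be reviewed at any time, scoped to the postgresql datasets:

root@dali:~# zfs list -t snapshot -o name,used,creation -s creation -r data/postgresql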
rebase
The dali database directory tree was prepared to have a dedicated mount dataset for the wals:
root@dali:~# date
Tue Feb 8 18:48:57 UTC 2022
root@dali:~# systemctl stop postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 5ms ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s
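For reference, the preparation roughly amounts to the steps below. This is a hypothetical reconstruction using the paths visible above, not the exact commands that were run; pg_wal must stay postgres-owned with mode 700:

systemctl stop postgresql@14-main
mv /srv/postgresql/14/main/pg_wal /srv/postgresql/14/main/pg_wal.old
zfs create -o mountpoint=/srv/postgresql/14/main/pg_wal data/postgresql/wal
rsync -a /srv/postgresql/14/main/pg_wal.old/ /srv/postgresql/14/main/pg_wal/
chown postgres:postgres /srv/postgresql/14/main/pg_wal
chmod 700 /srv/postgresql/14/main/pg_wal
systemctl start postgresql@14-main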
use a template instead of the stdlib::to_toml function, which is not compatible with puppet 5
thanks, I will fix that.
update commit message
- add the postgresql backup management script
- ensure the snapshot of the wal is done after the postgresql snapshot
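The ordering matters because a WAL snapshot taken after the data snapshot guarantees that recovery from the data snapshot has at least all the WAL it needs; conceptually (a hypothetical manual equivalent, not the deployed script):

TS=$(date -u +%Y-%m-%dT%H:%M:%S)
zfs snapshot data/postgresql@manual_${TS}       # database files first
zfs snapshot data/postgresql/wal@manual_${TS}   # WAL second, so it is at least as recent as the data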
Update to only keep the local snapshot section.
The sync deployment will be implemented in another diff.
Feb 7 2022
the exporter is deployed.
The varnish stats are available on this dashboard: https://grafana.softwareheritage.org/d/pE2xMZank/varnish
Feb 4 2022
Feb 3 2022
use -- for all the options of the exporter configuration
minor update on documentation
- D7068 deployed and applied on the workers:
root@pergamon:/etc/clustershell# clush -b -w @workers -w worker17 -w worker18 "set -e; puppet agent --test"
clush: 0/31
clush: in progress(31): worker[01-18],worker[01-13].euwest.azure
--------------- worker01.euwest.azure ---------------
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for worker01.euwest.azure.internal.softwareheritage.org
Info: Applying configuration version '1643885189'
Notice: Applied catalog in 11.65 seconds
...
--------------- worker18 ---------------
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for worker18.softwareheritage.org
Info: Applying configuration version '1643885204'
Notice: /Stage[main]/Profile::Mountpoints/Mount[/srv/storage/space]/options: options changed 'rw,soft,intr,rsize=8192,wsize=8192,noauto,x-systemd.automount,x-systemd.device-timeout=10' to 'ro,soft,intr,rsize=8192,wsize=8192,noauto,x-systemd.automount,x-systemd.device-timeout=10'
Info: Computing checksum on file /etc/fstab
Info: /Stage[main]/Profile::Mountpoints/Mount[/srv/storage/space]: Scheduling refresh of Mount[/srv/storage/space]
Info: Mount[/srv/storage/space](provider=parsed): Remounting
Notice: /Stage[main]/Profile::Mountpoints/Mount[/srv/storage/space]: Triggered 'refresh' from 1 event
Info: /Stage[main]/Profile::Mountpoints/Mount[/srv/storage/space]: Scheduling refresh of Mount[/srv/storage/space]
Notice: Applied catalog in 19.67 seconds
clush: worker[01-18] (18): exited with exit code 2
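The read-only switch can be double-checked across the fleet, e.g. by looking at the fstab entry puppet manages:

root@pergamon:/etc/clustershell# clush -b -w @workers -w worker17 -w worker18 "grep /srv/storage/space /etc/fstab"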
completely remove the mountpoint to be removed, as the mount class does not do the cleanup when it is declared as absent.
Feb 2 2022
Feb 1 2022
Jan 31 2022
There are also:
- an LB for the postgresql replicas: https://portal.azure.com/#blade/Microsoft_Azure_Network/LoadBalancingHubMenuBlade/loadBalancers (swh-postgres-public)
- 2 Cosmos DBs for provenance (almost empty)
Jan 28 2022
a few minor remarks