Page MenuHomeSoftware Heritage

Dedicate one admin host to centralize administration dbs
Closed, ResolvedPublic

Description

Centralize the postgresql databases used for admin tools into a single dedicated host
(on which we would be able to do more proper backups and monitoring).

This is also the occasion to use the latest postgresql versions (each have their own and
inconsistent versions).

Impacted services:

  • hedgedoc (host: bardo, db: postgres-12)
  • netbox (host: bojimans, db: postgres-11)
  • grafana (host: pergamon, db: postgres-11)
  • sentry (host: riverside, db: postgres-12)
  • keycloak (host: kelvingrove, db: postgres-12)

This is the size of the current databases:

DatabaseSize
hedgedoc42MB
netbox22MB
grafana18MB
sentry99GB
keycloak15MB

The sentry database will force us to reserve a large amount of disk for the new server

Plan:

  • Leave services' configuration untouched to use local db
  • D6907, D6906: First create host database machine (vm).
  • T3833#76853: Create zfs data mount point (for the dbs' data)
  • D6928: Declare a dedicated puppet profile, this lists all required dbs to create (using profile::postgresql::server).
  • D6906: terraform (/vagrant) to boostrap (this applies puppet so the dbs get created)
  • T3833#76889: firewall: Open flux from vlan 440 (bojimans, kelvingrove, riverside, #pergamon) to vlan 442, port 5432
  • for each service in {D6946: netbox, D6947: hedgedoc, T3817: grafana, D6951: sentry, rSPSITE2b8a33e79d6e49554339e3b70134eb84e8cad7cf: keycloak}:
    • Stop the service (we don't have incremental dump so stop the service first)
    • Export and mount back data dump from old db to the new one
    • Adapt configuration to switch to the new db
    • puppet apply to restart service (which now uses the new db)
    • Ensure service is still ok

Annex actions (outside the scope of this task, like T3817):

  • T3850: Move services {netbox, sentry, keycloak} in the admin vlan (442) and behind the reverse proxy
  • T3849: Clean up leftovers after migration

Related to D6871#178665

Event Timeline

vsellier updated the task description. (Show Details)
vsellier updated the task description. (Show Details)

Point of attention for the encoding, collate and type of dbs [1]:

  • netbox (bojimans):
postgres=# \l
                              List of databases
   Name    |  Owner   | Encoding | Collate |  Ctype  |   Access privileges
-----------+----------+----------+---------+---------+-----------------------
 netbox    | postgres | UTF8     | C.UTF-8 | C.UTF-8 | =T/postgres          +
           |          |          |         |         | postgres=CTc/postgres+
           |          |          |         |         | netbox=CTc/postgres
  • grafana, icinga2, icingaweb2 (pergamon)
postgres=# \l
                               List of databases
    Name    |  Owner   | Encoding | Collate |  Ctype  |    Access privileges
------------+----------+----------+---------+---------+-------------------------
 grafana    | postgres | UTF8     | C.UTF-8 | C.UTF-8 | postgres=CTc/postgres  +
            |          |          |         |         | =T/postgres            +
            |          |          |         |         | grafana=CTc/postgres
 icinga2    | postgres | UTF8     | C.UTF-8 | C.UTF-8 | postgres=CTc/postgres  +
            |          |          |         |         | =T/postgres            +
            |          |          |         |         | icinga2=CTc/postgres
 icingaweb2 | postgres | UTF8     | C.UTF-8 | C.UTF-8 | postgres=CTc/postgres  +
            |          |          |         |         | =T/postgres            +
            |          |          |         |         | icingaweb2=CTc/postgres
  • hedgedoc (bardo):
postgres=# \l
                             List of databases
   Name    |  Owner   | Encoding  | Collate | Ctype |   Access privileges
-----------+----------+-----------+---------+-------+-----------------------
 hedgedoc  | hedgedoc | SQL_ASCII | C       | C     | =T/hedgedoc          +
           |          |           |         |       | hedgedoc=CTc/hedgedoc+
           |          |           |         |       | guest=c/hedgedoc
  • sentry, keycloak (belvedere):
postgres=# \l
                                            List of databases
                Name                |    Owner     | Encoding | Collate |  Ctype  |   Access privileges
------------------------------------+--------------+----------+---------+---------+-----------------------
 keycloak                           | keycloak     | UTF8     | C.UTF-8 | C.UTF-8 |
 postgres                           | postgres     | UTF8     | C.UTF-8 | C.UTF-8 |
 sentry                             | sentry       | UTF8     | C.UTF-8 | C.UTF-8 |

We will align all dbs to the following:

|----------+---------+---------|
| Encoding | Collate | Ctype   |
|----------+---------+---------|
| UTF8     | C.UTF-8 | C.UTF-8 |
|----------+---------+---------|

(netbox uses ascii which is a subset of UTF-8, it was badly set because the puppet
module configures that way by default and we did not change that default at the time)

[1] T3772

ardumont updated the task description. (Show Details)
  • build the vm dali (terraform apply)
  • Install the necessary zfs tools:
root@dali:~# cat /etc/apt/sources.list.d/debian-contrib.list
deb https://deb.debian.org/debian/ bullseye contrib
root@dali:~# apt install zfs-dkms
root@dali:~# modprobe zfs
root@dali:~# lsmod | grep -i zfs
zfs                  4558848  0
zunicode              335872  1 zfs
zzstd                 573440  1 zfs
zlua                  184320  1 zfs
zavl                   16384  1 zfs
icp                   323584  1 zfs
zcommon               102400  2 zfs,icp
znvpair               106496  2 zfs,zcommon
spl                   118784  6 zfs,icp,zzstd,znvpair,zcommon,zavl
  • build zfs data pool:
root@dali:~# zpool create data /dev/vdb
root@dali:~# zpool status -v
  pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          vdb       ONLINE       0     0     0

errors: No known data errors
root@dali:~# zfs create -o mountpoint=/srv/softwareheritage/postgres/14/main -o atime=off -o relatime=on -o compression=lz4 data/postgresql
root@dali:~# zfs list
NAME              USED  AVAIL     REFER  MOUNTPOINT
data              165K   193G       24K  /data
data/postgresql    24K   193G       24K  /srv/softwareheritage/postgres/14/main
root@dali:~# mount | grep postgresql
data/postgresql on /srv/softwareheritage/postgres/14/main type zfs (rw,noatime,xattr,noacl)
  • Reboot machine to check the data pool is correctly mounted at startup (it is)
ardumont@dali:~% uptime
 10:09:44 up 0 min,  2 users,  load average: 0.38, 0.15, 0.05
ardumont@dali:~% mount | grep postgresql
data/postgresql on /srv/softwareheritage/postgres/14/main type zfs (rw,noatime,xattr,noacl)

Note: T3487#71271

Well actually use the right expected folder (as per the puppet manifest actually configured):

root@dali:~# zfs destroy data/postgresql
root@dali:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   156K   193G       24K  /data
root@dali:~# zfs create -o mountpoint=/srv/postgresql/14/main -o atime=off -o relatime=on -o compression=lz4 data/postgresql
root@dali:~# mount | grep postgresql
data/postgresql on /srv/postgresql/14/main type zfs (rw,noatime,xattr,noacl)

Dumped prod hedgedoc db and mounted back the dump
into the vm vagrant to ensure no impact of the encoding/collation
change in between.

No issue whatsoever.

postgres@bardo:~$ pg_dump --clean --if-exists hedgedoc | gzip -c - > $(date +%Y-%m-%dT%H:%M:%SZ)-hedgedoc.sql.gz
...
postgres@dali:~$ gzip -dc 2022-01-13T11\:18\:11Z-hedgedoc.sql.gz | psql hedgedoc
SET
SET
SET
SET
SET
 set_config
------------

(1 row)

SET
SET
SET
SET
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
DROP INDEX
DROP INDEX
DROP INDEX
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP SEQUENCE
DROP TABLE
DROP TYPE
CREATE TYPE
ALTER TYPE
SET
SET
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
ALTER TABLE
ALTER SEQUENCE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
ALTER TABLE
COPY 662
COPY 2164
COPY 5318
COPY 18
COPY 21065
COPY 20611
COPY 0
COPY 0
COPY 27
 setval
--------
    672
(1 row)

ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
postgres@dali:~$ psql hedgedoc
psql (14.1 (Debian 14.1-1.pgdg110+1))
Type "help" for help.

hedgedoc=# \d
               List of relations
 Schema |      Name      |   Type   |  Owner
--------+----------------+----------+----------
 public | Authors        | table    | hedgedoc
 public | Authors_id_seq | sequence | hedgedoc
 public | Notes          | table    | hedgedoc
 public | Revisions      | table    | hedgedoc
 public | SequelizeMeta  | table    | hedgedoc
 public | Session        | table    | hedgedoc
 public | Sessions       | table    | hedgedoc
 public | Temp           | table    | hedgedoc
 public | Temps          | table    | hedgedoc
 public | Users          | table    | hedgedoc
(10 rows)
ardumont changed the task status from Open to Work in Progress.Thu, Jan 13, 12:43 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.
  • dns updated: dali and alias db1 (.internal.admin.swh.network) are ok
$ ping -c3 db1.internal.admin.swh.network
PING dali.internal.admin.swh.network (192.168.50.50) 56(84) bytes of data.
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=1 ttl=63 time=4.51 ms
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=2 ttl=63 time=4.61 ms
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=3 ttl=63 time=4.79 ms

--- dali.internal.admin.swh.network ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 4.505/4.633/4.786/0.116 ms
$ ping -c3 dali.internal.admin.swh.network
PING dali.internal.admin.swh.network (192.168.50.50) 56(84) bytes of data.
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=1 ttl=63 time=4.52 ms
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=2 ttl=63 time=4.00 ms
64 bytes from dali.internal.admin.swh.network (192.168.50.50): icmp_seq=3 ttl=63 time=4.45 ms

--- dali.internal.admin.swh.network ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 3.997/4.320/4.518/0.230 ms
  • dbs installed:
postgres=# \l+
                                                                List of databases
   Name    |  Owner   | Encoding  | Collate |  Ctype  |   Access privileges   |  Size   | Tablespace |                Description
-----------+----------+-----------+---------+---------+-----------------------+---------+------------+--------------------------------------------
 grafana   | grafana  | UTF8      | C.UTF-8 | C.UTF-8 | =T/grafana           +| 8625 kB | pg_default |
           |          |           |         |         | grafana=CTc/grafana  +|         |            |
           |          |           |         |         | guest=c/grafana       |         |            |
 hedgedoc  | hedgedoc | UTF8      | C.UTF-8 | C.UTF-8 | =T/hedgedoc          +| 8625 kB | pg_default |
           |          |           |         |         | hedgedoc=CTc/hedgedoc+|         |            |
           |          |           |         |         | guest=c/hedgedoc      |         |            |
 keycloak  | keycloak | UTF8      | C.UTF-8 | C.UTF-8 | =T/keycloak          +| 8625 kB | pg_default |
           |          |           |         |         | keycloak=CTc/keycloak+|         |            |
           |          |           |         |         | guest=c/keycloak      |         |            |
 netbox    | netbox   | UTF8      | C.UTF-8 | C.UTF-8 | =T/netbox            +| 8625 kB | pg_default |
           |          |           |         |         | netbox=CTc/netbox    +|         |            |
           |          |           |         |         | guest=c/netbox        |         |            |
 postgres  | postgres | SQL_ASCII | C       | C       |                       | 8777 kB | pg_default | default administrative connection database
 sentry    | sentry   | UTF8      | C.UTF-8 | C.UTF-8 | =T/sentry            +| 8625 kB | pg_default |
           |          |           |         |         | sentry=CTc/sentry    +|         |            |
           |          |           |         |         | guest=c/sentry        |         |            |

Open flux in the firewall

  • 1 rule for pergamon (monitoring):
# without the rule to allow the icinga-server (pergamon for monitoring)
root@pergamon:~# telnet db1.internal.admin.swh.network 5432
Trying 192.168.50.50...
telnet: Unable to connect to remote host: Connection refused
# with the rule applied
root@pergamon:~# telnet db1.internal.admin.swh.network 5432
Trying 192.168.50.50...
Connected to dali.internal.admin.swh.network.
Escape character is '^]'.
^CConnection closed by foreign host.
  • another rule for other admin tools nodes [1]:
root@pergamon:~# for node in bojimans riverside kelvingrove; do ssh $node telnet dali.internal.admin.swh.network 5432; done
Trying 192.168.50.50...
Connected to dali.internal.admin.swh.network.
Escape character is '^]'.
q
quit
Connection closed by foreign host.
Trying 192.168.50.50...
Connected to dali.internal.admin.swh.network.
Escape character is '^]'.
quit
Connection closed by foreign host.
Trying 192.168.50.50...
Connected to dali.internal.admin.swh.network.
Escape character is '^]'.
quit
Connection closed by foreign host.

[1] Temporary rule because those are not in the correct admin vlan yet.

ardumont updated the task description. (Show Details)
ardumont updated the task description. (Show Details)
ardumont renamed this task from Dedicate one admin host to centralize administration db to Dedicate one admin host to centralize administration dbs.Fri, Jan 14, 5:41 PM

(tested with vagrant first)

On belvedere: do the dump and send it to the admin db node:

root@belvedere:~# DUMP_NAME=/srv/softwareheritage/postgres/keycloak-20220117.sql.gz; time sudo -i -u postgres pg_dump --clean --if-exists keycloak | pigz -c - > $DUMP_NAME && scp -i .ssh/id_ed25519.borg $DUMP_NAME dali.internal.admin.swh.network:/srv/postgresql/14/main/

On dali: Migrate data to the admin db keycloak:

root@dali:/srv/postgresql/14/main# DUMPNAME=keycloak-20220117.sql.gz; zcat $DUMPNAME | sudo -i -u postgres psql -e keycloak

On kelvingrove: modify the db connection setup and restart service:

cd /opt/keycloak
sed -i -e 's/db.internal.softwareheritage.org/db1.internal.admin.swh.network/g' config.cli
grep db1 config.cli
./bin/jboss-cli.sh --file=config.cli
systemctl restart keycloak
systemctl status keycloak | grep Active

It uses the right connection:

root@kelvingrove:/opt/keycloak# ss -tan | grep 5432
TIME-WAIT 0      0               192.168.100.106:46034           192.168.50.50:5432
ESTAB     0      0               192.168.100.106:46036           192.168.50.50:5432
TIME-WAIT 0      0               192.168.100.106:46042           192.168.50.50:5432
TIME-WAIT 0      0               192.168.100.106:46038           192.168.50.50:5432
ESTAB     0      0               192.168.100.106:46040           192.168.50.50:5432
ESTAB     0      0               192.168.100.106:46044           192.168.50.50:5432

Finally, activate puppet back:

puppet agent --enable && puppet agent --test

No issue encountered on impacted services.
Closing this.

ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.