Migrate production database servers to bullseye
Closed, Migrated

Description

This task covers upgrading the operating system release of the database (db) nodes from buster to bullseye.
It does not migrate postgresql to a new major version; that is covered by a dedicated task [1].

[1] T2581

Servers to migrate:

  • db1.internal.staging.swh.network
  • belvedere.internal.softwareheritage.org
  • somerset.internal.softwareheritage.org

Plan:
Staging:
Moved to the subtask T3813

Production:

  • upgrade somerset
    • switch the webapp to belvedere
    • on somerset
      • disable puppet
      • stop and disable postgresql
      • perform the last buster upgrade
      • reboot (restarted recently)
      • perform the bullseye upgrade
      • reboot
      • restart and enable postgresql
      • check the replication with belvedere is ok
    • switch back the webapp to somerset
  • upgrade of belvedere
    • add a notification in the status.io page:
      A database upgrade is scheduled on XXXX-XX-XX between XX:XX and XX:XX.
      Some service disruptions can occur during this period.

      Impacted services:
      - archive.softwareheritage.org
      - Save code now
      - Source code crawler
      - deposit
    • connect to the idrac: https://swh9-adm.inria.fr/
    • stop the loader and lister workers
    • stop the indexers
    • stop the scheduler runners, including those running in the tmux on saatchi
    • ensure the provenance experiment is stopped
    • on belvedere:
      • stop puppet
      • stop and disable the postgresql clusters (to avoid restarts after the server reboots); this step can be ignored
      • perform the last upgrade of buster
      • reboot
      • upgrade to bullseye
      • reboot
      • check everything is going well after the reboot
      • start and enable the postgresql servers
      • check the replication to somerset is ok
      • reactivate puppet
    • restart stopped services
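
The steps that bracket the OS upgrade on each server can be sketched as a small dry-run script. This is only an illustration, not what was run: the `run` helper, the `DRY_RUN` variable and the function names are invented for the example, and the plain `postgresql` unit name is an assumption (the servers may use per-cluster units instead).

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Steps before the OS upgrade, in the order given by the plan.
# "postgresql" is an assumed unit name for this sketch.
before_upgrade() {
    run puppet agent --disable 'T3801'
    run systemctl stop postgresql
    run systemctl disable postgresql
}

# Steps after the final reboot, mirroring the plan in reverse.
after_upgrade() {
    run systemctl enable postgresql
    run systemctl start postgresql
    run puppet agent --enable
}

before_upgrade
after_upgrade
```

Keeping the default in dry-run mode means the sequence can be reviewed on any machine; the reboots and the upgrade itself are left out on purpose because a script cannot carry on across a reboot.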

Event Timeline

vsellier triaged this task as Normal priority. Dec 13 2021, 11:02 AM
vsellier created this task.

The following minor postgresql upgrades will be performed during the upgrade:

  • somerset: postgresql 13.4 -> 13.5 [1]

A dump/restore is not required for those running 13.X.

  • belvedere:
    • 11.14-0 -> 11.14-1 (indexer db)
    • 12.8-1 -> 12.9-1 [2] (other dbs)

A dump/restore is not required for those running 12.X.

  • db1:
    • 12.8-1 -> 12.9-1 [2]

[1] https://www.postgresql.org/docs/release/13.5/
[2] https://www.postgresql.org/docs/12/release-12-9.html
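
The "dump/restore is not required" statements rest on all of these being minor upgrades within one major version. A minimal sketch of that check, assuming PostgreSQL 10 or later (where the major version is the first number of the version string); the function names are made up for the example, and the release notes linked above remain the reference for any release-specific step.

```shell
#!/bin/sh
# Extract the PostgreSQL major version from a version string such as
# "13.5" or a Debian package version such as "11.14-0+deb10u1".
# Assumes PostgreSQL 10 or later, where the major version is the first number.
pg_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeeds (exit 0) when both versions belong to the same major release,
# i.e. the upgrade is a minor one.
is_minor_upgrade() {
    [ "$(pg_major "$1")" = "$(pg_major "$2")" ]
}

# The version pairs listed in this comment.
for pair in "13.4 13.5" "11.14-0 11.14-1" "12.8-1 12.9-1"; do
    set -- $pair
    if is_minor_upgrade "$1" "$2"; then
        echo "$1 -> $2: minor upgrade"
    else
        echo "$1 -> $2: major upgrade"
    fi
done
```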

ardumont renamed this task from migrate database servers to bullseye to Migrate database servers to bullseye. Dec 14 2021, 5:10 PM
ardumont updated the task description.
vsellier renamed this task from Migrate database servers to bullseye to Migrate production database servers to bullseye. Dec 21 2021, 9:04 AM
vsellier changed the task status from Open to Work in Progress.
vsellier claimed this task.
vsellier updated the task description.

somerset

on moma:

  • puppet disabled
root@moma:/etc/softwareheritage/storage# puppet agent --disable 'T3801 upgrade database servers'
  • storage configuration update to use belvedere database and service restarted

on somerset:

  • last upgrade of buster applied:
root@somerset:~# apt list --upgradable
Listing... Done
libpq5/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
pgbouncer/buster-pgdg 1.16.1-1.pgdg100+1 amd64 [upgradable from: 1.16.0-1.pgdg100+1]
postgresql-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-13/buster-pgdg 13.5-1.pgdg100+1 amd64 [upgradable from: 13.4-4.pgdg100+1]
postgresql-14/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
postgresql-client-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-client-13/buster-pgdg 13.5-1.pgdg100+1 amd64 [upgradable from: 13.4-4.pgdg100+1]
postgresql-client-14/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
postgresql-client-common/buster-pgdg 232.pgdg100+1 all [upgradable from: 231.pgdg100+1]
postgresql-common/buster-pgdg 232.pgdg100+1 all [upgradable from: 231.pgdg100+1]
postgresql-plperl-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-plpython3-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql/buster-pgdg 14+232.pgdg100+1 all [upgradable from: 14+231.pgdg100+1]

root@somerset:~# apt upgrade
  • postgresql has restarted correctly
2021-12-21 08:21:01 UTC [932629]: [3-1] LOG:  starting PostgreSQL 13.5 (Debian 13.5-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-12-21 08:21:01 UTC [932629]: [4-1] LOG:  listening on IPv6 address "::1", port 5433
2021-12-21 08:21:01 UTC [932629]: [5-1] LOG:  listening on IPv4 address "127.0.0.1", port 5433
2021-12-21 08:21:01 UTC [932629]: [6-1] LOG:  listening on IPv4 address "192.168.100.103", port 5433
2021-12-21 08:21:01 UTC [932629]: [7-1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2021-12-21 08:21:01 UTC [932631]: [1-1] LOG:  database system was shut down at 2021-12-21 08:20:53 UTC
2021-12-21 08:21:01 UTC [932631]: [2-1] LOG:  recovered replication state of node 1 to 258A0/2A30E0E8
2021-12-21 08:21:01 UTC [932629]: [8-1] LOG:  database system is ready to accept connections
2021-12-21 08:21:01 UTC [932638]: [1-1] LOG:  logical replication apply worker for subscription "softwareheritage_replica" has started
  • reboot not needed because only postgresql was updated
  • upgrade to bullseye
root@somerset:/etc# uptime
 08:33:09 up 10 days, 17:36,  2 users,  load average: 2.10, 2.27, 2.20
root@somerset:/etc# puppet agent --disable 'T3801'
root@somerset:/etc# sed -i -e 's/buster/bullseye/' /etc/apt/sources.list.d/*
root@somerset:/etc# sed -i -e 's,bullseye/updates,bullseye-security,' /etc/apt/sources.list.d/debian-security.list
root@somerset:/etc# git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   apt/sources.list.d/backports.list
	modified:   apt/sources.list.d/debian-security.list
	modified:   apt/sources.list.d/debian-updates.list
	modified:   apt/sources.list.d/debian.list
	modified:   apt/sources.list.d/hwraid_levert.list
	modified:   apt/sources.list.d/icinga-stable-release.list
	modified:   apt/sources.list.d/pgdg.list
	modified:   apt/sources.list.d/softwareheritage.list

no changes added to commit (use "git add" and/or "git commit -a")
root@somerset:/etc# grep bullseye-security /etc/apt/sources.list.d/debian-security.list
deb http://deb.debian.org/debian-security/ bullseye-security main
root@somerset:/etc# git add .
root@somerset:/etc# git commit -m "T3801: Migrate sources.list to bullseye"
root@somerset:/etc# CMD="apt -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold"
root@somerset:/etc# export DEBIAN_FRONTEND=noninteractive
root@somerset:/etc# $CMD upgrade -y
root@somerset:/etc# $CMD dist-upgrade -y
  • reboot
root@somerset:~# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • execute apt autoremove
  • reactivate and execute puppet
root@somerset:~# puppet agent --enable; puppet agent --test
...
Notice: Applied catalog in 26.27 seconds
  • postgresql is running correctly:
root@somerset:~# systemctl status postgresql@13-replica
* postgresql@13-replica.service - PostgreSQL Cluster 13-replica
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Tue 2021-12-21 08:50:46 UTC; 35s ago
    Process: 146 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-replica start (code=exited, status=0/SUCCESS)
   Main PID: 209 (postgres)
      Tasks: 14 (limit: 618987)
     Memory: 770.2M
        CPU: 16.087s
     CGroup: /system.slice/system-postgresql.slice/postgresql@13-replica.service
             |-209 /usr/lib/postgresql/13/bin/postgres -D /srv/softwareheritage/postgres/13/replica -c config_file=/etc/postgresql/13/replica/postgresql.conf
             |-268 postgres: 13/replica: logger
             |-283 postgres: 13/replica: checkpointer
             |-284 postgres: 13/replica: background writer
             |-285 postgres: 13/replica: walwriter
             |-286 postgres: 13/replica: autovacuum launcher
             |-287 postgres: 13/replica: stats collector
             |-288 postgres: 13/replica: logical replication launcher
             |-293 postgres: 13/replica: logical replication worker for subscription 99086720
             |-806 postgres: 13/replica: postgres softwareheritage 192.168.100.103(36940) idle
             |-807 postgres: 13/replica: guest softwareheritage 192.168.100.103(36942) idle
             |-819 postgres: 13/replica: guest softwareheritage 192.168.100.103(36944) idle
             |-820 postgres: 13/replica: guest softwareheritage 192.168.100.103(36946) idle
             `-823 postgres: 13/replica: autovacuum worker softwareheritage

Dec 21 08:50:43 somerset systemd[1]: Starting PostgreSQL Cluster 13-replica...
Dec 21 08:50:46 somerset systemd[1]: Started PostgreSQL Cluster 13-replica.
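
The two sed substitutions used above can be tried on sample text before touching /etc/apt/sources.list.d, and the resulting release can be read back from an os-release style file after the reboot. A small sketch; the function names and the sample lines are illustrative, not taken from the servers.

```shell
#!/bin/sh
# Apply the two substitutions from the log to text read on stdin.
# Order matters: the second expression matches the output of the first.
# Without the "g" flag only the first "buster" on each line is replaced.
to_bullseye() {
    sed -e 's/buster/bullseye/' -e 's,bullseye/updates,bullseye-security,'
}

# Print the VERSION_CODENAME value of an os-release style file.
codename_of() {
    sed -n 's/^VERSION_CODENAME=//p' "$1" | tr -d '"'
}

# Sample input (illustrative lines only).
printf '%s\n' \
    'deb http://deb.debian.org/debian/ buster main' \
    'deb http://deb.debian.org/debian-security/ buster/updates main' | to_bullseye

if [ -r /etc/os-release ]; then
    echo "running release: $(codename_of /etc/os-release)"
fi
```

Running the filter on a copy first makes it easy to spot lines where the plain codename swap is not enough, as was the case for the security repository.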

On moma:

  • reconfigure storage to use somerset
  • restart the service
  • reactivate puppet and run it to ensure the configuration is correct
root@moma:/etc/softwareheritage/storage# puppet agent --enable; puppet agent --test

Belvedere

A memory alert is logged on the idrac:

	Correctable memory error logging disabled for a memory device at location DIMM_A9. 	Fri 17 Dec 2021 16:15:39

This memory DIMM will have to be monitored in the future to check whether it is failing.

  • before the upgrade:
% uptime
 09:10:41 up 277 days, 18:46,  4 users,  load average: 41.18, 38.10, 38.51
  • stop the workers:
clush -b -w @swh-workers 'set -e; puppet agent --disable T3801; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop --no-block swh-worker@*; sleep 300; systemctl kill swh-worker@* -s 9'
  • stop the indexers
root@pergamon:/etc/clustershell# clush -b -w @azure-workers 'set -e; puppet agent --disable T3801; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop --no-block swh-worker@*; sleep 300; systemctl kill swh-worker@* -s 9'
  • stop the scheduler runners + tmux
root@saatchi:~# puppet agent --disable T3801
root@saatchi:~# systemctl stop swh-scheduler*
  • In the tmux:
    • the visit_type=deb was not running
    • stopped:
swhscheduler@saatchi:~$ queue=oneshot3:swh.loader.git.tasks.UpdateGitRepository
lister_uuid=860d41f8-d0c0-4733-a4d8-437c386bc31f
sleep=300
config=/etc/softwareheritage/scheduler/listener-runner.yml
while true; do
  for policy in never_visited_oldest_update_first already_visited_order_by_lag; do
    for visit_type in svn hg git; do
      echo "$(date) scheduling $visit_type origins with policy ${policy}"
      SWH_CONFIG_FILENAME=$config swh scheduler -C $config origin send-to-celery \
        --policy $policy \
        --queue $queue \
        --lister-uuid $lister_uuid \
        $visit_type
    done
    echo "$(date) sleep $sleep"
    sleep $sleep
  done
done
swhscheduler@saatchi:~$ lister_name=gitlab.com
lister_uuid=baf89663-feae-4850-a8ec-3a21e699cc0b
queue="oneshot3:swh.loader.git.tasks.UpdateGitRepository"
visit_type=git
sleep=300
while true; do
  for policy in never_visited_oldest_update_first never_visited_oldest_update_first never_visited_oldest_update_first already_visited_order_by_lag; do
    echo "$(date) scheduling $visit_type origins with policy ${policy} to queue ${queue} for lister ${lister_name}"
    SWH_CONFIG_FILENAME=/etc/softwareheritage/scheduler/listener-runner.yml \
      swh scheduler -C /etc/softwareheritage/scheduler/listener-runner.yml \
        origin send-to-celery --lister-uuid $lister_uuid --queue $queue --policy $policy $visit_type
    echo "$(date) sleep $sleep"
    sleep $sleep
  done
done

On belvedere:

  • network configuration updated to comment out the physical interfaces
  • (last buster upgrade skipped)
  • Upgrade to bullseye
    • update the network configuration in /etc/network/interfaces and comment out the physical interface declarations
      • puppet disabled
root@belvedere:/etc/network# puppet agent --disable T3801
  • Upgrade to bullseye performed
  • everything is ok after the reboot
  • Restart the services
    • all the services restarted
    • all the scheduler services restarted as before
    • an upgrade of buster was performed on the azure workers before restarting the services

everything looks good \o/

vsellier updated the task description.