
Upgrade zfs on all servers
Closed, Migrated

Description

Upgrade zfs packages on all servers using a zfs pool

  • stop workers
  • stop journal clients
  • Upgrade all packages, reboot (see the command sketch below)
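A minimal sketch of the per-host upgrade step, assuming Debian hosts pulling ZFS from the distribution packages (package and pool names vary per server):

apt update && apt full-upgrade   # upgrades the pending packages, including the zfs ones
reboot
# after the reboot, confirm the new zfs module is loaded and the pools are healthy
zfs version
zpool status -x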

Staging:

  • journal0 (already done)
  • swh-search0
  • storage1
  • db1

Production:

  • esnode1
  • esnode2
  • esnode3
  • search-esnode1
  • search-esnode2
  • search-esnode3
  • somerset (container / hypervisor packages)
  • beaubourg
  • saam
  • belvedere
  • kafka1
  • kafka2
  • kafka3
  • kafka4
  • branly
  • hypervisor3
  • pompidou
  • uffizi (already up-to-date)

Event Timeline

vsellier changed the task status from Open to Work in Progress. Mar 11 2021, 5:22 PM
vsellier triaged this task as High priority.
vsellier created this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

swh-search0

  • stopping writes
root@search0:~# systemctl stop swh-search-journal-client@objects
root@search0:~# systemctl stop swh-search-journal-client@indexed
root@search0:~# puppet agent --disable "zfs upgrade"
  • package upgrades
  • swh-search0 rebooted
  • all services are up and running
  • All workers and journal clients stopped before upgrading storage1 and db1
sudo clush -b -w @staging-workers 'puppet agent --disable "zfs upgrade"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
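A possible way to check that nothing is left running before touching storage1 and db1 (same clush group as above; empty output means all worker units are stopped):

sudo clush -b -w @staging-workers "systemctl list-units 'swh-worker@*' --no-legend"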

storage1

  • package upgrade done
  • restart done without any problem
  • all services are up and running

db1

  • package upgrade done
  • restart done without any problem
  • All workers reactivated
sudo clush -b -w @staging-workers 'systemctl default'
sudo clush -b -w @staging-workers 'puppet agent --enable; puppet agent --test'
sudo clush -b -w @staging-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl start $unit; done'
  • journal client on search0 reactivated

It seems everything is ok. There are a lot of timeouts on some database queries, but according to Sentry the problem was already there before the upgrade:
https://sentry.softwareheritage.org/share/issue/dbf854e3362742bd924d5a0418b3ee00/

The production upgrade will be performed next week

Plan:

  • first, the upgrade will be done on the elasticsearch servers
  • in parallel, somerset can be updated
  • afterwards, the webapp can be configured to use somerset as the principal database
  • upgrade of saam (with the help of @olasd)
  • upgrade of belvedere
  • and finally the kafka servers
  • *esnode*
    • delay node down detection and limit shard allocation to primaries
esnode2 ~ % export ES_NODE=192.168.100.62:9200                                              
esnode2 ~ % curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}%
  • package update
  • reboot
  • reactivate allocation
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
  • wait for a green cluster (a health check sketch follows below), then repeat the procedure on the next node
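A possible way to wait for the green status from the command line, using the standard Elasticsearch cluster health API and the ES_NODE variable set above (the timeout value is arbitrary):

# blocks until the cluster reports green (or the timeout expires); "status" should read "green"
curl -s "http://$ES_NODE/_cluster/health?wait_for_status=green&timeout=600s&pretty"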

Plan for the hypervisor / node upgrades:

  • beaubourg-related:
  • workers[13..16]: stop services, upgrade packages, shutdown, no need to move them
  • stop azure workers
  • moma: upgrade packages, stop and restart on pompidou
  • tate: upgrade packages, stop and restart on pompidou
  • upgrade and stop somerset
  • upgrade beaubourg and restart beaubourg
  • restart somerset
  • update the moma configuration to use somerset as the database
  • move back tate to beaubourg
  • move back moma to beaubourg
  • branly-related:
  • pompidou: upgrade, stop and restart on uffizi
  • logstash0: upgrade, stop and restart on uffizi
  • saatchi: move to hypervisor3 (to avoid stopping all the workers; migration command sketch after this list)
  • louvre: move to hypervisor3
  • thyssen: upgrade, stop and restart on uffizi
  • riverside: upgrade, stop and restart on hypervisor3
  • rp1: upgrade, stop and restart on hypervisor3 (hedgedoc downtime)
  • bardo: upgrade, stop and restart on hypervisor3 (hedgedoc downtime)
  • kelvingrove: upgrade, stop and restart on hypervisor3 (keycloak downtime)
  • boatbucket: upgrade, stop and restart on uffizi (and leave it there?)
  • [to be continued...]
  • saam / belvedere:
  • (take the opportunity to upgrade and restart saatchi)
  • upgrade and restart saam
  • upgrade and restart belvedere
  • restart all the workers
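For the VM moves between hypervisors, a sketch of the usual Proxmox commands (the VM id 123 and the target node are placeholders; online migration assumes the disks are on shared storage):

# on the source hypervisor: live-migrate the VM to the target node
qm migrate 123 hypervisor3 --online
# on the target hypervisor, the VM should now appear in:
qm list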

All the servers were updated. We took the opportunity to upgrade and restart them to apply the latest updates.

Regarding the proxmox restarts: besides moving the VMs to a hypervisor with the same kind of CPU and restarting them, the important setting to change is the noout option: on the proxmox interface, select a hypervisor > Ceph > OSD > check noout to avoid a rebalance of the ceph cluster, and uncheck it once the migration is done.
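The same flag can also be toggled from a shell on any hypervisor that is part of the ceph cluster (standard ceph commands):

ceph osd set noout      # before the reboots/migrations: no rebalance while OSDs are down
ceph osd unset noout    # once everything is back
ceph -s                 # the cluster should return to HEALTH_OK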

For kafka, nothing special:

  • upgrade
  • stop kafka
  • check the # of replicas on kafka manager [1]; it should become red
  • restart the server
  • wait for the restart of kafka
  • check the # of replicas on kafka manager [1]; everything should come back to normal (it can take a couple of minutes after the restart of kafka; a CLI alternative is sketched at the end of this note)

upgrade the next server

[1] http://getty:9000/clusters/rocquencourt/topics
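As an alternative to the kafka manager check, the under-replicated partitions can also be listed with the stock kafka CLI; the broker address and the script path below are assumptions for this setup:

/opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --under-replicated-partitions
# empty output means all partitions are back in sync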