
Upgrade zfs on all servers
Closed, Migrated

Description

Upgrade zfs packages on all servers using a zfs pool

  • stop workers
  • stop journal clients
  • Upgrade all packages, reboot (see the command sketch below)
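A minimal sketch of the per-host upgrade step, assuming Debian hosts pulling ZFS from the distribution packages (package and pool names vary per server):

apt update && apt full-upgrade   # upgrades the pending packages, including the zfs ones
reboot
# after the reboot, confirm the new zfs module is loaded and the pools are healthy
zfs version
zpool status -x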

Staging:

  • journal0 (already done)
  • swh-search0
  • storage1
  • db1

Production:

  • esnode1
  • esnode2
  • esnode3
  • search-esnode1
  • search-esnode2
  • search-esnode3
  • somerset (container / hypervisor packages)
  • beaubourg
  • saam
  • belvedere
  • kafka1
  • kafka2
  • kafka3
  • kafka4
  • branly
  • hypervisor3
  • pompidou
  • uffizi (already up-to-date)

Event Timeline

vsellier changed the task status from Open to Work in Progress. Mar 11 2021, 5:22 PM
vsellier triaged this task as High priority.
vsellier created this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

swh-search0

  • stopping writes
root@search0:~# systemctl stop swh-search-journal-client@objects
root@search0:~# systemctl stop swh-search-journal-client@indexed
root@search0:~# puppet agent --disable "zfs upgrade"
  • package upgrades
  • swh-search0 rebooted
  • all services are up and running
  • All workers and journal clients stopped before upgrading storage1 and db1
sudo clush -b -w @staging-workers 'puppet agent --disable "zfs upgrade"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
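A possible way to check that nothing is left running before touching storage1 and db1 (same clush group as above; empty output means all worker units are stopped):

sudo clush -b -w @staging-workers "systemctl list-units 'swh-worker@*' --no-legend"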

storage1

  • package upgrade done
  • restart done without any problem
  • all services are up and running

db1

  • package upgrade done
  • restart done without any problem
  • All workers reactivated
sudo clush -b -w @staging-workers 'systemctl default'
sudo clush -b -w @staging-workers 'puppet agent --enable; puppet agent --test'
sudo clush -b -w @staging-workers 'cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl start $unit; done'
  • journal client on search0 reactivated

It seems everything is ok. There are a lot of timeouts on some database queries, but according to Sentry the problem was already there before the upgrade:
https://sentry.softwareheritage.org/share/issue/dbf854e3362742bd924d5a0418b3ee00/

The production upgrade will be performed next week

Plan:

  • first, the upgrade will be done on the elasticsearch servers
  • in parallel, somerset can be updated
  • afterwards, the webapp can be configured to use somerset as the principal database
  • upgrade of saam (with the help of @olasd)
  • upgrade of belvedere
  • and finally the kafka servers
  • *esnode*
    • delay node down detection and limit shard allocation to primaries
esnode2 ~ % export ES_NODE=192.168.100.62:9200                                              
esnode2 ~ % curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}%
  • package update
  • reboot
  • reactivate allocation
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
  • wait for a green cluster (a health check sketch follows below), then repeat the procedure on the next node
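A possible way to wait for the green status from the command line, using the standard Elasticsearch cluster health API and the ES_NODE variable set above (the timeout value is arbitrary):

# blocks until the cluster reports green (or the timeout expires); "status" should read "green"
curl -s "http://$ES_NODE/_cluster/health?wait_for_status=green&timeout=600s&pretty"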

Plan for the hypervisor / node upgrades:

  • beaubourg-related:
  • workers[13..16]: stop services, upgrade packages, shutdown, no need to move them
  • stop azure workers
  • moma: upgrade packages, stop and restart on pompidou
  • tate: upgrade packages, stop and restart on pompidou
  • upgrade and stop somerset
  • upgrade beaubourg and restart beaubourg
  • restart somerset
  • update the moma configuration to use somerset as the database
  • move back tate to beaubourg
  • move back moma to beaubourg
  • branly-related:
  • pompidou: upgrade, stop and restart on uffizi
  • logstash0: upgrade, stop and restart on uffizi
  • saatchi: move to hypervisor3 (to avoid stopping all the workers; migration command sketch after this list)
  • louvre: move to hypervisor3
  • thyssen: upgrade, stop and restart on uffizi
  • riverside: upgrade, stop and restart on hypervisor3
  • rp1: upgrade, stop and restart on hypervisor3 (hedgedoc downtime)
  • bardo: upgrade, stop and restart on hypervisor3 (hedgedoc downtime)
  • kelvingrove: upgrade, stop and restart on hypervisor3 (keycloak downtime)
  • boatbucket: upgrade, stop and restart on uffizi (and leave it there?)
  • [to be continued...]
  • saam / belvedere:
  • (take the opportunity to upgrade and restart saatchi)
  • upgrade and restart saam
  • upgrade and restart belvedere
  • restart all the workers
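For the VM moves between hypervisors, a sketch of the usual Proxmox commands (the VM id 123 and the target node are placeholders; online migration assumes the disks are on shared storage):

# on the source hypervisor: live-migrate the VM to the target node
qm migrate 123 hypervisor3 --online
# on the target hypervisor, the VM should now appear in:
qm list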

All the servers were updated. We took the opportunity to upgrade and restart them to apply the latest updates.

Regarding the proxmox restarts: besides moving the VMs to a hypervisor with the same kind of CPU and restarting them, the important setting to change is the noout option: on the proxmox interface, select a hypervisor > Ceph > OSD > check noout to avoid a rebalance of the ceph cluster, and uncheck it once the migration is done.
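The same flag can also be toggled from a shell on any hypervisor that is part of the ceph cluster (standard ceph commands):

ceph osd set noout      # before the reboots/migrations: no rebalance while OSDs are down
ceph osd unset noout    # once everything is back
ceph -s                 # the cluster should return to HEALTH_OK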

For kafka, nothing special:

  • upgrade
  • stop kafka
  • check the # of replicas on kafka manager [1]; it should become red
  • restart the server
  • wait for the restart of kafka
  • check the # of replicas on kafka manager [1]; everything should come back to normal (it can take a couple of minutes after the restart of kafka; a CLI alternative is sketched at the end of this note)

upgrade the next server

[1] http://getty:9000/clusters/rocquencourt/topics
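As an alternative to the kafka manager check, the under-replicated partitions can also be listed with the stock kafka CLI; the broker address and the script path below are assumptions for this setup:

/opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --under-replicated-partitions
# empty output means all partitions are back in sync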