Following D4651, we need to perform a rolling restart of the elasticsearch nodes in production to apply the new puppet configuration (no functional changes, just a reorganization).
Description
Revisions and Commits
rSPSITE puppet-swh-site
- D4651 Puppetize elasticsearch nodes
- D4674: rSPSITE262e122fa89b monitoring: gather metrics into prometheus
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2852 Take back control on elasticsearch puppet manifests |
| Migrated | gitlab-migration | T2888 Elasticsearch cluster failure during a rolling restart |
| Migrated | gitlab-migration | T2903 Test different disk configuration on esnode1 |
| Migrated | gitlab-migration | T2958 Use all the disks on esnode2 and esnode3 |
| Migrated | gitlab-migration | T2959 Move the system partition on a soft raid on esnode* |
| Migrated | gitlab-migration | T2960 Add disk health monitoring |
Event Timeline
A naive upgrade was started, but the cluster collapsed when one node went OutOfMemory during shard rebalancing.
The rolling upgrade procedure must be followed to reduce the impact on the cluster when a scheduled restart of a node is needed [1].
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
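For reference, the per-node sequence from that procedure looks roughly like this (a minimal sketch based on the documentation above; `$ES_NODE` points at any node of the cluster and the exact settings may need adjusting):

```
# Disable replica shard allocation so shards are not rebalanced while the node is down
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

# Optionally flush so shard recovery is faster after the restart
curl -XPOST http://$ES_NODE:9200/_flush

# Stop elasticsearch on the node, perform the upgrade/reboot, start it again,
# wait for it to rejoin the cluster, then re-enable shard allocation
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'

# Wait for the cluster to go back to green before moving on to the next node
curl -s http://$ES_NODE:9200/_cat/health\?v
```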
The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update as well.
It will be done on esnode3, and we will restart esnode1 and esnode2 immediately afterwards.
esnode3 was restarted and updated.
```
~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}

root@esnode3:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode3:~# apt update && apt upgrade -y && apt dist-upgrade -y
...
root@esnode3:~# shutdown -r now
```
The reboot did not go well due to a missing network configuration, which was fixed by @olasd on all the nodes.
After the reboot, ES was updated by puppet:
```
root@esnode3:~# puppet agent --enable && puppet agent --test
```
As there was some network interruption during the configuration fix, the cluster needs to recover from the beginning. It's in progress:
```
~/src/swh/puppet-environment master* ❯ curl -s http://$ES_NODE:9200/_cat/nodes\?v; echo; curl -s http://$ES_NODE:9200/_cat/health\?v; echo; curl -s http://$ES_NODE:9200/_cat/indices | awk '{print $1}' | sort | uniq -c;
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61           47          99   5   20.07   21.89    24.60 dilmrt    -      esnode1
192.168.100.63           39          99   6    5.40    5.81     8.87 dilmrt    *      esnode3
192.168.100.62           73          99   3    5.06    4.90    10.75 dilmrt    -      esnode2

epoch      timestamp cluster          status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1607093310 14:48:30  swh-logging-prod red             3         3    719 682    0    8    12619             4               1.7s                  5.4%

      3 close
     14 green
   2207 red
    330 yellow
```
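To follow the recovery in more detail, the ongoing shard recoveries can also be listed; a small check along the same lines (same `$ES_NODE` variable as above):

```
# List only the shard recoveries that are currently in flight
curl -s http://$ES_NODE:9200/_cat/recovery\?v\&active_only=true
```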
A new kernel release happened, and we had not finished the rolling upgrade last time.
So here we go; we did the following on both esnode3 and esnode2:
```
$ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" }, "transient": { "cluster.routing.allocation.enable": null } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}

$ curl -s http://$ES_NODE/_cluster/settings\?pretty
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "node_concurrent_recoveries" : "3",
          "enable" : "primaries",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "500MB"
      }
    },
    "xpack" : {
      "monitoring" : {
        "elasticsearch" : {
          "collection" : {
            "enabled" : "false"
          }
        },
        "collection" : {
          "enabled" : "false"
        }
      }
    }
  },
  "transient" : { }
}

$ systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
$ apt update && apt upgrade -y && apt full-upgrade -y
$ shutdown -r now
```
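The transcript above only moves the allocation setting to `primaries` (and clears the transient one); once each rebooted node has rejoined the cluster, allocation still has to be re-enabled so replicas can be reassigned. A sketch of that follow-up step from the rolling-restart procedure referenced earlier (assuming the same `$ES_NODE` variable):

```
# Reset the persistent allocation setting so replica shards can be allocated again
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'

# Then wait for the cluster to return to green before restarting the next node
curl -s http://$ES_NODE/_cat/health\?v
```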
So, status:
- esnode1's elasticsearch instance crashed during the shard redistribution that followed the restart of esnode2's elasticsearch instance.
- The cluster went red (as expected).
- Since we needed to reboot esnode1 anyway, we upgraded it as well and restarted it.
And then we got hit by a disk failure on esnode1 [1].
[1] T2888
Objective reached: we now manage the elasticsearch definitions through puppet.
The rest will be taken care of in dedicated tasks.