Following D4651, we need to perform a rolling restart of the elasticsearch nodes in production to apply the new puppet configuration (no functional changes, just a reorganization).
Description
Revisions and Commits
| Repository | Revision | Commit | Title |
|---|---|---|---|
| rSPSITE puppet-swh-site | D4651 | | Puppetize elasticsearch nodes |
| rSPSITE puppet-swh-site | D4674 | rSPSITE262e122fa89b | monitoring: gather metrics into prometheus |

| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2852 Take back control on elasticsearch puppet manifests |
| Migrated | gitlab-migration | T2888 Elasticsearch cluster failure during a rolling restart |
| Migrated | gitlab-migration | T2903 Test different disk configuration on esnode1 |
| Migrated | gitlab-migration | T2958 Use all the disks on esnode2 and esnode3 |
| Migrated | gitlab-migration | T2959 Move the system partition on a soft raid on esnode* |
| Migrated | gitlab-migration | T2960 Add disk health monitoring |
Event Timeline
A naive upgrade was started, but the cluster collapsed when a node ran out of memory during the shard rebalancing.
The rolling upgrade procedure must be followed to reduce the impact on the cluster when a scheduled restart of a node is needed [1].
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
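The per-node steps of that procedure can be sketched as shell helpers. This is a hedged sketch: `$ES_NODE` and the helper names are ours, not official tooling.

```shell
# Sketch of the per-node rolling-restart steps from the elastic.co
# rolling-upgrade procedure. ES_NODE and the helper names are illustrative.

# Before stopping a node: keep primaries allocated but stop
# rebalancing replica shards onto the remaining nodes.
disable_allocation() {
  curl -s -XPUT -H "Content-Type: application/json" \
    "http://$ES_NODE:9200/_cluster/settings" -d '{
      "persistent": { "cluster.routing.allocation.enable": "primaries" }
    }'
}

# Optional: flush so shard recovery is faster after the restart.
flush_indices() {
  curl -s -XPOST "http://$ES_NODE:9200/_flush"
}

# After the node has rejoined the cluster: allow allocation again.
enable_allocation() {
  curl -s -XPUT -H "Content-Type: application/json" \
    "http://$ES_NODE:9200/_cluster/settings" -d '{
      "persistent": { "cluster.routing.allocation.enable": null }
    }'
}
```

The node itself is restarted between `disable_allocation` and `enable_allocation`; skipping the first step is what triggers the mass rebalancing that overwhelmed the cluster.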
The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update.
It will be done on esnode3, and we will restart esnode1 and esnode2 immediately afterwards.
esnode3 was restarted and updated.
~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}%
root@esnode3:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode3:~# apt update && apt upgrade -y && apt dist-upgrade -y
...
root@esnode3:~# shutdown -r now

The reboot did not go smoothly due to a missing network configuration, which was fixed by @olasd on all the nodes.
After the reboot, ES was updated by puppet:
root@esnode3:~# puppet agent --enable && puppet agent --test
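The agent had presumably been disabled for the maintenance window before being re-enabled above; the disable/re-enable pair can be sketched as follows (helper names and the reason message are ours):

```shell
# Freeze puppet for the maintenance window, leaving a reason in the
# agent lockfile, then re-enable it and trigger a run once the node
# is back.
puppet_freeze() {
  puppet agent --disable "elasticsearch rolling restart"
}
puppet_thaw() {
  puppet agent --enable && puppet agent --test
}
```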
As there were some network interruptions during the configuration fix, the cluster needs to recover from the beginning. It is in progress:
~/src/swh/puppet-environment master* ❯ curl -s http://$ES_NODE:9200/_cat/nodes\?v; echo; curl -s http://$ES_NODE:9200/_cat/health\?v; echo; curl -s http://$ES_NODE:9200/_cat/indices | awk '{print $1}' | sort | uniq -c;
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61 47 99 5 20.07 21.89 24.60 dilmrt - esnode1
192.168.100.63 39 99 6 5.40 5.81 8.87 dilmrt * esnode3
192.168.100.62 73 99 3 5.06 4.90 10.75 dilmrt - esnode2
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1607093310 14:48:30 swh-logging-prod red 3 3 719 682 0 8 12619 4 1.7s 5.4%
3 close
14 green
2207 red
330 yellow

A new kernel release was published, and we had not finished the rolling upgrade last time.
So here we go; we did the following on both esnode3 and esnode2:
$ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": { "cluster.routing.allocation.enable": "primaries" },
  "transient": { "cluster.routing.allocation.enable": null }
}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}
$ curl -s http://$ES_NODE/_cluster/settings\?pretty
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "node_concurrent_recoveries" : "3",
          "enable" : "primaries",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "500MB"
      }
    },
    "xpack" : {
      "monitoring" : {
        "elasticsearch" : {
          "collection" : {
            "enabled" : "false"
          }
        },
        "collection" : {
          "enabled" : "false"
        }
      }
    }
  },
  "transient" : { }
}
$ systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
$ apt update && apt upgrade -y && apt full-upgrade -y
$ shutdown -r now
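Between node restarts we waited for the cluster to recover; that wait can be scripted by polling `_cat/health` (a sketch; the helper name and polling interval are ours):

```shell
# Poll cluster health until the requested status (default: green) is
# reached; without ?v, _cat/health prints the status in its fourth
# column (epoch, timestamp, cluster, status, ...).
wait_for_status() {
  local want="${1:-green}"
  until curl -s "http://$ES_NODE:9200/_cat/health" \
      | awk '{print $4}' | grep -qx "$want"; do
    sleep 10
  done
}
```

For a large red cluster like the one above, waiting only for `yellow` before restarting the next node is not enough: all primaries must be assigned first.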
So status:
- esnode1's elasticsearch instance crashed during the shard redistribution that followed the restart of esnode2's instance.
- the cluster went red (as expected).
- since esnode1 needed a reboot anyway, we upgraded it as well and restarted it.
And we got hit by a disk failure on esnode1 [1]
[1] T2888
Objective reached: we now manage the elasticsearch definitions through puppet.
The rest will be taken care of in dedicated tasks.