
Take back control on elasticsearch puppet manifests
Closed, ResolvedPublic

Description

Following D4651 we need to perform a rolling restart of the es nodes in production to apply the new puppet configuration (no changes, just reorganization).

Event Timeline

vsellier changed the task status from Open to Work in Progress.Dec 4 2020, 2:20 PM
vsellier triaged this task as Normal priority.
vsellier created this task.

A naive upgrade was started, but the cluster collapsed when a node ran out of memory (OutOfMemory) during the shard rebalancing.

The rolling upgrade procedure must be followed to reduce the impact on the cluster whenever a scheduled restart of a node is needed [1].

[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
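The per-node steps of that procedure can be sketched as follows. `ES_NODE` and the `cluster_status` helper are assumptions for illustration, not part of the documented procedure or the transcript below:

```shell
# Sketch of the per-node rolling-restart steps from [1]; ES_NODE is assumed
# to point at any live cluster member (e.g. esnode1), and cluster_status is
# a hypothetical helper.

# Extract the "status" field (green/yellow/red) from an ES health JSON reply.
cluster_status() {
  sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# 1. Restrict shard allocation to primaries before stopping the node:
#      curl -XPUT -H "Content-Type: application/json" \
#        "http://$ES_NODE:9200/_cluster/settings" \
#        -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'
# 2. Stop elasticsearch on the node, upgrade the system, reboot.
# 3. After the node rejoins, re-enable allocation (set enable to null) and
#    wait until the cluster is green before moving on to the next node:
#      curl -s "http://$ES_NODE:9200/_cluster/health" | cluster_status
```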

The puppet configuration was applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update.

It will be done on esnode3 and we will restart esnode1 and esnode2 immediately afterwards.

esnode3 was restarted and updated.

~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}

root@esnode3:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.

root@esnode3:~# apt update && apt upgrade -y && apt dist-upgrade -y
...

root@esnode3:~# shutdown -r now

The reboot did not go well due to a missing network configuration, which was fixed by @olasd on all the nodes.

After the reboot, ES was updated by puppet:

root@esnode3:~# puppet agent --enable && puppet agent --test

As there were some network interruptions during the configuration fix, the cluster needs to recover from scratch. Recovery is in progress:

~/src/swh/puppet-environment master* ❯ curl -s http://$ES_NODE:9200/_cat/nodes\?v; echo; curl -s http://$ES_NODE:9200/_cat/health\?v; echo; curl -s  http://$ES_NODE:9200/_cat/indices | awk '{print $1}' | sort | uniq -c; 
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61           47          99   5   20.07   21.89    24.60 dilmrt    -      esnode1
192.168.100.63           39          99   6    5.40    5.81     8.87 dilmrt    *      esnode3
192.168.100.62           73          99   3    5.06    4.90    10.75 dilmrt    -      esnode2

epoch      timestamp cluster          status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1607093310 14:48:30  swh-logging-prod red             3         3    719 682    0    8    12619             4               1.7s                  5.4%

      3 close
     14 green
   2207 red
    330 yellow
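Recovery progress can be watched with the same `_cat` endpoints; the `_cluster/health` API additionally accepts a `wait_for_status` parameter that blocks until the requested status (or the timeout) is reached. A small sketch, where `count_by_health` is a hypothetical helper mirroring the awk pipeline above:

```shell
# Summarize index health colors, as in the awk pipeline above: the first
# column of _cat/indices output is the per-index status
# (green/yellow/red/close).
count_by_health() {
  awk '{print $1}' | sort | uniq -c
}

# Block until the cluster is at least yellow, or 60s elapse (not run here):
#   curl -s "http://$ES_NODE:9200/_cluster/health?wait_for_status=yellow&timeout=60s"
```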

A new kernel release happened, and we had not finished the rolling upgrade last time.
So here we go; we did the following on both esnode3 and esnode2:

$ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
   },
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}
$ curl -s http://$ES_NODE/_cluster/settings\?pretty
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "node_concurrent_recoveries" : "3",
          "enable" : "primaries",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "500MB"
      }
    },
    "xpack" : {
      "monitoring" : {
        "elasticsearch" : {
          "collection" : {
            "enabled" : "false"
          }
        },
        "collection" : {
          "enabled" : "false"
        }
      }
    }
  },
  "transient" : { }
}

$ systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.

$ apt update && apt upgrade -y && apt full-upgrade -y

$ shutdown -r now
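The steps above leave allocation restricted to primaries; per the rolling-upgrade procedure [1], allocation has to be re-enabled once the restarted node rejoins. A sketch of that last step (the payload follows the ES documentation; the call itself is not shown in the transcript):

```shell
# Once the restarted node has rejoined, re-enable shard allocation to undo
# the "primaries" restriction set before the restart (standard last step of
# the rolling-restart procedure; not shown in the transcript above).
reenable_payload='{"persistent": {"cluster.routing.allocation.enable": null}}'
# curl -XPUT -H "Content-Type: application/json" \
#   "http://$ES_NODE/_cluster/settings" -d "$reenable_payload"
```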

So status:

  • esnode1's elasticsearch instance crashed during the shard redistribution after esnode2's elasticsearch instance restarted.
  • The cluster went red (as expected).
  • Since we needed to reboot esnode1 anyway, we upgraded it as well and restarted it.

And we got hit by a disk failure on esnode1 [1].

[1] T2888

Objective reached, we now deal with elasticsearch definitions through puppet.

The rest will be taken care of in a dedicated task.

ardumont claimed this task.