Following D4651, we need to perform a rolling restart of the elasticsearch nodes in production to apply the new puppet configuration (no functional changes, just a reorganization).
Description
Revisions and Commits
rSPSITE puppet-swh-site
- D4651 Puppetize elasticsearch nodes
- D4674: rSPSITE262e122fa89b monitoring: gather metrics into prometheus
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2852 Take back control on elasticsearch puppet manifests |
| Migrated | gitlab-migration | T2888 Elasticsearch cluster failure during a rolling restart |
| Migrated | gitlab-migration | T2903 Test different disk configuration on esnode1 |
| Migrated | gitlab-migration | T2958 Use all the disks on esnode2 and esnode3 |
| Migrated | gitlab-migration | T2959 Move the system partition on a soft raid on esnode* |
| Migrated | gitlab-migration | T2960 Add disk health monitoring |
Event Timeline
A naive upgrade was started, but the cluster collapsed when one node went OutOfMemory during shard rebalancing.
The rolling upgrade procedure must be followed to reduce the impact on the cluster when a scheduled restart of a node is needed [1].
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
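For reference, the per-node sequence from that procedure looks roughly like this (a minimal sketch based on the documentation above; `$ES_NODE` points at any node of the cluster and the exact settings may need adjusting):

```
# Disable replica shard allocation so shards are not rebalanced while the node is down
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

# Optionally flush so shard recovery is faster after the restart
curl -XPOST http://$ES_NODE:9200/_flush

# Stop elasticsearch on the node, perform the upgrade/reboot, start it again,
# wait for it to rejoin the cluster, then re-enable shard allocation
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'

# Wait for the cluster to go back to green before moving on to the next node
curl -s http://$ES_NODE:9200/_cat/health\?v
```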
The puppet configuration is applied on esnode1 and esnode2, but we should have taken the opportunity to perform a system update as well.
It will be done on esnode3, and we will restart esnode1 and esnode2 immediately afterwards.
esnode3 was restarted and updated.
```
~/src/swh/puppet-environment master* ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE:9200/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}

root@esnode3:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode3:~# apt update && apt upgrade -y && apt dist-upgrade -y
...
root@esnode3:~# shutdown -r now
```
The reboot did not go well due to a missing network configuration, which was fixed by @olasd on all the nodes.
After the reboot, ES was updated by puppet:
```
root@esnode3:~# puppet agent --enable && puppet agent --test
```
As there was some network interruption during the configuration fix, the cluster needs to recover from the beginning. It's in progress:
```
~/src/swh/puppet-environment master* ❯ curl -s http://$ES_NODE:9200/_cat/nodes\?v; echo; curl -s http://$ES_NODE:9200/_cat/health\?v; echo; curl -s http://$ES_NODE:9200/_cat/indices | awk '{print $1}' | sort | uniq -c;
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61           47          99   5   20.07   21.89    24.60 dilmrt    -      esnode1
192.168.100.63           39          99   6    5.40    5.81     8.87 dilmrt    *      esnode3
192.168.100.62           73          99   3    5.06    4.90    10.75 dilmrt    -      esnode2

epoch      timestamp cluster          status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1607093310 14:48:30  swh-logging-prod red             3         3    719 682    0    8    12619             4               1.7s                  5.4%

      3 close
     14 green
   2207 red
    330 yellow
```
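To follow the recovery in more detail, the ongoing shard recoveries can also be listed; a small check along the same lines (same `$ES_NODE` variable as above):

```
# List only the shard recoveries that are currently in flight
curl -s http://$ES_NODE:9200/_cat/recovery\?v\&active_only=true
```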
A new kernel release happened, and we had not finished the rolling upgrade last time.
So here we go; we did the following on both esnode3 and esnode2:
```
$ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{ "persistent": { "cluster.routing.allocation.enable": "primaries" }, "transient": { "cluster.routing.allocation.enable": null } }'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"enable":"primaries"}}}},"transient":{}}

$ curl -s http://$ES_NODE/_cluster/settings\?pretty
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "node_concurrent_recoveries" : "3",
          "enable" : "primaries",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "500MB"
      }
    },
    "xpack" : {
      "monitoring" : {
        "elasticsearch" : {
          "collection" : {
            "enabled" : "false"
          }
        },
        "collection" : {
          "enabled" : "false"
        }
      }
    }
  },
  "transient" : { }
}

$ systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
$ apt update && apt upgrade -y && apt full-upgrade -y
$ shutdown -r now
```
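The transcript above only moves the allocation setting to `primaries` (and clears the transient one); once each rebooted node has rejoined the cluster, allocation still has to be re-enabled so replicas can be reassigned. A sketch of that follow-up step from the rolling-restart procedure referenced earlier (assuming the same `$ES_NODE` variable):

```
# Reset the persistent allocation setting so replica shards can be allocated again
curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'

# Then wait for the cluster to return to green before restarting the next node
curl -s http://$ES_NODE/_cat/health\?v
```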
So, status:
- esnode1's elasticsearch instance crashed during the shard redistribution that followed the restart of esnode2's elasticsearch instance.
- The cluster went red (as expected).
- Since we needed to reboot esnode1 anyway, we upgraded it as well and restarted it.
And then we got hit by a disk failure on esnode1 [1].
[1] T2888
Objective reached: we now manage the elasticsearch definitions through puppet.
The rest will be taken care of in dedicated tasks.