Page MenuHomeSoftware Heritage

Use all the disks on esnode2 and esnode3
Started, Work in Progress, NormalPublic

Description

Currently, only 3 disks (on 4) are used and the partition formerly used by kafka is lost (~2to).
The usage of zfs with the same partitioning scheme on each disk[1] is ok (tested on esnode1).
It can be applied on esnode[2-3] to have an homogeneous configuration and add ~1.8To per node.

[1]:

root@esnode1:~# fdisk -l /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9

Device         Start        End    Sectors  Size Type
/dev/sda1       2048    1050623    1048576  512M EFI System
/dev/sda2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sda3   79175680  141676543   62500864 29.8G Linux swap
/dev/sda4  141676544 3907028991 3765352448  1.8T Linux filesystem

Procedure for each node :

  • Disable shard allocation on the node to gently remove the node from the cluster
  • wait until the node is emptied
  • Stop the node
  • Install zfs and create the pool
  • Restart the node
  • Activate shard allocation on the node
  • Wait for the cluster rebalancing

Event Timeline

vsellier triaged this task as Normal priority.Tue, Jan 12, 10:11 AM
vsellier created this task.
vsellier updated the task description. (Show Details)
vsellier changed the task status from Open to Work in Progress.Wed, Jan 13, 9:21 AM
vsellier moved this task from Backlog to in-progress on the System administration board.

esnode2

  • Gently remove the node from the cluster :
❯ export ES_NODE=esnode3.internal.softwareheritage.org:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.62"
    }
}'

{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.62"
          }
        }
      }
    }
  }
}

The cluster is moving the shard properly :

❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/health\?v                                                                                 09:30:28
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610526631 08:30:31  swh-logging-prod green           3         3   8734 4367   20    0        0             0                  -                100.0%

We have to wait for the end of the shard reallocation (relo column)

The cluster state has changed to yellow, so it's not so smooth as expected :

 ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/health\?v                                                                                 09:45:01
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610527512 08:45:12  swh-logging-prod yellow          3         3   5812 4367   20    0     2922            40               7.5m                 66.5%

Interesting reading on how the cluster state is replicated/persisted : https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
It seems there is a lot of pressure on the cluster with the shard reallocation. After the first timeout, it seems esnode3 is managing all the primary shards and esnode1 is trying again and again to recover until a new timeout is occuring.

It will try to temporary reduce the number of parallel index initialization to reduce a little the pressure on the cluster

The cause of the problem was a high write i/o pressure on esnode1 due to the index copy from esnode3.

The indices.recovery.max_bytes_per_sec value was configured to 500MB. It seems it's viable for a startup with local indexes that just have to be read locally.
When the indexes have to be written from another node, it seems the write I/O pressure is so important than the sever can't write the cluster status locally. It results to its exclusion from the cluster and it has to restart its whole initialization process.

[2021-01-13T22:17:34,597][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [11805ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [2131] unchanged indices
[2021-01-13T22:17:56,318][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [15407ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [0] indices and skipped [2132] unchanged indices
[2021-01-13T22:19:31,436][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [15387ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [0] indices and skipped [2132] unchanged indices
...
[2021-01-13T19:26:43,791][WARN ][o.e.c.c.LagDetector      ] [esnode3] node [{esnode1}{t4iSb7f1RZmEwpH4O_OoGw}{CpoEqfyFS-yD7fNQJ624Ig}{192.168.100.61}{192.168.100.61:9300}{dilmrt}{ml.machine_memory=33635225600, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is lagging at cluster state version [444470], although publication of cluster state version [444471] completed [1.5m] ago
...

The cluster finally recovered by limiting the max recovering bytes per seconds to 50MB.

curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "50MB"
    }
}'

The good news is it's seems it's not due to the new zfs migration as there is plenty of such exclusions on the previous logs (just a small extracts):

swh-logging-prod-2020-01-22-1.json.gz:{"type": "server", "timestamp": "2020-01-22T17:00:38,687Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [1m] publication of cluster state version [207982] is still waiting for {esnode2}{jTNROc0CQyCqUpItsfVMOQ}{gK5P8Q62Swa8redwMr_c-A}{192.168.100.62}{192.168.100.62:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ"  }
swh-logging-prod-2020-01-22-1.json.gz:{"type": "server", "timestamp": "2020-01-22T17:03:02,353Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [55.7s] publication of cluster state version [207985] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{ZXrh1hKTRXOYsEM_U2lztg}{192.168.100.61}{192.168.100.61:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ"  }


swh-logging-prod-2020-05-04-2.json.gz:{"type": "server", "timestamp": "2020-05-04T18:37:22,705Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [30s] publication of cluster state version [304674] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{Edz0GInGTR-LpoPV34OtHQ}{192.168.100.61}{192.168.100.61:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ"  }

swh-logging-prod-2020-12-04-5.log.gz:[2020-12-04T14:21:25,235][WARN ][o.e.c.c.C.CoordinatorPublication] [esnode3] after [43.8s] publication of cluster state version [361407] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{ZHWNVbprQiG3Cg9ZtZMYMQ}{192.168.100.61}{192.168.100.61:9300}{dilmrt}{ml.machine_memory=33688309760, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} [SENT_APPLY_COMMIT]

At one time or another the cluster was configured to use a limit of 40mb and the value was forced to 500mb as there is only logs on the way 40mb -> something :

, "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [200MB] to [500MB]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ"  }
/var/log/elasticsearch/swh-logging-prod-2020-01-22-1.log.gz:[2020-01-22T21:48:41,410][INFO ][o.e.c.s.ClusterSettings  ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [200MB]
/var/log/elasticsearch/swh-logging-prod-2020-01-22-1.log.gz:[2020-01-22T21:58:42,985][INFO ][o.e.c.s.ClusterSettings  ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [200MB] to [500MB]
/var/log/elasticsearch/swh-logging-prod-2020-06-30-1.json.gz:{"type": "server", "timestamp": "2020-06-30T06:03:04,337Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }
/var/log/elasticsearch/swh-logging-prod-2020-06-30-1.log.gz:[2020-06-30T06:03:04,337][INFO ][o.e.c.s.ClusterSettings  ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]
/var/log/elasticsearch/swh-logging-prod-2020-12-04-5.json.gz:{"type": "server", "timestamp": "2020-12-04T14:14:04,227Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }
/var/log/elasticsearch/swh-logging-prod-2020-12-04-5.log.gz:[2020-12-04T14:14:04,227][INFO ][o.e.c.s.ClusterSettings  ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]
/var/log/elasticsearch/swh-logging-prod-2020-12-14-1.json.gz:{"type": "server", "timestamp": "2020-12-14T12:50:16,482Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }
/var/log/elasticsearch/swh-logging-prod-2020-12-14-1.log.gz:[2020-12-14T12:50:16,482][INFO ][o.e.c.s.ClusterSettings  ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]

Having kept esnode2 in the cluster with no shards allocated allowed esnode3 to become the primary node of all the indexes so the cluster was not red when esnode1 was excluded from the cluster.

Now the cluster is stabilized, esnode2 can be stopped and migrated to zfs

  • installation and configuration of zfs on esnode2
    • backport packages installed
    • kernel upgraded to 5.0
root@esnode2:~# apt update
root@esnode2:~# apt list --upgradable
Listing... Done
libnss-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libpam-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libsystemd0/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libudev1/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
linux-image-amd64/buster-backports 5.9.15-1~bpo10+1 amd64 [upgradable from: 4.19+105+deb10u8]
systemd-sysv/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd-timesyncd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
udev/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
root@esnode2:~# apt dist-upgrade
root@esnode2:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode2:~# puppet agent --disable "zfs installation"
root@esnode2:~# shutdown -r now
  • zfs installation
root@esnode2:~#  apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
  • kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab	2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 /               ext4    errors=remount-ro 0       1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320  /boot/efi       vfat    umask=0077      0       1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka      ext4    defaults        0       2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none            swap    sw              0       0
-/dev/md0	/srv/elasticsearch	xfs	defaults,noatime	0 0
  • Removing old raid :
root@esnode2:~# mdadm --detail /dev/md0  
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Wed May 23 08:21:35 2018
             State : clean 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              Name : esnode2:0  (local to host esnode2)
              UUID : 9bdaa5d1:de7dbae8:6c57c7f9:55875c0b
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd

root@esnode2:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

There is no configuration defined on the /etc/mdadm/mdadm.conf file

  • Replicate sda partitionning to other disks:
root@esnode2:/etc/mdadm# sfdisk --wipe /dev/sdb
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
Checking that no-one is using this disk right now ... OK

The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 8606F89B-F66E-4B25-AD6E-D7C8BF9EF2F6).
The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

/dev/sdb1: Created a new partition 1 of type 'EFI System' and of size 512 MiB.
/dev/sdb2: Created a new partition 2 of type 'Linux filesystem' and of size 37.3 GiB.
/dev/sdb3: Created a new partition 3 of type 'Linux swap' and of size 29.8 GiB.
/dev/sdb4: Created a new partition 4 of type 'Linux filesystem' and of size 1.8 TiB.
/dev/sdb5: Done.

New situation:
Disklabel type: gpt
Disk identifier: 8606F89B-F66E-4B25-AD6E-D7C8BF9EF2F6

Device         Start        End    Sectors  Size Type
/dev/sdb1       2048    1050623    1048576  512M EFI System
/dev/sdb2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sdb3   79175680  141676543   62500864 29.8G Linux swap
/dev/sdb4  141676544 3907028991 3765352448  1.8T Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdc
...
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdd
...
  • zfs creation
root@esnode2:~# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
root@esnode2:~# zpool status elasticsearch-data
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                STATE     READ WRITE CKSUM
	elasticsearch-data  ONLINE       0     0     0
	  sda4              ONLINE       0     0     0
	  sdb4              ONLINE       0     0     0
	  sdc4              ONLINE       0     0     0
	  sdd4              ONLINE       0     0     0

errors: No known data errors
root@esnode2:~# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   141K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
root@esnode2:~# chown elasticsearch: /srv/elasticsearch/nodes
  • restart elasticsearch and remove the allocation exclusion
root@esnode2:~# puppet agent --enable
root@esnode2:~# systemctl enable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable elasticsearch
Created symlink /etc/systemd/system/multi-user.target.wants/elasticsearch.service -> /usr/lib/systemd/system/elasticsearch.service.
root@esnode2:~# systemctl start elasticsearch
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{ 
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : null
    }
}'

{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}

The cluster has started the shard redistribution on the cluster, we have to wait for the stabilization before doing the same work on esnode3

❯ curl -s http://192.168.100.63:9200/_cat/allocation\?s\=node\&v            12:35:32
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4368        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
    11        5.1gb     5.2gb      6.7tb      6.7tb            0 192.168.100.62 192.168.100.62 esnode2
  4367        4.4tb     4.4tb        1tb      5.4tb           81 192.168.100.63 192.168.100.63 esnode3
❯ curl -s http://192.168.100.63:9200/_cat/health\?v                         12:35:37
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610624162 11:36:02  swh-logging-prod green           3         3   8746 4373    2    0        0             0                  -                100.0%

After a reboot, a message Failed to start Import ZFS pools by cache file is displayed on the server console and the pool is not mounted. It seems it can be caused by using /dev/sd* disk names directly.

The pool was recreated with the WWN[1] ids and it solved the issue. I will check after esnode3 will be migrated of the pool on esnode1 need to be changed

root@esnode2:~# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode2:~# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   156K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
root@esnode2:~# zpool status 
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                            STATE     READ WRITE CKSUM
	elasticsearch-data              ONLINE       0     0     0
	  wwn-0x5000cca25ecbedbb-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2106-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2c6d-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc4292-part4  ONLINE       0     0     0

errors: No known data errors

[1] https://fr.wikipedia.org/wiki/World_wide_name

The cluster is stabilized :

❯ curl -s http://192.168.100.63:9200/_cat/health\?v
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610689991 05:53:11  swh-logging-prod green           3         3   8758 4379    0    0        0             0                  -                100.0%
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2919          3tb       3tb      3.7tb      6.7tb           44 192.168.100.61 192.168.100.61 esnode1
  2919        2.8tb     2.8tb      3.9tb      6.7tb           42 192.168.100.62 192.168.100.62 esnode2
  2920        2.9tb     2.9tb      2.4tb      5.4tb           54 192.168.100.63 192.168.100.63 esnode3

Now it's esnode3 turn:

❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{ 
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.63"
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.63"
          }
        }
      }
    }
  }
}
}'

esnode3 is ready to be migrated :

❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node                                                                                                                                        09:09:53
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     5.9gb      5.4tb      5.4tb            0 192.168.100.63 192.168.100.63 esnode3

esnode3 configured with the same procedure as esnode2 (check the previous comments)

root@esnode3:~# zpool status elasticsearch-data
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                            STATE     READ WRITE CKSUM
	elasticsearch-data              ONLINE       0     0     0
	  wwn-0x5000cca25ecc2116-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2612-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2727-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc3498-part4  ONLINE       0     0     0

errors: No known data errors
root@esnode3:~# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   146K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
  • puppet restarted and elasticsearch come back to the pool:
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node                                                                                                                                09:38:33
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     2.1mb      6.7tb      6.7tb            0 192.168.100.63 192.168.100.63 esnode3
  • allocation restriction removed :
curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{                                                                                                      09:38:37
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : null
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}
  • The shard reallocation restarts :
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node                                                                                                                                        09:39:17
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4396        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4396        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     2      280.4mb     213mb      6.7tb      6.7tb            0 192.168.100.63 192.168.100.63 esnode3