
Test different disk configuration on esnode1
Closed, Migrated

Description

After the failure of 2 disks in the raid0 array of esnode1, the whole raid needs to be rebuilt.

As kafka has been removed from the ES nodes, we now have an additional ~2TB partition available on each server that can be allocated to the ES storage.

Some notes gathered during the T2888 incident:

  • the OS is installed on a single disk without replication; if the first disk fails, the system will be lost.
  • adding the former kafka partition as a datadir does not seem to be an ideal solution (T2888#55004)
  • the partitioning of the disks usable for the raid is not homogeneous (one partition from the disk holding the system + 3 complete disks); a raid0 over these 4 volumes will not be optimal
  • the ipmi console is not available for these servers, making a complete re-installation complicated

It was finally decided to apply the same partitioning layout to all the disks, with one partition allocated to the system and the remaining partition managed in a zfs pool.
This gives the same partition size on all the raid volumes and prepares the ground for having the system on a software raid1 replicated over the 4 disks (to be configured later; a rough sketch follows).
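
As a rough sketch of that future system replication (an assumption about the final layout, not something applied here; migrating a live root filesystem to md raid would in practice also require creating the array degraded, copying the data and updating the bootloader), the system partitions could later be assembled into a 4-way raid1 along these lines:

# hypothetical, not run: 4-way raid1 over the system partitions of the 4 disks
mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2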

The test will be performed on esnode1 first, in order to monitor the performance impact.

Event Timeline

vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Dec 22 2020, 12:08 PM

old raid cleanup

root@esnode1:~# umount /srv/elasticsearch 
root@esnode1:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2020-12-22 11:37:17.318967701 +0000
+++ /etc/fstab	2020-12-22 11:37:28.687049499 +0000
@@ -11,5 +11,3 @@
 UUID=AE23-D5B8  /boot/efi       vfat    umask=0077      0       1
 # swap was on /dev/sda3 during installation
 UUID=3eaaa22d-e1d2-4dde-9a45-d2fa22696cdf none            swap    sw              0       0
-UUID=6adb1e63-e709-4efb-8be1-76818b1b4751 /srv/kafka	ext4	errors=remount-ro	0 0
-/dev/md127	/srv/elasticsearch	xfs	defaults,noatime	0 0

root@esnode1:~# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu May 17 13:14:34 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Thu May 17 13:14:34 2018
             State : clean 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              Name : esnode1:0  (local to host esnode1)
              UUID : b64355a8:98292747:698f5f37:c7dd00b5
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd


root@esnode1:~# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

# raid configuration removed
root@esnode1:/etc/mdadm# diff -U3 mdadm.conf-save mdadm.conf
--- mdadm.conf-save	2020-12-22 11:50:19.960596899 +0000
+++ mdadm.conf	2020-12-22 11:50:32.744688812 +0000
@@ -15,7 +15,5 @@
 MAILADDR root
 
 # definitions of existing MD arrays
-ARRAY metadata=ddf UUID=4734f85a:a0508eae:afcafcbe:d3b25657
-ARRAY container=4734f85a:a0508eae:afcafcbe:d3b25657 member=0 UUID=f692bb46:8388cce8:10156cc3:0ba694ab
 
 # This configuration was auto-generated on Wed, 16 May 2018 16:17:57 +0200 by mkconf
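
One step that is not shown in this log but is usually needed on Debian after editing mdadm.conf: the file is embedded in the initramfs by the mdadm hook, so the change only becomes effective at boot after regenerating it, e.g.:

# regenerate the initramfs so it picks up the edited mdadm.conf
update-initramfs -u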

Replicate disk sda partitioning on all disks

root@esnode1:~# sfdisk -l /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9

Device         Start        End    Sectors  Size Type
/dev/sda1       2048    1050623    1048576  512M EFI System
/dev/sda2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sda3   79175680  141676543   62500864 29.8G Linux swap
/dev/sda4  141676544 3907028991 3765352448  1.8T Linux filesystem

root@esnode1:~# sfdisk -d /dev/sda | tee /tmp/partitions
label: gpt
label-id: 543964DA-9ECA-4222-952D-BA8A90FAB2B9
device: /dev/sda
unit: sectors
first-lba: 34
last-lba: 3907029134

/dev/sda1 : start=        2048, size=     1048576, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=E0F12CCB-DD6D-4564-8E4B-EE8BA40B977C
/dev/sda2 : start=     1050624, size=    78125056, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=949EB714-958F-4619-A996-41DADFE1F49A
/dev/sda3 : start=    79175680, size=    62500864, type=0657FD6D-A4AB-43C4-84E5-0933C84B4F4F, uuid=34A0D0FF-C739-4EE5-92A9-43541978913C
/dev/sda4 : start=   141676544, size=  3765352448, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=8A823A75-EE20-44BF-A498-8792D53CC9F6

root@esnode1:~# sfdisk -d /dev/sda | sfdisk -f /dev/sdb  # executed also on sdc and sdd
Checking that no-one is using this disk right now ... OK

The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 543964DA-9ECA-4222-952D-BA8A90FAB2B9).
The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

/dev/sdb1: Created a new partition 1 of type 'EFI System' and of size 512 MiB.
/dev/sdb2: Created a new partition 2 of type 'Linux filesystem' and of size 37.3 GiB.
/dev/sdb3: Created a new partition 3 of type 'Linux swap' and of size 29.8 GiB.
/dev/sdb4: Created a new partition 4 of type 'Linux filesystem' and of size 1.8 TiB.
/dev/sdb5: Done.

New situation:
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9

Device         Start        End    Sectors  Size Type
/dev/sdb1       2048    1050623    1048576  512M EFI System
/dev/sdb2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sdb3   79175680  141676543   62500864 29.8G Linux swap
/dev/sdb4  141676544 3907028991 3765352448  1.8T Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
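
The warning about the old linux_raid_member signature could have been addressed in the same pass; sfdisk can wipe pre-existing signatures while writing the new table. A hypothetical alternative invocation (not what was run above):

# same replication of sda's layout, but also wiping the old raid signature on the target disk
sfdisk -d /dev/sda | sfdisk -f --wipe always /dev/sdb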

As puppet cannot be restarted yet (to avoid elasticsearch restarting before zfs is configured), zfs was installed manually:

  • contrib and non-free repo configured
  • zfs installation
apt update && apt install linux-image-amd64 linux-headers-amd64 libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
  • zfs configuration
root@esnode1:~# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes /dev/sda4 /dev/sdd4
root@esnode1:/srv/elasticsearch/nodes# zpool status elasticsearch-data 
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                STATE     READ WRITE CKSUM
	elasticsearch-data  ONLINE       0     0     0
	  sda4              ONLINE       0     0     0
	  sdd4              ONLINE       0     0     0

Puppet could be restarted; the server will be included in the cluster when elasticsearch is restarted.
Any impact of zfs on performance can be monitored on the grafana dashboard dedicated to elasticsearch.
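
Besides grafana, per-pool I/O can also be watched directly on the node with the standard zfs tooling (not part of the original log), for instance:

# print pool-level I/O statistics per vdev, refreshed every 5 seconds
zpool iostat -v elasticsearch-data 5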

  • puppet executed
  • esnode1 is back in the cluster but still not selected to receive shards, due to a configuration rule:
~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/nodes\?v; echo; curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/health\?v                                               16:02:37
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61            3          57   0    0.35    0.25     0.12 dilmrt    -      esnode1
192.168.100.63           35          97   1    0.68    0.65     0.70 dilmrt    *      esnode3
192.168.100.62           35          96   2    0.66    0.75     0.82 dilmrt    -      esnode2

epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1608649359 15:02:39  swh-logging-prod green           3         3   8470 4235    0    0        0             0                  -                100.0%
~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cluster/settings\?pretty                                                                                                                   16:08:41
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "node_concurrent_recoveries" : "3",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "500MB"
      }
    },
    "xpack" : {
      "monitoring" : {
        "elasticsearch" : {
          "collection" : {
            "enabled" : "false"
          }
        },
        "collection" : {
          "enabled" : "false"
        }
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.61"
          }
        }
      }
    }
  }
}
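
For reference, this transient exclusion was presumably put in place earlier (the command is not in this log) with something like:

# hypothetical reconstruction of how the exclusion rule was set
curl -XPUT -H "Content-Type: application/json" http://esnode3.internal.softwareheritage.org:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "192.168.100.61"
  }
}'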

After the exclusion rule is removed, the shards start to migrate back to esnode1:

~ ❯ curl -XPUT -H "Content-Type: application/json" http://esnode3.internal.softwareheritage.org:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}'
{"acknowledged":true,"persistent":{},"transient":{}}%
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1608650037 15:13:57  swh-logging-prod green           3         3   8470 4235    2    0        0             0                  -                100.0%

The disk pressure should start to decrease on the other nodes, and we should be able to assess the impact of zfs.

atime was enabled by default; I enabled relatime and disabled atime:

root@esnode1:~# zfs get all  | grep time
elasticsearch-data  atime                 on                        default
elasticsearch-data  relatime              off                       default

root@esnode1:~# zfs set relatime=on elasticsearch-data
root@esnode1:~# zfs set atime=off elasticsearch-data

root@esnode1:~# zfs get all  | grep time
elasticsearch-data  atime                 off                       local
elasticsearch-data  relatime              on                        local

The shard reallocation is still in progress:

~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/shards\?h\=prirep,node | sort | uniq -c                                                                                                09:40:21
   1216 p esnode1
   1183 p esnode2
      1 p esnode2 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1840 p esnode3
      1 p esnode3 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1208 r esnode1
   1845 r esnode2
   1188 r esnode3

p: primary shard
r: replica shard

The pool on esnode1 is at around 60% usage:

root@esnode1:~# zpool list
NAME                 SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
elasticsearch-data  3.50T  2.12T  1.38T         -     2%    60%  1.00x  ONLINE  -

The I/O wait is currently high on esnode1, but the pattern was the same on the other nodes during the last cluster rebalancing. We will have to wait for the end of the shard re-allocation to see if there is a difference between zfs and the previous software raid/xfs setup.

The benchmark is done:

Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
esnode1-zfs-arc 63G  312k  99  478m  49  200m  43  640k  93  445m  53 400.8  31
Latency             31118us   58579us     748ms     231ms   78052us     275ms
Version  1.98       ------Sequential Create------ --------Random Create--------
esnode1-zfs-arc-lim -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  24 +++++ +++ 16384   7 16384  35 +++++ +++ 16384   6
Latency               145ms    2012us     826ms     105ms      21us     842ms
1.98,1.98,esnode1-zfs-arc-limited,1,1609729287,63G,,8192,5,312,99,489919,49,204649,43,640,93,455669,53,400.8,31,16,,,,,4059,24,+++++,+++,3023,7,11686,35,+++++,+++,2398,6,31118us,58579us,748ms,231ms,78052us,275ms,145ms,2012us,826ms,105ms,21us,842ms

(sorry for the formatting, I didn't find how to make it better)
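
The exact benchmark command line is not recorded here; judging by the output format and the "esnode1-zfs-arc-limited" label, this was a bonnie++ run, probably with the ZFS ARC size capped. A hypothetical reconstruction (the directory and the cap value are assumptions):

# cap the ZFS ARC, e.g. at 4 GiB, so the cache does not absorb the whole benchmark (value is an assumption)
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
# bonnie++ run against the pool mountpoint; the test file size defaults to twice the RAM
bonnie++ -d /srv/elasticsearch/nodes/bench -u root -m esnode1-zfs-arc-limited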

Elasticsearch was restarted and the indices are now rebalanced between the nodes:

$ curl -s  http://192.168.100.62:9200/_cat/health\?v                                                                                                                                                                              
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1609942469 14:14:29  swh-logging-prod green           3         3   8650 4325    0    0        0             0                  -                100.0%

(no "relo" in progress)

This is the zpool usage on esnode1:

root@esnode1:~# zpool list elasticsearch-data
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T  2.79T  4.21T        -         -     3%    39%  1.00x    ONLINE  -
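
The pool has grown from 3.5T at creation to 7T, which suggests the two remaining partitions were added to it in the meantime; the command is not in this log, but extending the pool that way would look like (hypothetical):

# hypothetical: add the two remaining 1.8T partitions as extra top-level vdevs
zpool add elasticsearch-data /dev/sdb4 /dev/sdc4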

Today's index, systemlogs-2021.01.06, is shared between esnode1 and esnode2; looking at the system graphs and the elasticsearch statistics, there is no visible difference for the moment.
I will keep the server under close monitoring until next week before deciding whether to validate the zfs setup.

After a week of observation, there are no visible differences in the system[1] and elasticsearch[2] monitoring.

The I/O is unbalanced[3] between the disks (sda/sdd) compared to (sdb/sdc), but that is probably because the 2 new disks were added to the pool later and zfs has not rebalanced the existing data.
It should be better balanced on the other nodes, where the zfs pool will be created from scratch.
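
For those nodes, a single creation with all four partitions would give zfs an even starting point; a minimal sketch, assuming the same partition layout as on esnode1:

# hypothetical pool creation on the other esnodes, with the four data partitions from the start
zpool create -f -m /srv/elasticsearch/nodes elasticsearch-data /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4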

[1] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=esnode1.internal.softwareheritage.org&from=now-24h&to=now
[2] https://grafana.softwareheritage.org/d/Hk5mBWJMz/elasticsearch?orgId=1&var-environment=production&var-cluster=swh-logging-prod&var-node=All&var-thread_pool=All&var-interval=$__auto_interval_interval&var-index=All&from=now-24h&to=now
[3] https://grafana.softwareheritage.org/d/q6c3_H0iz/system-overview?orgId=1&var-instance=esnode1.internal.softwareheritage.org&from=1610406000000&to=1610578799000