
Elasticsearch cluster failure during a rolling restart
Closed, Resolved · Public

Description

During the rolling restart of the cluster, 2 disk failures crashed esnode1 and prevented the cluster from recovering.

[Copied from a comment]
Short term plan:

  • Remove systemlogs indexes older than 1 year to start; we can go down to 3 months if necessary
  • Reactivate shard allocation to have 1 replica for all shards in case of a second node failure
  • Launch a long smartctl test on all the disks of each esnode* server
  • Contact DELL support to arrange the replacement of the 2 failing disks (under warranty(?)) [1]
  • Try to recover the 16 red indexes if possible; if not, delete them, as they are not critical
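The long self-test mentioned in the plan can be launched per disk with smartctl; a sketch for one esnode (device names assumed to match the layout seen on esnode1):

```shell
# Launch a long (extended) SMART self-test on each data disk;
# the test runs in the background on the drive itself.
for dev in /dev/sdb /dev/sdc /dev/sdd; do
    smartctl -t long "$dev"
done

# Check progress/results later; the extended test on these disks
# takes ~288 minutes according to the SMART capabilities output.
smartctl -l selftest /dev/sdb
```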

Middle term:

  • Reconfigure sentry to use its local kafka instance instead of the esnode* kafka cluster (thanks olasd)
  • D4747, D4757: Cleanup the esnode* kafka/zookeeper instances
  • done for esnode1: reclaim the 2 TB disk reserved for the journal => T2958
  • Add a new datadir on elasticsearch using the newly available disk
  • Add smartctl monitoring to detect disk failures as soon as possible (T2960)

[1] sdb serial : K5GJBLTA / sdc serial : K5GV9REA
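For the smartctl monitoring item (T2960), one possible approach is the smartd daemon shipped with smartmontools; a minimal /etc/smartd.conf sketch (the mail address and test schedule are assumptions, not the actual deployed configuration):

```
# /etc/smartd.conf — monitor all detected disks:
#   -a        : monitor all SMART attributes (health, errors, self-tests)
#   -o on     : enable automatic offline data collection
#   -S on     : enable attribute autosave
#   -s regex  : schedule a short self-test daily between 02:00 and 03:00
#   -m addr   : mail a warning on failure
DEVICESCAN -a -o on -S on -s (S/../.././02) -m sysadmin@example.org
```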

Event Timeline

vsellier changed the task status from Open to Work in Progress.Dec 14 2020, 10:15 PM
vsellier triaged this task as Normal priority.
vsellier created this task.

sdb and sdc on esnode1 have serious issues.
(no disks on the other servers show errors)

root@esnode1:~# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar 7K6000
Device Model:     HGST HUS726020ALA614
Serial Number:    K5GJBLTA
LU WWN Device Id: 5 000cca 25ec77181
Add. Product Id:  DELL(tm)
Firmware Version: A5DEKN35
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 14 21:15:35 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   90) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 288) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   135   135   000    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   140   140   000    Old_age   Offline      -       15
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       23277
 10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       828
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       828
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 22/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       39
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       262
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       256
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       59916736810
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       14219306472

SMART Error Log Version: 1
ATA Error Count: 20 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 20 occurred at disk power-on lifetime: 23274 hours (969 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 70 10 90 c8 47 40 08  48d+09:48:05.467  READ FPDMA QUEUED
  60 08 18 28 27 c5 40 08  48d+09:48:02.868  READ FPDMA QUEUED
  60 b8 b8 00 d8 a0 40 08  48d+09:48:02.664  READ FPDMA QUEUED
  60 48 b0 b8 d9 a0 40 08  48d+09:48:02.664  READ FPDMA QUEUED
  60 08 88 f8 0b 59 40 08  48d+09:48:02.644  READ FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 23031 hours (959 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 70 a8 90 c8 47 40 08  38d+06:28:47.674  READ FPDMA QUEUED
  60 20 20 00 b0 a9 40 08  38d+06:28:44.377  READ FPDMA QUEUED
  60 08 98 70 5b 27 40 08  38d+06:28:44.314  READ FPDMA QUEUED
  60 08 18 c0 c7 e0 40 08  38d+06:28:44.300  READ FPDMA QUEUED
  60 08 90 58 21 34 40 08  38d+06:28:44.210  READ FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 23031 hours (959 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a8 00 28 46 40 08  38d+06:16:38.847  READ FPDMA QUEUED
  60 08 10 e8 2f 88 40 08  38d+06:16:36.084  READ FPDMA QUEUED
  60 08 b0 50 b0 56 40 08  38d+06:16:36.068  READ FPDMA QUEUED
  60 00 90 00 24 46 40 08  38d+06:16:36.056  READ FPDMA QUEUED
  60 08 a0 88 aa 83 40 08  38d+06:16:36.038  READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 23031 hours (959 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 70 68 90 c8 47 40 08  38d+06:16:34.914  READ FPDMA QUEUED
  60 08 18 90 80 d2 40 08  38d+06:16:32.002  READ FPDMA QUEUED
  60 08 70 58 5e 66 40 08  38d+06:16:31.982  READ FPDMA QUEUED
  60 08 10 a0 5e ba 40 08  38d+06:16:31.974  READ FPDMA QUEUED
  60 08 d0 c8 f7 a3 40 08  38d+06:16:31.963  READ FPDMA QUEUED

Error 16 occurred at disk power-on lifetime: 23025 hours (959 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 78 00 24 46 40 08  38d+00:51:54.327  READ FPDMA QUEUED
  60 00 70 00 d0 59 40 08  38d+00:51:51.553  READ FPDMA QUEUED
  60 00 68 00 cc 59 40 08  38d+00:51:51.553  READ FPDMA QUEUED
  60 00 70 00 50 3b 40 08  38d+00:51:51.518  READ FPDMA QUEUED
  60 88 68 00 e0 61 40 08  38d+00:51:51.513  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xdf)       Completed without error       00%        11         -
# 2  Short offline       Completed without error       00%        10         -
# 3  Vendor (0xff)       Completed without error       00%         9         -
# 4  Short offline       Completed without error       00%         6         -
# 5  Vendor (0xdf)       Completed without error       00%         4         -
# 6  Vendor (0xdf)       Completed without error       00%         2         -
# 7  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@esnode1:~# smartctl -a /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar 7K6000
Device Model:     HGST HUS726020ALA614
Serial Number:    K5GV9REA
LU WWN Device Id: 5 000cca 25ecbf626
Add. Product Id:  DELL(tm)
Firmware Version: A5DEKN35
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 14 21:16:46 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   90) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 288) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   136   136   000    Old_age   Offline      -       109
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       0 (Average 42)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       146
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   140   140   000    Old_age   Offline      -       15
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       23272
 10 Spin_Retry_Count        0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       798
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       798
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 22/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       146
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       75
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       59950425446
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       14194697779

SMART Error Log Version: 1
ATA Error Count: 88 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 88 occurred at disk power-on lifetime: 23271 hours (969 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 e0 40 34 06 40 08  48d+11:13:22.763  READ FPDMA QUEUED
  60 08 48 10 c6 02 40 08  48d+11:13:16.879  READ FPDMA QUEUED
  60 08 40 38 4f 64 40 08  48d+11:13:16.871  READ FPDMA QUEUED
  60 08 98 50 c4 6c 40 08  48d+11:13:16.859  READ FPDMA QUEUED
  60 08 a8 f0 1b f3 40 08  48d+11:13:16.850  READ FPDMA QUEUED

Error 87 occurred at disk power-on lifetime: 23271 hours (969 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 98 20 38 15 40 08  48d+11:13:14.563  READ FPDMA QUEUED
  60 20 38 00 4c ef 40 08  48d+11:13:10.838  READ FPDMA QUEUED
  60 20 50 20 88 cd 40 08  48d+11:13:10.801  READ FPDMA QUEUED
  60 08 48 c8 00 04 40 08  48d+11:13:10.756  READ FPDMA QUEUED
  60 08 40 58 84 f6 40 08  48d+11:13:10.752  READ FPDMA QUEUED

Error 86 occurred at disk power-on lifetime: 23271 hours (969 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 38 80 ca 4c 40 08  48d+11:10:29.373  READ FPDMA QUEUED
  60 08 80 a0 54 6e 40 08  48d+11:10:28.729  READ FPDMA QUEUED
  60 08 30 e8 17 04 40 08  48d+11:10:28.634  READ FPDMA QUEUED
  60 20 d0 20 4c 3b 40 08  48d+11:10:28.544  READ FPDMA QUEUED
  60 20 28 00 2c 0d 40 08  48d+11:10:28.533  READ FPDMA QUEUED

Error 85 occurred at disk power-on lifetime: 23271 hours (969 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 18 00 38 da 40 08  48d+11:10:02.331  READ FPDMA QUEUED
  61 08 e8 58 cc 91 40 08  48d+11:09:54.431  WRITE FPDMA QUEUED
  61 08 10 e0 6f d5 40 08  48d+11:09:54.426  WRITE FPDMA QUEUED
  47 00 01 12 00 00 a0 08  48d+11:09:54.415  READ LOG DMA EXT
  47 00 01 00 00 00 a0 08  48d+11:09:54.414  READ LOG DMA EXT

Error 84 occurred at disk power-on lifetime: 23271 hours (969 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e8 00 34 da 40 08  48d+11:09:54.079  READ FPDMA QUEUED
  61 08 28 e0 6f d5 40 08  48d+11:09:46.193  WRITE FPDMA QUEUED
  60 08 30 88 8b f5 40 08  48d+11:09:44.442  READ FPDMA QUEUED
  61 78 e0 10 06 df 40 08  48d+11:09:44.262  WRITE FPDMA QUEUED
  60 00 d8 00 ec d8 40 08  48d+11:09:43.240  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xdf)       Completed without error       00%         6         -
# 2  Short offline       Completed without error       00%         5         -
# 3  Vendor (0xdf)       Completed without error       00%         2         -
# 4  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It seems only a limited number of indices are impacted by the corruption:

❯ curl -s  http://${ES_NODE}/_cat/indices | grep red                                                                                                        22:20:27
red    open  systemlogs-2020.08.30               o_gpFSjQRBuQBvWqaqA_dA 1 1                                   
red    open  systemlogs-2020.08.27               U4fKujQhTXmbGsx7zzLiPw 1 1                                   
red    open  systemlogs-2020.08.28               JVz-yhe4SeSow1TQPT61Jg 1 1                                   
red    open  systemlogs-2020.08.29               6avrSP3bRW2ZiwSlTpN0tA 1 1                                   
red    open  systemlogs-2020.08.22               jY7nPiXDS6a6aBnTDHNd1A 1 1                                   
red    open  systemlogs-2020.08.16               AK8wyDFQQ2KOgbzIdLvPqQ 1 1                                   
red    open  systemlogs-2020.08.13               o6OowHj-TMCBSglETaTj4w 1 1                                   
red    open  systemlogs-2020.08.10               NN0H_eaXQJW_20lsIMmg0Q 1 1                                   
red    open  systemlogs-2020.08.08               pkJVICAdSbqn3JgHU1h5Yw 1 1                                   
red    open  systemlogs-2020.09.07               naRyJEkZRCeOY5h_2avRyg 1 1                                   
red    open  systemlogs-2020.09.03               wb0DMaeqT2-Lh4nx8rafgQ 1 1                                   
red    open  systemlogs-2020.09.01               jelq1Ij5SGWQAKDqdbCYlQ 1 1                                   
red    open  swh_workers-2020.09.03              c1ZiRR8HS9W44T3nVd7f9Q 2 1  2733325        0   1.6gb    1.6gb
red    open  systemlogs-2020.07.24               743a1usWSw-whONPLhcKrA 1 1                                   
red    open  systemlogs-2020.07.25               zFkfn6l5SA-sby3A0SOAtw 1 1                                   
red    open  systemlogs-2020.07.17               PxL7sBrUQ8SXtbOEG5v_3A 1 1
~/src/swh/puppet-environment/swh-site staging ❯ curl -s  http://$ES_NODE/_cat/indices | awk '{print $1}' | sort | uniq -c                                                                                 22:21:58
      3 close
     91 green
     16 red
   2474 yellow

XFS has shut down the partition, so ES can no longer access it.

(extract of dmesg)
[ 2911.868611] print_req_error: I/O error, dev sdc, sector 406619
[ 2911.875907] ata3: EH complete
[ 2911.875988] XFS (md127): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x6a040 len 32 error 5
[ 2911.886901] XFS (md127): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 2911.886903] XFS (md127): xfs_do_force_shutdown(0x8) called from line 3417 of file fs/xfs/xfs_inode.c.  Return address = 000000009c75e059
[ 2911.909556] XFS (md127): Corruption of in-memory data detected.  Shutting down filesystem
(extract of /var/log/elasticsearch/swh-logging-prod.log)
[2020-12-14T21:25:57,712][WARN ][o.e.i.s.RetentionLeaseSyncAction] [esnode1] [swh_workers-2020.09.03][0] retention lease background sync failed
org.elasticsearch.transport.RemoteTransportException: [esnode1][192.168.100.61:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
Caused by: org.elasticsearch.gateway.WriteStateException: failed to open state directory /srv/elasticsearch/nodes/0/indices/c1ZiRR8HS9W44T3nVd7f9Q/0/_state

The service on esnode1 was stopped, as it was only generating noise in the logs due to the failing shard redistribution in the cluster.
Shard reallocation, which had been stopped during the rolling upgrade, was reactivated to maximize shard duplication on the 2 remaining nodes.

~ ❯ curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{                                                                                                                22:23:11
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
{"acknowledged":false,"persistent":{},"transient":{}}

Fortunately, the cluster is indexing the new logs correctly, as all the recent indexes are at least in yellow state:

~ ❯ curl -s  http://$ES_NODE/_cat/indices | grep 2020.12.14                                                                                                                                               22:42:07
green  open  swh_workers-2020.12.14              cxpt8dFCS--dhb6MaMQVmw 2 1  4840151        0   6.8gb    3.4gb
yellow open  apache_logs-2020.12.14              NotcgMTVRIS6o1cDfOkQNw 3 1    79265        0  82.5mb   53.8mb
yellow open  systemlogs-2020.12.14               n6C4s_gUQaC4i08Yji1ppA 1 1  1987986        0   2.1gb    2.1gb

Short term plan:

  • Remove systemlogs indexes older than 1 year to start; we can go down to 3 months if necessary
  • Reactivate shard allocation to have 1 replica for all shards in case of a second node failure
  • Launch a long smartctl test on all the disks of each esnode* server
  • Contact DELL support to arrange the replacement of the 2 failing disks (under warranty(?)) [1]
  • Try to recover the 16 red indexes if possible; if not, delete them, as they are not critical

Middle term:

  • Reconfigure sentry to use its local kafka instance instead of the esnode* kafka cluster
  • Cleanup the esnode* kafka/zookeeper instances and reclaim the 2 TB disk reserved for the journal
  • Add a new datadir on elasticsearch using the newly available disk
  • Add smartctl monitoring to detect disk failures as soon as possible

[1]

  • sdb serial : K5GJBLTA
  • sdc serial : K5GV9REA
vsellier raised the priority of this task from Normal to Unbreak Now!.Dec 15 2020, 10:54 AM
vsellier updated the task description. (Show Details)

Cleanup of the systemlogs indexes from before 2020 (2018/2019)

Disk space before:

vsellier@esnode3 ~ % curl -s http://${ES_NODE}/_cat/indices | awk '{print $3}' | grep "systemlogs.*2018" | sort | wc -l
120

It was not a full year.

vsellier@esnode2 ~ % df -h /srv/elasticsearch 
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        5.5T  4.4T  1.2T  80% /srv/elasticsearch
vsellier@esnode3 ~ % df -h /srv/elasticsearch
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      5.5T  4.7T  815G  86% /srv/elasticsearch

2018 cleanup

vsellier@esnode3 ~ % curl -s http://${ES_NODE}/_cat/indices | awk '{print $3}' | grep "systemlogs.*2018" | sort | xargs -t -n1 -i{} -r curl -XDELETE http://${ES_NODE}/{}
curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.27 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.28 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.31 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.01 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.03 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.04 
...
vsellier@esnode2 ~ % df -h /srv/elasticsearch      
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        5.5T  4.3T  1.3T  78% /srv/elasticsearch

vsellier@esnode3 ~ % df -h /srv/elasticsearch      
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      5.5T  4.6T  909G  84% /srv/elasticsearch

~200 GB were freed

2019 cleanup

vsellier@esnode3 ~ % curl -s http://${ES_NODE}/_cat/indices | awk '{print $3}' | grep "systemlogs.*2019" | sort  | wc -l
365

It's a full year.

vsellier@esnode3 ~ % curl -s http://${ES_NODE}/_cat/indices | awk '{print $3}' | grep "systemlogs.*2019" | sort | xargs -t -n1 -i{} -r curl -XDELETE http://${ES_NODE}/{}
curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.01.01 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.01.02 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.01.03 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.01.04 
...
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.12.28 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.12.29 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.12.30 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2019.12.31 
{"acknowledged":true}%
vsellier@esnode2 ~ % df -h /srv/elasticsearch      
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        5.5T  3.4T  2.2T  62% /srv/elasticsearch

vsellier@esnode3 ~ % df -h /srv/elasticsearch      
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      5.5T  3.8T  1.8T  68% /srv/elasticsearch

~1.8 TB were freed

Shard allocation is reactivated; there should be enough free disk space to replicate all the shards on the 2 remaining nodes.

vsellier@esnode3 ~ % curl -XPUT -H "Content-Type: application/json" http://$ES_NODE/_cluster/settings -d '{
"persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
{"acknowledged":true,"persistent":{},"transient":{}}
olasd renamed this task from Cluster failure during a rolling restart to Elasticsearch cluster failure during a rolling restart.Dec 15 2020, 3:24 PM

Disk usage is again around ~85% on esnode3 (~79% on esnode2).
The systemlogs-2020.01.* indices were removed.

vsellier@esnode3 ~ % curl -s http://${ES_NODE}/_cat/indices | awk '{print $3}' | grep "systemlogs.*2020.01" | sort | xargs -t -n1 -i{} -r curl -XDELETE http://${ES_NODE}/{}
curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.01 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.02 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.03 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.04 
...
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.30 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2020.01.31 
{"acknowledged":true}%
vsellier@esnode2 ~ % df -h /srv/elasticsearch 
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        5.5T  3.9T  1.6T  71% /srv/elasticsearch
vsellier@esnode3 ~ % df -h /srv/elasticsearch 
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      5.5T  4.3T  1.2T  79% /srv/elasticsearch

We tried to temporarily restart esnode1 to reallocate the shards of the red indices for which esnode1 was the primary.
Actions:

  • Mount/remount the XFS partition to replay the XFS journal
  • Perform an xfs_repair to ensure the filesystem is OK
  • Configure elasticsearch to deallocate the shards managed by esnode1
  • Start esnode1
  • Wait for the shard redistribution (swh_workers-2020.09.03 was quickly recovered, and the remaining systemlogs-2018.* indices deleted)
  • Stop esnode1
  • Disable puppet to avoid a restart of elasticsearch on esnode1
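The "deallocate" step above can be done by excluding the node from allocation; a hedged sketch (the IP used for esnode1 is an assumption):

```shell
# Exclude esnode1 from shard allocation so its shards are rebuilt elsewhere
# (the IP is an assumption; _name or _host selectors also exist).
curl -XPUT -H "Content-Type: application/json" "http://${ES_NODE}/_cluster/settings" \
  -d '{"transient": {"cluster.routing.allocation.exclude._ip": "192.168.100.61"}}'

# Once the node is stopped for good, clear the exclusion:
curl -XPUT -H "Content-Type: application/json" "http://${ES_NODE}/_cluster/settings" \
  -d '{"transient": {"cluster.routing.allocation.exclude._ip": null}}'
```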
root@esnode1:~# umount /srv/elasticsearch 
root@esnode1:~# cat /proc/mdstat 
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid0 sdc[1] sdb[0] sdd[2]
      5860150272 blocks super 1.2 512k chunks
      
unused devices: <none>
root@esnode1:~# xfs_repair /dev/md127 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
root@esnode1:~# mount /srv/elasticsearch/

# dmesg content
[77869.451143] XFS (md127): Mounting V5 Filesystem
[77869.655634] XFS (md127): Starting recovery (logdev: internal)
[77877.098250] XFS (md127): Ending recovery (logdev: internal)

root@esnode1:~# umount /srv/elasticsearch/
root@esnode1:~# xfs_repair /dev/md127 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 16:33:14: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 16:33:14: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 30
...
        - agno = 14
        - 16:33:51: process known inodes and inode discovery - 499200 of 499200 inodes done
        - process newly discovered inodes...
        - 16:33:51: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 16:33:51: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
...
        - agno = 31
        - 16:33:51: check for inodes claiming duplicate blocks - 499200 of 499200 inodes done
Phase 5 - rebuild AG headers and trees...
        - 16:33:51: rebuild AG headers and trees - 32 of 32 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 16:34:24: verify and correct link counts - 32 of 32 allocation groups done
done

root@esnode1:~# mount /srv/elasticsearch/
root@esnode1:~# systemctl start elasticsearch
~/src/swh/puppet-environment/private/swh-private-data master ❯ curl -s http://${ES_NODE}/_cat/indices\?v | grep red                                                                                       17:51:32
red    close  systemlogs-2018.09.03               rp1wDYs1RKq_zFsBzJaRhQ   5   1                                                  
red    close  systemlogs-2018.09.04               aVaYePiuQA60vDykpcj-rg   5   1                                                  
red    close  systemlogs-2018.09.01               pc4J5b0kQq2uO-SBQgcFOw   5   1                                                  
red    close  systemlogs-2018.09.09               wdSJhSdWSMWJINub25bUdw   5   1                                                  
red    close  systemlogs-2018.09.08               _9h1FgpFTWK_X--GafCG3Q   5   1                                                  
red    close  systemlogs-2018.09.06               kLAdG46dTbaGJyPRMERkzQ   5   1                                                  
red    close  systemlogs-2018.09.14               VoV-pBiAQ0eSBz-NL08lDg   5   1                                                  
red    close  systemlogs-2018.09.15               ii9pPWqfSGCV03MXnenpmQ   5   1                                                  
red    close  systemlogs-2018.09.12               aJP8jPjfQuKEJVWg1oGBBw   5   1                                                  
red    close  systemlogs-2018.09.13               LpYRnaO-SuCicQqcKucitg   5   1                                                  
red    close  systemlogs-2018.09.10               mEpZ0-vnQVq2NFgE5N9IJw   5   1                                                  
red    close  systemlogs-2018.09.11               KmerHn_jQKSkHgeLpiBwAw   5   1                                                  
red    close  systemlogs-2018.08.22               2Ixph871QpGXQ2BYIEA-PQ   5   1                                                  
red    close  systemlogs-2018.08.25               XWG9xsDmTUiSwNfWJqi5ag   5   1                                                  
red    close  systemlogs-2018.08.27               5wi2zryCThSfPLZ2Gf6GOQ   5   1                                                  
red    close  systemlogs-2018.08.28               xX8LQDUkRNS-f6w2Cqi9EQ   5   1                                                  
red    close  systemlogs-2018.08.31               dc2IGw_LQPy0gTxmF2Wu8w   5   1       

~/src/swh/puppet-environment/private/swh-private-data master ❯ curl -s http://${ES_NODE}/_cat/indices\?v | grep red | awk '{print $3}' | xargs -t -i{} curl -XDELETE http://${ES_NODE}/{}                 17:54:22
curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.03 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.04 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.01 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.09 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.08 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.06 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.14 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.15 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.12 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.13 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.10 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.09.11 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.22 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.25 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.27 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.28 
{"acknowledged":true}curl -XDELETE http://192.168.100.63:9200/systemlogs-2018.08.31 
{"acknowledged":true}%
  • smartctl extended tests are running on all the esnode* disks to detect possible defects. The results will be available in a few hours
root@esnode*:~# smartctl -t long /dev/sd[a-d]
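Progress and results of the long tests can be polled without waiting for the full run; a minimal sketch:

```shell
# Poll the self-test log on each disk ("# 1" is the most recent test;
# "Self-test routine in progress" shows the remaining percentage while running).
for d in /dev/sd[a-d]; do
  echo "== $d =="
  smartctl -l selftest "$d" | grep -A2 "Self-test log"
done
```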

Remark regarding extending the storage by adding a new data directory: according to the elasticsearch documentation [1], it may not be the best way to do it:

Elasticsearch does not balance shards across a node’s data paths. High disk usage in a single path can trigger a high disk usage watermark for the entire node. If triggered, Elasticsearch will not add shards to the node, even if the node’s other paths have available disk space. If you need additional disk space, we recommend you add a new node rather than additional data paths.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/important-settings.html#path-settings


All the smartctl tests are done and no additional faulty disks were detected.

Cleaned up some more space to decrease the icinga noise.

Before:

$ ssh esnode3 df -h | grep elastic
/dev/md127      5.5T  4.4T  1.1T  81% /srv/elasticsearch
$ ssh esnode2 df -h | grep elastic
/dev/md0        5.5T  4.4T  1.1T  81% /srv/elasticsearch

Clean up:

curl -s http://${ES_NODE}/_cat/indices | grep systemlogs | grep 2020.02 | awk '{print $3}' | xargs -t -i{} -r curl -s -XDELETE http://${ES_NODE}/{}
curl -s -XDELETE http://192.168.100.63:9200/systemlogs-2020.02.18
{"acknowledged":true}curl -s -XDELETE http://192.168.100.63:9200/systemlogs-2020.02.19
{"acknowledged":true}curl -s -XDELETE http://192.168.100.63:9200/systemlogs-2020.02.12
{"acknowledged":true}curl -s -XDELETE http://192.168.100.63:9200/systemlogs-2020.02.13
...
{"acknowledged":true}curl -s -XDELETE http://192.168.100.63:9200/systemlogs-2020.02.20
{"acknowledged":true}%

After:

$ ssh esnode2 df -h | grep elastic
/dev/md0        5.5T  4.3T  1.3T  78% /srv/elasticsearch
$ ssh esnode3 df -h | grep elastic
/dev/md127      5.5T  4.3T  1.3T  78% /srv/elasticsearch

The esnode* and zookeeper nodes are cleaned up.
zookeeper nodes destroyed (and puppet resources cleaned up).

We should be able to reclaim the empty space from /srv/kafka (2 TB) to give back to elasticsearch.

[1] The clean-up script was improved following an IRC discussion [2]

root@pergamon:~# swh-puppet-master-clean-certificate zookeeper1.internal.softwareheritage.org zookeeper2.internal.softwareheritage.org zookeeper3.internal.softwareheritage.org
+ puppet node clean zookeeper1.internal.softwareheritage.org zookeeper2.internal.softwareheritage.org zookeeper3.internal.softwareheritage.org
Notice: Revoked certificate with serial 222
Notice: Removing file Puppet::SSL::Certificate zookeeper1.internal.softwareheritage.org at '/var/lib/puppet/ssl/ca/signed/zookeeper1.internal.softwareheritage.org.pem'
Notice: Revoked certificate with serial 223
Notice: Removing file Puppet::SSL::Certificate zookeeper2.internal.softwareheritage.org at '/var/lib/puppet/ssl/ca/signed/zookeeper2.internal.softwareheritage.org.pem'
Notice: Revoked certificate with serial 221
Notice: Removing file Puppet::SSL::Certificate zookeeper3.internal.softwareheritage.org at '/var/lib/puppet/ssl/ca/signed/zookeeper3.internal.softwareheritage.org.pem'
zookeeper1.internal.softwareheritage.org
zookeeper2.internal.softwareheritage.org
zookeeper3.internal.softwareheritage.org
+ puppet cert clean zookeeper1.internal.softwareheritage.org zookeeper2.internal.softwareheritage.org zookeeper3.internal.softwareheritage.org
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 222
Notice: Revoked certificate with serial 223
Notice: Revoked certificate with serial 221
+ systemctl restart apache2

[2]

Notice: /Stage[main]/Profile::Puppet::Master/File[/usr/local/sbin/swh-puppet-master-clean-certificate]/content:
--- /usr/local/sbin/swh-puppet-master-clean-certificate 2019-07-29 13:00:30.390819275 +0000
+++ /tmp/puppet-file20201218-516965-17h5wb5     2020-12-18 10:19:32.300884112 +0000
@@ -1,14 +1,13 @@
 #!/usr/bin/env bash

 # Use:
-# $0 CERTNAME
+# $0 CERTNAME ...

 # Example:
-# $0 storage0.internal.staging.swh.network
+# $0 storage0.internal.staging.swh.network db0.internal.staging.swh.network

 set -x

-CERTNAME=$1
-puppet node deactivate $CERTNAME
-puppet cert clean $CERTNAME
+puppet node clean $@
+puppet cert clean $@
 systemctl restart apache2

Info: Computing checksum on file /usr/local/sbin/swh-puppet-master-clean-certificate
Info: /Stage[main]/Profile::Puppet::Master/File[/usr/local/sbin/swh-puppet-master-clean-certificate]: Filebucketed /usr/local/sbin/swh-puppet-master-clean-certificate to puppet with sum 817ded4094b7bc2f6f4a6487a467be76
Notice: /Stage[main]/Profile::Puppet::Master/File[/usr/local/sbin/swh-puppet-master-clean-certificate]/content: content changed '{md5}817ded4094b7bc2f6f4a6487a467be76' to '{md5}1e7ca917d6b9706caf3eb5863153e36a'
Notice: Applied catalog in 35.61 seconds
ardumont lowered the priority of this task from Unbreak Now! to High.Dec 18 2020, 10:37 AM

Unbreak Now! to High.

Because the cluster is green again.

It's in degraded mode since we are missing one node (esnode1).

There should be a maintenance operation to rack the 2 new disks received (to replace the 2 dead ones) at some point (waiting on DSI for a ping back for that).

We will run some more tests on esnode1 regarding the raid configuration now that it's out of the cluster and its current raid is dead.

The disks can't be replaced before the beginning of January because the logistics service is closed.
Dell was notified about the delay for the disk replacement. The next package retrieval attempt by UPS is scheduled for *2021-01-11*.

The 2 disks were replaced:

root@esnode1:~# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000NM012A-2MP130
Serial Number:    WJC054FE
LU WWN Device Id: 5 000c50 0ccd5b501
Add. Product Id:  DELL(tm)
Firmware Version: CAJ8
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  5 14:37:13 2021 UTC
...
root@esnode1:~# smartctl -a /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000NM012A-2MP130
Serial Number:    WJC04T8W
LU WWN Device Id: 5 000c50 0ccd995bd
Add. Product Id:  DELL(tm)
Firmware Version: CAJ8
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  5 14:46:52 2021 UTC

A SMART test is in progress before adding them to the zfs pool.

The old disks are packed and ready to be picked up by UPS on 2021-01-11.

The new disks are OK according to the SMART test:

root@esnode1:~# echo /dev/sd{b,c} | xargs -n1 smartctl -a | grep -A2 "Self-test log"
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
--
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -

They can be partitioned and added to the zfs pool:

  • sdb:
root@esnode1:~# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000NM012A-2MP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 543964DA-9ECA-4222-952D-BA8A90FAB2B9).
/dev/sdb1: Created a new partition 1 of type 'EFI System' and of size 512 MiB.
/dev/sdb2: Created a new partition 2 of type 'Linux filesystem' and of size 37.3 GiB.
/dev/sdb3: Created a new partition 3 of type 'Linux swap' and of size 29.8 GiB.
/dev/sdb4: Created a new partition 4 of type 'Linux filesystem' and of size 1.8 TiB.
/dev/sdb5: Done.

New situation:
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9

Device         Start        End    Sectors  Size Type
/dev/sdb1       2048    1050623    1048576  512M EFI System
/dev/sdb2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sdb3   79175680  141676543   62500864 29.8G Linux swap
/dev/sdb4  141676544 3907028991 3765352448  1.8T Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
  • sdc:
root@esnode1:~# sfdisk -d /dev/sda | sfdisk -f /dev/sdc
Checking that no-one is using this disk right now ... OK

Disk /dev/sdc: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000NM012A-2MP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 543964DA-9ECA-4222-952D-BA8A90FAB2B9).
/dev/sdc1: Created a new partition 1 of type 'EFI System' and of size 512 MiB.
/dev/sdc2: Created a new partition 2 of type 'Linux filesystem' and of size 37.3 GiB.
/dev/sdc3: Created a new partition 3 of type 'Linux swap' and of size 29.8 GiB.
/dev/sdc4: Created a new partition 4 of type 'Linux filesystem' and of size 1.8 TiB.
/dev/sdc5: Done.

New situation:
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9

Device         Start        End    Sectors  Size Type
/dev/sdc1       2048    1050623    1048576  512M EFI System
/dev/sdc2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sdc3   79175680  141676543   62500864 29.8G Linux swap
/dev/sdc4  141676544 3907028991 3765352448  1.8T Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
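The `sfdisk -d | sfdisk` pattern used above copies sda's partition table onto the replacement disk. Note that the dump includes sda's `label-id`, which is why sdb and sdc both report the same disklabel GUID above. The pattern can be tried safely on throwaway image files:

```shell
# Clone a partition layout between plain image files (no root or real disks needed).
truncate -s 100M /tmp/src.img /tmp/dst.img
printf 'label: gpt\n,40M\n,\n' | sfdisk -q /tmp/src.img   # two partitions on the source
sfdisk -d /tmp/src.img | sfdisk -q -f /tmp/dst.img        # replay the dump on the target
sfdisk -d /tmp/dst.img | grep -c '^/tmp/dst.img'          # prints the number of cloned partitions
```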

Before adding them to the pool:

root@esnode1:~# df -h /srv/elasticsearch/nodes
Filesystem          Size  Used Avail Use% Mounted on
elasticsearch-data  3.4T  2.9T  586G  84% /srv/elasticsearch/nodes

Adding the disks to the pool:

root@esnode1:~# zpool add elasticsearch-data /dev/sdb4
root@esnode1:~# zpool add elasticsearch-data /dev/sdc4

After:

root@esnode1:~# df -h /srv/elasticsearch/nodes
Filesystem          Size  Used Avail Use% Mounted on
elasticsearch-data  6.8T  2.9T  4.0T  42% /srv/elasticsearch/nodes
root@esnode1:~# zpool status elasticsearch-data
  pool: elasticsearch-data
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: none requested
config:

	NAME                STATE     READ WRITE CKSUM
	elasticsearch-data  ONLINE       0     0     0
	  sda4              ONLINE       0     0     0
	  sdd4              ONLINE       0     0     0
	  sdb4              ONLINE       0     0     0
	  sdc4              ONLINE       0     0     0

errors: No known data errors

For the record, a benchmark is launched on the zfs pool with bonnie++:

% /usr/sbin/bonnie++ -m esnode1-zfs

Let's wait for the results before restarting elasticsearch.

This comment was removed by vsellier.
vsellier lowered the priority of this task from High to Normal.Jan 7 2021, 12:04 PM

Reducing the priority to normal as there is no more risk to the data.

vsellier updated the task description. (Show Details)
vsellier claimed this task.

The actions to replace the disks on esnode1 and stabilize the cluster are done, so this task can be marked as resolved.
The remaining work will be handled in dedicated tasks.