
Disk replacement on esnode1
Closed, Migrated

Description

Errors were detected on one of the disks of esnode1. It is one of the disks replaced recently (T2888). Its power-on time is 12d.
SMART data:

root@esnode1:~# smartctl -a /dev/sdb
=== START OF INFORMATION SECTION ===
Device Model:     ST2000NM012A-2MP130
Serial Number:    WJC054FE
LU WWN Device Id: 5 000c50 0ccd5b501
Add. Product Id:  DELL(tm)
Firmware Version: CAJ8
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jan 18 17:57:40 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   082   064   044    Pre-fail  Always       -       172358196
  3 Spin_Up_Time            0x0103   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0133   100   100   010    Pre-fail  Always       -       5
  7 Seek_Error_Rate         0x000f   075   060   045    Pre-fail  Always       -       35130384
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       315
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       1
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   060   040    Old_age   Always       -       36 (Min/Max 21/40)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   036   040   000    Old_age   Always       -       36 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       172358196
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       311 (175 216 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       6932528030
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3982842720
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
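
The report above shows non-zero raw values for Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable. A minimal sketch for pulling such attributes out of `smartctl -A` output follows; the helper name `smart_bad_sectors` is hypothetical, and treating attribute IDs 5, 197 and 198 as the alert set is an assumption, not something the task states.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "ID NAME RAW_VALUE" for sector-health attributes with a non-zero raw value.
# The attribute set (5, 197, 198) is an assumed alert set.
smart_bad_sectors() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage (needs root and smartmontools):
#   smartctl -A /dev/sdb | smart_bad_sectors
```

An empty output means none of the three counters has moved from zero.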

Event Timeline

vsellier triaged this task as Normal priority. (Jan 18 2021, 7:02 PM)
vsellier created this task.
vsellier moved this task from Backlog to Weekly backlog on the System administration board.
vsellier changed the task status from Open to Work in Progress. (Jan 28 2021, 3:44 PM)
vsellier moved this task from Weekly backlog to in-progress on the System administration board.

A ticket was opened with Dell support.
The disk should be delivered on Monday, 1 February 2021; the DSI has been informed.

Shard deallocation from esnode1 started:

❯ export ES_NODE=192.168.100.61:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{ 
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.61"
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.61"
          }
        }
      }
    }
  }
}

esnode1 no longer holds any shards and is ready to be stopped:

❯ curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  1482                                                                                         UNASSIGNED
     0           0b     1.7tb        5tb      6.7tb           25 192.168.100.61 192.168.100.61 esnode1
  3767        3.7tb     3.7tb        3tb      6.7tb           55 192.168.100.62 192.168.100.62 esnode2
  3713        3.6tb     3.6tb      3.1tb      6.7tb           54 192.168.100.63 192.168.100.63 esnode3
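
Instead of re-running the command by hand, the drain can be polled until the excluded node reports zero shards. This is only a sketch: the helper name `node_shard_count` and the 30-second polling interval are assumptions, and it relies on the column layout shown above (shard count first, node name last).

```shell
# Hypothetical helper: reads `_cat/allocation?v` output on stdin and prints
# the shard count of the node named in $1 (prints nothing if the node is absent).
node_shard_count() {
    awk -v node="$1" 'NR > 1 && $NF == node { print $1 }'
}

# Usage, assuming ES_NODE is exported as in the commands above:
#   until [ "$(curl -s "http://$ES_NODE/_cat/allocation?v" | node_shard_count esnode1)" = "0" ]; do
#       sleep 30
#   done
```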

esnode1 will be left in the cluster until the replacement work starts, so that the cluster keeps 3 voting nodes in case one of the other nodes has a problem in the meantime.

The disk has been replaced:

# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MG04ACA200NY
Serial Number:    Z0R3K6ZLF7EE
LU WWN Device Id: 5 000039 a7b800cf5
Add. Product Id:  DELL(tm)
Firmware Version: FK5D
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb  2 16:50:51 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       3390
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 21/32)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       0
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       8510

A long SMART self-test has been launched to check the new disk.

  • partition table recreated (copied from /dev/sda):
# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
  • ZFS pool recreated using the WWN ids:
root@esnode1:/etc/zfs# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode1:/etc/zfs# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   152K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
  • server restarted to check that everything is OK
  • allocation reactivated:
❯ export ES_NODE=192.168.100.61:9200 
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : null
    }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : { }
}
  • shard relocation to esnode1 is in progress:
❯ curl -s http://$ES_NODE/_cat/health\?v; echo; curl -s http://$ES_NODE/_cat/allocation\?v\&s=node
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1612285969 17:12:49  swh-logging-prod green           3         3   8974 4487    2    0        0             0                  -                100.0%

shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
     5        2.5gb     2.5gb      6.7tb      6.7tb            0 192.168.100.61 192.168.100.61 esnode1
  4485        4.5tb     4.5tb      2.2tb      6.7tb           67 192.168.100.62 192.168.100.62 esnode2
  4484        4.5tb     4.5tb      2.2tb      6.7tb           67 192.168.100.63 192.168.100.63 esnode3
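
To follow the rebalancing without rereading the whole table, one option is to extract a single column from the `_cat/health?v` output by its header name, for example `relo` (shards currently relocating). The helper name `health_column` below is hypothetical; it assumes the two-line header-plus-values layout shown above.

```shell
# Hypothetical helper: reads `_cat/health?v` output on stdin (header line
# followed by one data line) and prints the value of the column named in $1.
health_column() {
    awk -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx     { print $idx; exit }
    '
}

# Usage, assuming ES_NODE is exported as in the commands above:
#   curl -s "http://$ES_NODE/_cat/health?v" | health_column relo
#   curl -s "http://$ES_NODE/_cat/health?v" | health_column status
```

Looking the column up by name rather than by position keeps the sketch working if the set of columns differs between Elasticsearch versions.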

So far so good: the SMART self-test is done and did not find any errors:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         9         -
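
For reference, a long self-test is typically started with `smartctl -t long /dev/sdb` and its log read back with `smartctl -l selftest /dev/sdb`. The sketch below checks the most recent log entry; the helper name `latest_selftest_ok` is hypothetical and it assumes the log layout shown above, where entry `# 1` is the latest test.

```shell
# Hypothetical helper: reads a SMART self-test log on stdin and returns 0
# only if the most recent entry ("# 1") reads "Completed without error".
latest_selftest_ok() {
    awk '
        $1 == "#" && $2 == 1 { found = 1; ok = ($0 ~ /Completed without error/) }
        END { exit !(found && ok) }
    '
}

# Usage (needs root and smartmontools):
#   smartctl -l selftest /dev/sdb | latest_selftest_ok && echo "self-test passed"
```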

The shard allocation is done, so the task can be marked as resolved.