Replace out of order disks on db1.staging and storage1.staging
Closed, Migrated

Description

At least one disk on each server has dead sectors.
They should still be covered by the manufacturer warranty.

Almost dead disks on db1.staging:
(suspicious) /dev/sda
Model Family: Seagate Enterprise Capacity 3.5 HDD
Device Model: ST6000NM0115-1YZ110
Serial Number: ZAD27CCS
(failing) /dev/sdb
Model Family: Seagate Enterprise Capacity 3.5 HDD
Device Model: ST6000NM0115-1YZ110
Serial Number: ZAD27C4P

Almost dead disks on storage1.staging:
/dev/sda
Model Family: Seagate Enterprise Capacity 3.5 HDD
Device Model: ST6000NM0115-1YZ110
Serial Number: ZAD0S2NR

/dev/sdb
Model Family: Seagate Enterprise Capacity 3.5 HDD
Device Model: ST6000NM0115-1YZ110
Serial Number: ZAD0SD5D

Event Timeline

vsellier changed the task status from Open to Work in Progress. Jan 7 2021, 12:02 PM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

Complete disk statuses:

  • db1.staging:
root@db1:~# ls  /dev/sd{a..n} | xargs -t -n1 smartctl -a | grep -e "/dev/sd?" -e Reallocated_Sector_Ct -e "Model Family" -e "Serial Number" -e "Reported_Uncorrect" -e lifetime -e "Extended offline" -e "Offline_Uncorrectable"
smartctl -a /dev/sda 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27CCS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdb 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27C4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
# 1  Extended offline    Completed: read failure       70%     25421         4131034152
smartctl -a /dev/sdc 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DW0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdd 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A44
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sde 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27BA5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdf 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DCG
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdg 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD270KS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdh 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdi 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27E48
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdj 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD26YN2
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdk 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279XY
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdl 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279ZX
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25427         -
smartctl -a /dev/sdm 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV71810017150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
smartctl -a /dev/sdn 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV718004DM150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
  • storage1.staging:
root@storage1:~# ls  /dev/sd{a..n} | xargs -t -n1 smartctl -a | grep -e "/dev/sd?" -e Reallocated_Sector_Ct -e "Model Family" -e "Serial Number" -e "Reported_Uncorrect" -e lifetime -e "Extended offline" -e "Offline_Uncorrectable"
smartctl -a /dev/sda 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0S2NR
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
Error 1 occurred at disk power-on lifetime: 25327 hours (1055 days + 7 hours)
smartctl -a /dev/sdb 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0SD5D
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdc 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0RCAZ
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdd 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0RZ24
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sde 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0SFYJ
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdf 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0SCMM
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdg 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0S1H5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdh 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0S1L9
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdi 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0S1KC
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdj 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0SFLG
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdk 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0S6L9
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdl 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD0SDDK
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
smartctl -a /dev/sdm 
Serial Number:    PHYS7326018M240AGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
smartctl -a /dev/sdn 
Serial Number:    PHYS7326018Q240AGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0

There are 2 suspicious disks on db1 and 1 on storage1.
The SMART self-test was never run on storage1. I will launch one on both servers to get fresh data.
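The filtered smartctl output above can also be screened mechanically. The following is a minimal sketch, assuming the grep-filtered format shown in this task; the function name `suspicious_disks` and the rule "any non-zero raw value on the three watched attributes is suspicious" are assumptions made for illustration, not an official threshold.

```python
import re

# Attributes watched in this task. Treating any non-zero raw value as
# suspicious is an assumption of this sketch, not a vendor threshold.
WATCHED = {"Reallocated_Sector_Ct", "Reported_Uncorrect", "Offline_Uncorrectable"}


def suspicious_disks(report: str) -> dict:
    """Map each device to its watched attributes that have a non-zero raw value."""
    findings = {}
    device = None
    for line in report.splitlines():
        # Lines echoed by `xargs -t` tell us which device the next rows belong to.
        header = re.match(r"smartctl -a (/dev/\S+)", line.strip())
        if header:
            device = header.group(1)
            continue
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if device and len(fields) >= 10 and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                findings.setdefault(device, {})[fields[1]] = raw
    return findings


# Excerpt of the db1 output from this task (sda, sdb, sdc).
SAMPLE = """\
smartctl -a /dev/sda
Serial Number:    ZAD27CCS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdb
Serial Number:    ZAD27C4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
smartctl -a /dev/sdc
Serial Number:    ZAD27DW0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for dev, attrs in suspicious_disks(SAMPLE).items():
        print(dev, attrs)
```

Note that `xargs -t` prints the echoed command lines to stderr, so both streams would need to be captured for the device headers to be present in the parsed text.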

Tests launched:

root@db1:~# echo /dev/sd{a..n} | xargs -t -n1 smartctl -t long
smartctl -t long /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 613 minutes for test to complete.
Test will complete after Thu Jan  7 21:33:08 2021
...
root@storage1:~#  echo /dev/sd{a..n} | xargs -t -n1 smartctl -t long
smartctl -t long /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 523 minutes for test to complete.
Test will complete after Thu Jan  7 20:03:52 2021
...
vsellier updated the task description.

The test is still running on one disk on storage1 (sdb). No new errors were discovered on any of the other disks.

The test of /dev/sdb finally ended ... in error:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%     26004         2559298584
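A self-test log row like the one above can be turned into structured data to spot failures automatically. This is a rough sketch, assuming the columns are separated by at least two spaces as in the output shown here; the helper name `parse_selftest_row` and the rule "a status containing the word failure means the test failed" are illustrative assumptions.

```python
import re

# One row of the SMART self-test log, e.g.
# "# 1  Extended offline    Completed: read failure       80%     26004         2559298584"
# Assumes columns are separated by two or more spaces, as in the output above.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_row(line: str):
    """Return a dict describing one self-test row, or None if the line is not a row."""
    match = SELFTEST_ROW.match(line.strip())
    if match is None:
        return None
    lba = match.group("lba")
    status = match.group("status")
    return {
        "description": match.group("desc"),
        "status": status,
        "remaining_percent": int(match.group("remaining")),
        "lifetime_hours": int(match.group("hours")),
        # "-" means no error LBA was recorded.
        "first_error_lba": None if lba == "-" else int(lba),
        # Assumption of this sketch: any status mentioning "failure" is a failed test.
        "failed": "failure" in status,
    }


if __name__ == "__main__":
    row = "# 1  Extended offline    Completed: read failure       80%     26004         2559298584"
    print(parse_selftest_row(row))
```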

So we have 2 disks to replace on each server. What's weird is that the 2 disks to replace are at the same position on each server...

The model number to use on the request is: ST6000NM0115
There is an obscure message limiting the number of returns per country per year to 3 (!):

The priority is storage1 as there is one offline disk and only one remaining spare. So its 2 disks will be replaced.
For db1, /dev/sdb (ZAD27C4P) will be replaced as it's a spare and it will be simpler to replace.


  • db1:
root@db1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:01 with 0 errors on Sun Jan 10 00:40:03 2021
config:

	NAME                                           STATE     READ WRITE CKSUM
	data                                           ONLINE       0     0     0
	  mirror-0                                     ONLINE       0     0     0
	    wwn-0x5000c500a44dff5d                     ONLINE       0     0     0    <--- (sda) Reallocated sectors
	    wwn-0x5000c500a44dcf3b                     ONLINE       0     0     0
	  mirror-1                                     ONLINE       0     0     0
	    wwn-0x5000c500a44cf496                     ONLINE       0     0     0
	    wwn-0x5000c500a44ef795                     ONLINE       0     0     0
	  mirror-2                                     ONLINE       0     0     0
	    wwn-0x5000c500a44dcd4e                     ONLINE       0     0     0
	    wwn-0x5000c500a447fa59                     ONLINE       0     0     0
	  mirror-3                                     ONLINE       0     0     0
	    wwn-0x5000c500a44e4854                     ONLINE       0     0     0
	    wwn-0x5000c500a44d8240                     ONLINE       0     0     0
	  mirror-4                                     ONLINE       0     0     0
	    wwn-0x5000c500a44f02c4                     ONLINE       0     0     0
	    wwn-0x5000c500a44ee54b                     ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPEDMD800G4_CVFT6484007J800CGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500a44ee887                       AVAIL     <--- (sdb) Current_Pending_sector / Offline Uncorrectable
	  wwn-0x5000c500a44f0709                       AVAIL   

errors: No known data errors
root@db1:~# ls -al /dev/disk/by-id/ | grep "wwn.*sda$"
lrwxrwxrwx 1 root root    9 Dec 31 15:50 wwn-0x5000c500a44dff5d -> ../../sda
root@db1:~# ls -al /dev/disk/by-id/ | grep "wwn.*sdb$"
lrwxrwxrwx 1 root root    9 Dec 31 15:50 wwn-0x5000c500a44ee887 -> ../../sdb
  • storage1
root@storage1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 03:01:30 with 0 errors on Sun Jan 10 03:25:32 2021
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e3868                      ONLINE       0     0     0     <-- (sdb) Current_Pending_sector / Offline Uncorrectable
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    wwn-0x5000c500a22f48c9                      ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    wwn-0x5000c500a23e4511                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500a23d19b6                        AVAIL   

errors: No known data errors
root@storage1:~# ls -al /dev/disk/by-id/ | grep "wwn.*sda$"
lrwxrwxrwx 1 root root    9 Dec 31 15:52 wwn-0x5000c500a22ebed5 -> ../../sda  <--- Offline disk
root@storage1:~# ls -al /dev/disk/by-id/ | grep "wwn.*sdb$"
lrwxrwxrwx 1 root root    9 Dec 31 15:52 wwn-0x5000c500a23e3868 -> ../../sdb
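The `ls -al /dev/disk/by-id/ | grep "wwn.*sdX$"` lookups above resolve a kernel device name to the WWN identifier used by the pool. A minimal sketch of the same lookup follows, assuming the usual layout where each `wwn-*` entry is a symlink to `../../sdX`; the function name `wwn_links` and its `by_id_dir` parameter are hypothetical and only introduced for this example.

```python
import os


def wwn_links(device: str, by_id_dir: str = "/dev/disk/by-id") -> list:
    """Return the wwn-* entries in by_id_dir whose symlink points at `device` (e.g. "sda")."""
    matches = []
    for name in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, name)
        if not name.startswith("wwn-") or not os.path.islink(path):
            continue
        # Compare only the last path component, so "../../sda1" (a partition)
        # does not match "sda", like the `sda$` anchor in the grep above.
        if os.path.basename(os.readlink(path)) == device:
            matches.append(name)
    return matches


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        print(wwn_links("sda"))
```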

well well well

Nothing seems to be done to make a warranty claim easy, between the site that only works when it feels like it and the replacement conditions:

It seems we will have to schedule the replacements progressively.

vlorentz shifted this object from the S1 Public space to the Restricted Space space. Jan 12 2021, 11:37 AM
vlorentz shifted this object from the Restricted Space space to the S1 Public space.

One clarification about these disk replacements. Even though the disks are in error, there are still spares in the ZFS pools:

  • db1 :
root@db1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:01 with 0 errors on Sun Jan 10 00:40:03 2021
config:

	NAME                                           STATE     READ WRITE CKSUM
	data                                           ONLINE       0     0     0
	  mirror-0                                     ONLINE       0     0     0
	    wwn-0x5000c500a44dff5d                     ONLINE       0     0     0  <---- TO REPLACE
	    wwn-0x5000c500a44dcf3b                     ONLINE       0     0     0
	  mirror-1                                     ONLINE       0     0     0
	    wwn-0x5000c500a44cf496                     ONLINE       0     0     0
	    wwn-0x5000c500a44ef795                     ONLINE       0     0     0
	  mirror-2                                     ONLINE       0     0     0
	    wwn-0x5000c500a44dcd4e                     ONLINE       0     0     0
	    wwn-0x5000c500a447fa59                     ONLINE       0     0     0
	  mirror-3                                     ONLINE       0     0     0
	    wwn-0x5000c500a44e4854                     ONLINE       0     0     0
	    wwn-0x5000c500a44d8240                     ONLINE       0     0     0
	  mirror-4                                     ONLINE       0     0     0
	    wwn-0x5000c500a44f02c4                     ONLINE       0     0     0
	    wwn-0x5000c500a44ee54b                     ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPEDMD800G4_CVFT6484007J800CGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500a44ee887                       AVAIL   <---- TO REPLACE
	  wwn-0x5000c500a44f0709                       AVAIL   

errors: No known data errors
  • storage1
root@storage1:~# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 0 days 03:01:30 with 0 errors on Sun Jan 10 03:25:32 2021
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e3868                      ONLINE       0     0     0    <----- TO REPLACE
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    wwn-0x5000c500a22f48c9                      ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    wwn-0x5000c500a23e4511                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500a23d19b6                        AVAIL   

errors: No known data errors
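The number of spares still available can be read from the `zpool status` output above. The following is a small sketch, assuming the layout shown in this task, where the `spares` section is followed by a blank line; the function name `spare_states` is hypothetical.

```python
def spare_states(status_output: str) -> dict:
    """Return {device: state} for the entries listed under the 'spares' section."""
    spares = {}
    in_spares = False
    for line in status_output.splitlines():
        fields = line.split()
        if fields == ["spares"]:
            in_spares = True
        elif in_spares and len(fields) >= 2:
            # First column is the device, second its state; trailing
            # annotations such as "<---- TO REPLACE" are ignored.
            spares[fields[0]] = fields[1]
        else:
            # A blank line or another section header ends the spares section.
            in_spares = False
    return spares


# Tail of the db1 output from this task.
SAMPLE = """\
    cache
      nvme-INTEL_SSDPEDMD800G4_CVFT6484007J800CGN  ONLINE       0     0     0
    spares
      wwn-0x5000c500a44ee887                       AVAIL   <---- TO REPLACE
      wwn-0x5000c500a44f0709                       AVAIL

errors: No known data errors
"""

if __name__ == "__main__":
    states = spare_states(SAMPLE)
    available = [dev for dev, state in states.items() if state == "AVAIL"]
    print(len(available), "spare(s) available:", available)
```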

+ another disk removed from the pool which we will replace first in T3033

The storage1 disks will be replaced in T3243.

Closing this issue, as the disk replacement quota will be used up after T3243.
The disk on db1 looks stable for the moment. It will be removed from the ZFS pool if problems appear.