
staging - Disk errors on storage1
Closed, Migrated · Edits Locked

Description

A new disk error was detected by ZFS on storage1:

root@storage1:~# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 830G in 11:21:40 with 0 errors on Sun Jun 13 12:33:12 2021
remove: Removal of vdev 1 copied 513G in 3h57m, completed on Mon Apr 12 16:15:40 2021
    108M memory used for removed device mappings
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            DEGRADED     0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    wwn-0x5000c500a23d19b6                      ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      DEGRADED     0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    spare-1                                     DEGRADED     6     0     0
	      wwn-0x5000c500a23e4511                    DEGRADED    15     2    49  too many errors
	      wwn-0x5000c500c4be3956                    ONLINE       0     0 3.45K
	  mirror-5                                      ONLINE       0     0     0
	    wwn-0x5000c500d5dda886                      ONLINE       0     0     0
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500c4be3956                        INUSE     currently in use
	  wwn-0x5000c500d5de652a                        AVAIL   

errors: No known data errors
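The status output above leaves two remediation paths: `zpool clear` for transient errors, `zpool replace` for a failing device. As a minimal sketch (not part of the task itself), the degraded devices can be extracted from the `zpool status -v` output to drive that decision; the helper name and the sample line piped in here are illustrative:

```shell
# Hypothetical helper: list devices that `zpool status -v` reports as DEGRADED.
# In production the input would come from `zpool status -v data`; here a sample
# line from the output above is piped in so the sketch is self-contained.
degraded_devices() {
  awk '$2 == "DEGRADED" && $1 ~ /^wwn-/ { print $1 }'
}

printf '      wwn-0x5000c500a23e4511  DEGRADED    15     2    49  too many errors\n' \
  | degraded_devices
```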

The disk has several reallocated sectors (SMART Reallocated_Sector_Ct = 16):

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   064   044    Pre-fail  Always       -       49244880
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x000f   094   060   045    Pre-fail  Always       -       2591822807
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       29725 (204 244 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   063   040    Old_age   Always       -       30 (Min/Max 28/37)
191 G-Sense_Error_Rate      0x0032   097   097   000    Old_age   Always       -       6251
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       282
193 Load_Cycle_Count        0x0032   073   073   000    Old_age   Always       -       55159
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 11 0 0 0)
195 Hardware_ECC_Recovered  0x001a   028   001   000    Old_age   Always       -       49244880
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       21543h+27m+28.037s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       19384770747
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       123220748329
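A SMART table like the one above can be screened mechanically. A minimal sketch, assuming the standard `smartctl -A` column layout (VALUE in column 4, THRESH in column 6, RAW_VALUE in column 10); the function name is illustrative:

```shell
# Hypothetical screen over `smartctl -A` output: flag attributes whose
# normalized VALUE has dropped to THRESH, plus non-zero reallocated (ID 5)
# or pending (ID 197) sector counts. Multi-token raw values (e.g. "0 0 0")
# only contribute their first token here.
check_smart() {
  awk '$1 ~ /^[0-9]+$/ {
    id = $1; name = $2; value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
    if (thresh > 0 && value <= thresh)
      print name " at threshold (value " value " <= " thresh ")"
    if ((id == 5 || id == 197) && raw > 0)
      print name " raw count " raw
  }'
}

printf '  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       16\n' \
  | check_smart
```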

Event Timeline

vsellier created this task.

Serial number:

root@storage1:~# smartctl -a /dev/disk/by-id/wwn-0x5000c500a23e4511 | grep -B1 "Serial Number"
Device Model:     ST6000NM0115-1YZ110
Serial Number:    ZAD0SDDK

According to the Seagate site, the disk is still under warranty [1]:

Your Product Exos 7E
Model Number ST6000NM0115
Serial Number ZAD0SDDK
Warranty Valid Until February 24, 2022

Regional warranty restrictions apply [2]

[1] https://www.seagate.com/support/warranty-and-replacements/

[2]

This product was originally sold in a different region. Warranty claims may not be accepted for products returned outside the country/region where the product was first shipped to a Seagate Authorized Distributor.

It appears ZFS performed the replacement itself, selecting a spare disk from the pool's spares.
We only needed to detach the failing disk, which is now done. [1]

We now have one spare disk left. The failed disk is out of the ZFS pool.

[1]

root@storage1:~# zpool detach data wwn-0x5000c500a23e4511
root@storage1:~# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data                                            27.3T  3.25T  24.0T        -         -    19%    11%  1.00x    ONLINE  -
  mirror                                        5.45T   798G  4.67T        -         -    24%  14.3%      -  ONLINE
    wwn-0x5000c500a22eef5e                          -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500a23e85cf                          -      -      -        -         -      -      -      -  ONLINE
  mirror                                        5.45T   796G  4.68T        -         -    24%  14.3%      -  ONLINE
    wwn-0x5000c500a23d19b6                          -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500a22ef2c4                          -      -      -        -         -      -      -      -  ONLINE
  mirror                                        5.45T   799G  4.67T        -         -    24%  14.3%      -  ONLINE
    wwn-0x5000c500a23e7af4                          -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500a23d253b                          -      -      -        -         -      -      -      -  ONLINE
  mirror                                        5.45T   800G  4.67T        -         -    24%  14.3%      -  ONLINE
    wwn-0x5000c500a23cf9ba                          -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500c4be3956                          -      -      -        -         -      -      -      -  ONLINE
  mirror                                        5.45T   139G  5.32T        -         -     3%  2.48%      -  ONLINE
    wwn-0x5000c500d5dda886                          -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500a22eed6f                          -      -      -        -         -      -      -      -  ONLINE
cache                                               -      -      -        -         -      -      -      -  -
  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN   349G   305G  44.1G        -         -     0%  87.4%      -  ONLINE
spare                                               -      -      -        -         -      -      -      -  -
  wwn-0x5000c500d5de652a                            -      -      -        -         -      -      -      -  AVAIL
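Since only one spare remains, the number of available spares is worth watching. A minimal sketch over `zpool list -v`-style output (on storage1 the real input would be `zpool list -v`; a sample line mirroring the spare entry above is piped in so the sketch is self-contained):

```shell
# Hypothetical count of spares still AVAIL; with the pool above it would
# report 1, which could feed a monitoring check.
count_avail_spares() {
  awk '$NF == "AVAIL" { n++ } END { print n + 0 }'
}

printf '  wwn-0x5000c500d5de652a  -  -  -  -  -  -  -  -  AVAIL\n' \
  | count_avail_spares
```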
ardumont changed the task status from Open to Work in Progress. Jun 14 2021, 3:01 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

This is the disk's position according to a picture of the server taken by Christophe:

The LED will be turned off to avoid it catching anyone's attention:

root@storage1:~# ls -al /dev/disk/by-id | grep wwn-0x5000c500a23e4511
lrwxrwxrwx 1 root root    9 Jun 13 12:12 wwn-0x5000c500a23e4511 -> ../../sdl
lrwxrwxrwx 1 root root   10 Mar 11 17:08 wwn-0x5000c500a23e4511-part1 -> ../../sdl1
lrwxrwxrwx 1 root root   10 Mar 11 17:08 wwn-0x5000c500a23e4511-part9 -> ../../sdl9

root@storage1:~# ledctl off=/dev/sdl
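The `ls | grep` lookup above can also be done with `readlink -f`, which resolves a by-id symlink directly to the kernel device. A sketch demonstrated on a temporary symlink, since the real /dev/disk/by-id entry only exists on storage1:

```shell
# Resolve a by-id style symlink to its target, the way
# `readlink -f /dev/disk/by-id/wwn-0x5000c500a23e4511` resolves to
# /dev/sdl on storage1. A temporary stand-in link is used here.
tmp=$(mktemp -d)
ln -s /dev/null "$tmp/wwn-example"   # stand-in for the real by-id link
readlink -f "$tmp/wwn-example"       # prints /dev/null
rm -r "$tmp"
```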