staging: Disk error on storage1
Closed, Migrated

Description

An error was detected on one of the disks of storage1:

root@storage1:~# zpool status data
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 536G in 07:25:27 with 0 errors on Sun Apr 11 11:08:26 2021
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            DEGRADED     0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e3868                      ONLINE       0     0     0
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	  mirror-2                                      DEGRADED     0     0     0
	    spare-0                                     DEGRADED     0     0     0
	      wwn-0x5000c500a22f48c9                    FAULTED     56     1    21  too many errors
	      wwn-0x5000c500a23d19b6                    ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    wwn-0x5000c500a23e4511                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500a23d19b6                        INUSE     currently in use
	  wwn-0x5000c500c4be3956                        AVAIL
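
As an illustration only (not part of the original investigation), the faulted device can be picked out of this output programmatically. The sketch below is a minimal parser, assuming plain integer error counters; the function name `devices_with_errors` and the shortened sample text are made up for the example.

```python
import re

# Shortened copy of the config section of the `zpool status data` output above.
STATUS = """\
    NAME                                            STATE     READ WRITE CKSUM
    data                                            DEGRADED     0     0     0
      mirror-2                                      DEGRADED     0     0     0
        spare-0                                     DEGRADED     0     0     0
          wwn-0x5000c500a22f48c9                    FAULTED     56     1    21  too many errors
          wwn-0x5000c500a23d19b6                    ONLINE       0     0     0
        wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
"""

# name, state, then three integer counters (READ WRITE CKSUM).
# Assumption: counters are plain integers; abbreviated values such as "1.2K" are not handled.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for disks that are not clean."""
    found = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or a line without counters
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        # Only leaf disks (wwn-*) are reported, not the pool or the mirror/spare vdevs.
        if name.startswith("wwn-") and (state != "ONLINE" or read or write or cksum):
            found.append((name, state, read, write, cksum))
    return found


print(devices_with_errors(STATUS))
```

On the sample above this reports only wwn-0x5000c500a22f48c9, with its 56 read, 1 write and 21 checksum errors.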

Action:

  • stabilize the zfs pool
  • check whether the disk can be replaced (a quota of 2 disks remains according to the Seagate rules)

Event Timeline

vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Apr 12 2021, 12:09 PM
vsellier claimed this task.

The new failing drive is /dev/sdc

root@storage1:~# ls -al /dev/disk/by-id/ | grep wwn-0x5000c500a22f48c9
lrwxrwxrwx 1 root root    9 Apr 11 03:42 wwn-0x5000c500a22f48c9 -> ../../sdc
lrwxrwxrwx 1 root root   10 Mar 11 17:08 wwn-0x5000c500a22f48c9-part1 -> ../../sdc1
lrwxrwxrwx 1 root root   10 Mar 11 17:08 wwn-0x5000c500a22f48c9-part9 -> ../../sdc9
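
The same lookup can be scripted by resolving the by-id symlink. This is a minimal sketch, not the command used in the task; the helper name `resolve_by_id` is hypothetical and the default directory is the one shown in the listing above.

```python
import os


def resolve_by_id(wwn, by_id_dir="/dev/disk/by-id"):
    """Return the device node that the by-id symlink for `wwn` points to."""
    # realpath follows the relative link (e.g. ../../sdc) to an absolute path.
    return os.path.realpath(os.path.join(by_id_dir, wwn))


if __name__ == "__main__":
    # WWN taken from the zpool status output; on storage1 the listing above
    # shows this link pointing to sdc.
    print(resolve_by_id("wwn-0x5000c500a22f48c9"))
```

Kernel names such as sdc can change across reboots or hot-plug events, which is one reason the pool refers to its disks by WWN.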

root@storage1:~# smartctl -a /dev/disk/by-id/wwn-0x5000c500a22f48c9
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-0.bpo.3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST6000NM0115-1YZ110
Serial Number:    ZAD0RCAZ
LU WWN Device Id: 5 000c50 0a22f48c9
Firmware Version: SN02
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 12 10:44:52 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 562) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x70bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       66889024
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   096   060   045    Pre-fail  Always       -       8275086415
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28216 (247 223 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   067   040    Old_age   Always       -       29 (Min/Max 27/33)
191 G-Sense_Error_Rate      0x0032   098   098   000    Old_age   Always       -       5919
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       91
193 Load_Cycle_Count        0x0032   057   057   000    Old_age   Always       -       87489
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 10 0 0 0)
195 Hardware_ECC_Recovered  0x001a   001   001   000    Old_age   Always       -       66889024
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       17147h+14m+34.441s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       15718194971
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       116945196920

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 28185 hours (1174 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  44d+01:37:51.537  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  44d+01:37:48.673  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00  44d+01:37:48.672  READ FPDMA QUEUED
  60 00 d0 ff ff ff 4f 00  44d+01:37:48.671  READ FPDMA QUEUED
  60 00 b8 ff ff ff 4f 00  44d+01:37:48.669  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     25946         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
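
The power-on counters in this report allow a rough dating of the logged error. The sketch below is a worked calculation under two assumptions that the task does not state: the drive stayed powered on between the error and the report, and the timestamps printed by the server are in UTC. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the smartctl output above.
POWER_ON_HOURS_NOW = 28216       # attribute 9, raw value
POWER_ON_HOURS_AT_ERROR = 28185  # "Error 1 occurred at disk power-on lifetime"
REPORT_TIME = datetime(2021, 4, 12, 10, 44, 52, tzinfo=timezone.utc)  # "Local Time is"

# Values copied from the "scan:" line of the zpool status output (assumed UTC).
RESILVER_END = datetime(2021, 4, 11, 11, 8, 26, tzinfo=timezone.utc)
RESILVER_DURATION = timedelta(hours=7, minutes=25, seconds=27)


def estimate_error_time(report_time, hours_now, hours_at_error):
    """Approximate wall-clock time of the logged error (one-hour granularity)."""
    return report_time - timedelta(hours=hours_now - hours_at_error)


error_time = estimate_error_time(REPORT_TIME, POWER_ON_HOURS_NOW, POWER_ON_HOURS_AT_ERROR)
resilver_start = RESILVER_END - RESILVER_DURATION

print(divmod(POWER_ON_HOURS_AT_ERROR, 24))  # (days, hours) of power-on lifetime
print(error_time.isoformat())
print(resilver_start.isoformat())
```

The error was logged 31 power-on hours before the report, which places it around 03:44 on Apr 11. Within the one-hour resolution of the counter, that agrees with the resilver start derived from the scan line (03:42:59) and with the Apr 11 03:42 timestamp on the by-id symlink.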

It was replaced by sdj

The failing disk was removed from the pool:

root@storage1:~# zpool detach data wwn-0x5000c500a22f48c9

root@storage1:~# zpool status
  pool: data
 state: ONLINE
  scan: resilvered 536G in 07:25:27 with 0 errors on Sun Apr 11 11:08:26 2021
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e3868                      ONLINE       0     0     0
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    wwn-0x5000c500a23d19b6                      ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    wwn-0x5000c500a23e4511                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500c4be3956                        AVAIL   

errors: No known data errors

There are 2 disks with errors that should now be replaced:

  • /dev/sdb (wwn-0x5000c500a23e3868): an old one
  • /dev/sdc (wwn-0x5000c500a22f48c9): the disk just removed from the pool

/dev/sdb must be removed from the pool before it can be replaced.
To avoid using the last spare, mirror-1 can be removed from the pool and its healthy disk, wwn-0x5000c500a22eed6f, declared as a spare.
This will secure the data during the replacement.
It is possible because there is plenty of free space remaining on the server:

root@storage1:~# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data                                            27.3T  2.50T  24.8T        -         -    19%     9%  1.00x    ONLINE  -
  mirror                                        5.45T   512G  4.95T        -         -    19%  9.17%      -  ONLINE  
    wwn-0x5000c500a22eef5e                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e85cf                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   513G  4.95T        -         -    20%  9.18%      -  ONLINE  
    wwn-0x5000c500a23e3868                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a22eed6f                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   513G  4.95T        -         -    20%  9.17%      -  ONLINE  
    wwn-0x5000c500a23d19b6                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a22ef2c4                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   513G  4.95T        -         -    19%  9.17%      -  ONLINE  
    wwn-0x5000c500a23e7af4                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23d253b                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   513G  4.95T        -         -    20%  9.18%      -  ONLINE  
    wwn-0x5000c500a23cf9ba                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e4511                          -      -      -        -         -      -      -      -  ONLINE  
cache                                               -      -      -        -         -      -      -      -  -
  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN   349G   175G   175G        -         -     0%  50.0%      -  ONLINE  
spare                                               -      -      -        -         -      -      -      -  -
  wwn-0x5000c500c4be3956                            -      -      -        -         -      -      -      -  AVAIL
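
The free-space argument can be checked with a quick calculation. This is a sketch under simplifying assumptions: the mirrors are named in the order of the zpool status output, T and G are treated as binary units, and the evacuated data is spread evenly over the remaining mirrors, which ZFS does not guarantee. The function names are made up for the example.

```python
# Per-mirror figures copied from the `zpool list -v` output above.
ALLOC_G = {"mirror-0": 512, "mirror-1": 513, "mirror-2": 513, "mirror-3": 513, "mirror-4": 513}
FREE_T_PER_MIRROR = 4.95


def can_evacuate(alloc_g, free_t_per_mirror, victim):
    """Check that the data on `victim` fits in the free space of the other mirrors."""
    others = [name for name in alloc_g if name != victim]
    total_free_g = free_t_per_mirror * 1024 * len(others)
    return alloc_g[victim] <= total_free_g, total_free_g


def even_split(alloc_g, victim):
    """Allocation per remaining mirror if the evacuated data were spread evenly."""
    others = [name for name in alloc_g if name != victim]
    share = alloc_g[victim] / len(others)
    return {name: alloc_g[name] + share for name in others}


fits, total_free_g = can_evacuate(ALLOC_G, FREE_T_PER_MIRROR, "mirror-1")
print(fits, round(total_free_g))
print(even_split(ALLOC_G, "mirror-1"))
```

The 513G on mirror-1 is small compared with the roughly 19.8T free on the other four mirrors. An even split would give about 640G to 641G per mirror, close to the 640G to 642G reported once the removal has completed.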

The mirror-1 removal is in progress:

root@storage1:~# zpool remove data mirror-1

root@storage1:~# zpool status -v
  pool: data
 state: ONLINE
  scan: resilvered 536G in 07:25:27 with 0 errors on Sun Apr 11 11:08:26 2021
remove: Evacuation of mirror in progress since Mon Apr 12 12:17:57 2021
    2.69G copied out of 513G at 67.2M/s, 0.52% done, 2h9m to go  <---------------------- status
config:

	NAME                                            STATE     READ WRITE CKSUM
	data                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    wwn-0x5000c500a22eef5e                      ONLINE       0     0     0
	    wwn-0x5000c500a23e85cf                      ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e3868                      ONLINE       0     0     0
	    wwn-0x5000c500a22eed6f                      ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    wwn-0x5000c500a23d19b6                      ONLINE       0     0     0
	    wwn-0x5000c500a22ef2c4                      ONLINE       0     0     0
	  mirror-3                                      ONLINE       0     0     0
	    wwn-0x5000c500a23e7af4                      ONLINE       0     0     0
	    wwn-0x5000c500a23d253b                      ONLINE       0     0     0
	  mirror-4                                      ONLINE       0     0     0
	    wwn-0x5000c500a23cf9ba                      ONLINE       0     0     0
	    wwn-0x5000c500a23e4511                      ONLINE       0     0     0
	cache
	  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN  ONLINE       0     0     0
	spares
	  wwn-0x5000c500c4be3956                        AVAIL   

errors: No known data errors
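
The figures on the status line can be reproduced by hand. The sketch below recomputes the percentage and the remaining time, assuming that G and M are binary units and that the copy rate stays constant; `evacuation_eta` is a hypothetical helper, not a ZFS API.

```python
def evacuation_eta(copied_g, total_g, rate_m_per_s):
    """Return (percent done, hours left, minutes left) for an evacuation."""
    percent = copied_g / total_g * 100
    remaining_s = (total_g - copied_g) * 1024 / rate_m_per_s
    hours, rest = divmod(int(remaining_s), 3600)
    return round(percent, 2), hours, rest // 60


# Values copied from the "remove:" section above.
print(evacuation_eta(2.69, 513, 67.2))
```

With the values above this gives 0.52% done and 2 hours 9 minutes remaining, matching the reported status.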

The mirror has been removed from the pool:

root@storage1:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data  21.8T  2.50T  19.3T        -         -    20%    11%  1.00x    ONLINE  -

root@storage1:~# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data                                            21.8T  2.50T  19.3T        -         -    20%    11%  1.00x    ONLINE  -
  mirror                                        5.45T   640G  4.83T        -         -    20%  11.4%      -  ONLINE  
    wwn-0x5000c500a22eef5e                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e85cf                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   640G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23d19b6                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a22ef2c4                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   641G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23e7af4                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23d253b                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   642G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23cf9ba                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e4511                          -      -      -        -         -      -      -      -  ONLINE  
cache                                               -      -      -        -         -      -      -      -  -
  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN   349G   175G   174G        -         -     0%  50.2%      -  ONLINE  
spare                                               -      -      -        -         -      -      -      -  -
  wwn-0x5000c500c4be3956                            -      -      -        -         -      -      -      -  AVAIL
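
As a sanity check, the new pool size and capacity follow from the per-mirror size. The sketch below assumes every mirror contributes the rounded 5.45T shown in the listing, so the results are only accurate to the displayed precision; `pool_summary` is an illustrative name.

```python
MIRROR_SIZE_T = 5.45  # size of each mirror, from `zpool list -v`
ALLOC_T = 2.50        # allocated space, unchanged by the removal


def pool_summary(n_mirrors):
    """Return (pool size in T, capacity used in percent) for n equal mirrors."""
    size_t = n_mirrors * MIRROR_SIZE_T
    return size_t, ALLOC_T / size_t * 100


for n in (5, 4):
    size_t, cap = pool_summary(n)
    print(n, round(size_t, 2), round(cap, 1))
```

Five mirrors give about 27.25T at 9.2% used and four give 21.8T at 11.5% used, in line with the 27.3T/9% and 21.8T/11% shown before and after the removal.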

wwn-0x5000c500a22eed6f can be declared as a spare disk:

root@storage1:~# zpool add data spare wwn-0x5000c500a22eed6f

root@storage1:~# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data                                            21.8T  2.50T  19.3T        -         -    20%    11%  1.00x    ONLINE  -
  mirror                                        5.45T   640G  4.83T        -         -    20%  11.4%      -  ONLINE  
    wwn-0x5000c500a22eef5e                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e85cf                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   640G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23d19b6                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a22ef2c4                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   641G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23e7af4                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23d253b                          -      -      -        -         -      -      -      -  ONLINE  
  mirror                                        5.45T   642G  4.83T        -         -    21%  11.5%      -  ONLINE  
    wwn-0x5000c500a23cf9ba                          -      -      -        -         -      -      -      -  ONLINE  
    wwn-0x5000c500a23e4511                          -      -      -        -         -      -      -      -  ONLINE  
cache                                               -      -      -        -         -      -      -      -  -
  nvme-INTEL_SSDPED1K375GAQ_FUKS70860038375AGN   349G   175G   174G        -         -     0%  50.2%      -  ONLINE  
spare                                               -      -      -        -         -      -      -      -  -
  wwn-0x5000c500c4be3956                            -      -      -        -         -      -      -      -  AVAIL   
  wwn-0x5000c500a22eed6f                            -      -      -        -         -      -      -      -  AVAIL

vsellier changed the status of subtask T3243: Replace /dev/sdb and /dev/sdc on storage1.staging from Open to Work in Progress.