
Dead on arrival disk in the kafka3 server
Closed, Migrated

Description

One of the kafka3 disks has been reported as failed by ZFS.

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 62
  class: statechange
  state: FAULTED
   host: kafka3
   time: 2020-08-26 13:32:07+0000
  vpath: /dev/disk/by-id/wwn-0x50000399f8982a7d-part1
  vphys: pci-0000:18:00.0-scsi-0:0:0:0
  vguid: 0xBD45E7F1FCC8E8CE
  devid: scsi-350000399f8982a7d-part1
   pool: 0x2030E25D0D8D754E

dmesg output:

Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#599 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#599 Sense Key : Hardware Error [deferred] [descriptor] 
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#599 ASC=0x44 <<vendor>>ASCQ=0xa3 
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#599 CDB: Write(16) 8a 00 00 00 00 00 0c 64 81 88 00 00 00 28 00 00
Aug 26 13:30:03 kafka3 kernel: blk_update_request: I/O error, dev sda, sector 207913352 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#598 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Aug 26 13:30:03 kafka3 kernel: zio pool=data vdev=/dev/disk/by-id/wwn-0x50000399f8982a7d-part1 error=5 type=2 offset=106450587648 size=20480 flags=40080c80
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#598 Sense Key : Hardware Error [deferred] [descriptor] 
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#598 ASC=0x44 <<vendor>>ASCQ=0xa3 
Aug 26 13:30:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#598 CDB: Write(16) 8a 00 00 00 00 00 0c 64 81 b0 00 00 00 08 00 00
Aug 26 13:30:03 kafka3 kernel: blk_update_request: I/O error, dev sda, sector 207913392 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Aug 26 13:30:03 kafka3 kernel: zio pool=data vdev=/dev/disk/by-id/wwn-0x50000399f8982a7d-part1 error=5 type=2 offset=106450608128 size=4096 flags=180880
Aug 26 13:31:36 kafka3 kernel: sd 0:0:0:0: Power-on or device reset occurred
Aug 26 13:31:51 kafka3 kernel: sd 0:0:0:0: [sda] tag#630 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
Aug 26 13:31:51 kafka3 kernel: sd 0:0:0:0: [sda] tag#630 Sense Key : Hardware Error [deferred] 
Aug 26 13:31:51 kafka3 kernel: sd 0:0:0:0: [sda] tag#630 ASC=0x44 <<vendor>>ASCQ=0xa3 
Aug 26 13:31:51 kafka3 kernel: sd 0:0:0:0: [sda] tag#630 CDB: Write(16) 8a 00 00 00 00 00 04 cc b1 30 00 00 00 10 00 00
Aug 26 13:31:51 kafka3 kernel: blk_update_request: I/O error, dev sda, sector 80523568 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Aug 26 13:31:51 kafka3 kernel: zio pool=data vdev=/dev/disk/by-id/wwn-0x50000399f8982a7d-part1 error=5 type=2 offset=41227018240 size=8192 flags=40080c80
Aug 26 13:32:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#633 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=1s
Aug 26 13:32:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#633 Sense Key : Hardware Error [deferred] 
Aug 26 13:32:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#633 ASC=0x44 <<vendor>>ASCQ=0xa3 
Aug 26 13:32:03 kafka3 kernel: sd 0:0:0:0: [sda] tag#633 CDB: Write(16) 8a 00 00 00 00 00 08 8a 88 90 00 00 00 40 00 00
Aug 26 13:32:03 kafka3 kernel: blk_update_request: I/O error, dev sda, sector 143296656 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Aug 26 13:32:03 kafka3 kernel: zio pool=data vdev=/dev/disk/by-id/wwn-0x50000399f8982a7d-part1 error=5 type=2 offset=73366839296 size=32768 flags=40080c80

Looks like it is the disk in the first SAS slot.
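As a cross-check of the dmesg output above, the WRITE(16) CDB bytes can be decoded by hand and compared with the sector and zio figures on the neighbouring lines. This is a minimal sketch: the helper name `decode_write16` is made up for illustration, and 512-byte sectors are assumed (the unit the kernel block layer uses for sector numbers).

```python
# Decode a SCSI WRITE(16) CDB as printed by the kernel.
# Layout: byte 0 = opcode 0x8a, bytes 2-9 = big-endian LBA,
# bytes 10-13 = big-endian transfer length in blocks.
SECTOR = 512  # assumption: kernel sector numbers are in 512-byte units


def decode_write16(cdb_hex):
    """Return (lba, blocks) for a WRITE(16) CDB given as spaced hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    blocks = int.from_bytes(cdb[10:14], "big")
    return lba, blocks


# First failed command (tag#599) from the log above.
lba, blocks = decode_write16("8a 00 00 00 00 00 0c 64 81 88 00 00 00 28 00 00")
size = blocks * SECTOR

# Byte offset on the whole disk minus the offset ZFS logged for the -part1 vdev.
gap = lba * SECTOR - 106450587648

print(lba, size, gap)  # 207913352 20480 1048576
```

The decoded LBA and size match the `blk_update_request` sector and the zio `size=20480`. The constant 1 MiB gap would be consistent with the `-part1` partition starting at sector 2048, but the partition table is not shown in this task, so that part is an inference.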

Event Timeline

olasd triaged this task as Unbreak Now! priority. Aug 27 2020, 12:31 PM
olasd created this task.

Running an extended SMART test.

root@kafka3:~# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.7.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TNSHIBA
Product:              MF06SBA800EX
Revision:             EH08
Compliance:           SPC-4
User Capacity:        2,198,989,700,608 bytes [2.19 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Serial number:        20Q0A07LF0GF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Sep  2 08:53:56 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        64 C

Manufactured in week 09 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  18
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
write:        12        0        12        12   227730567168          0.000           0
verify:        0        0         0         0   1601855488          0.000           0

Non-medium error count:       41

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 0  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background short  Completed                   -     847                 - [-   -    -]
# 3  Reserved(7)       Completed                  64       4                 - [-   -    -]
# 4  Background short  Completed                   -       3                 - [-   -    -]

Long (extended) Self Test duration: 45122 seconds [752.0 minutes]

The number of correction algorithm invocations is "a bit" high.

This smartctl output looks like a glitch.
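One way to see why the output looks glitched is to compare its identity strings with the ones the same command reports later in this task. The sketch below (the helper name `bit_diffs` is hypothetical) XORs the characters that differ; it only describes the difference and says nothing about its cause.

```python
def bit_diffs(glitched, clean):
    """Return (position, glitched_char, clean_char, xor) for differing characters."""
    return [
        (i, g, c, ord(g) ^ ord(c))
        for i, (g, c) in enumerate(zip(glitched, clean))
        if g != c
    ]


# (glitched value, value from the later smartctl runs in this task)
pairs = {
    "Vendor": ("TNSHIBA", "TOSHIBA"),
    "Product": ("MF06SBA800EX", "MG06SCA800EY"),
    "Serial number": ("20Q0A07LF0GF", "20Q0A07LF1GF"),
}

for field, (glitched, clean) in pairs.items():
    for pos, g, c, xor in bit_diffs(glitched, clean):
        print(f"{field}[{pos}]: {g!r} vs {c!r}, xor=0x{xor:02x}")
```

Every differing character differs only in its lowest bit (xor=0x01), with that bit cleared in the glitched copy. The trip temperature (64 C here, 65 C later) follows the same pattern.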

Current output after the full SMART test (ignore the cancelled one, I have fat fingers...):

# sudo smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.7.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG06SCA800EY
Revision:             EH08
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000399f8982a7d
Serial number:        20Q0A07LF1GF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Sep  3 14:38:32 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        65 C

Manufactured in week 09 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  18
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1584      1585      1584       4747      60556.729           1
write:         0       12        12        12        204        238.624           0
verify:        0        0         0         0          0          5.897           0

Non-medium error count:       41

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     865                 - [-   -    -]
# 2  Foreground long   Aborted (device reset ?)    -     853                 - [-   -    -]
# 3  Background long   Aborted (by user command)   -     853                 - [-   -    -]
# 4  Background short  Completed                   -     847                 - [-   -    -]
# 5  Reserved(7)       Completed                  64       4                 - [-   -    -]
# 6  Background short  Completed                   -       3                 - [-   -    -]

Long (extended) Self Test duration: 45122 seconds [752.0 minutes]

In the meantime, the dmesg showed the following messages:

Sep 02 09:01:36 kafka3 kernel: sd 0:0:0:0: Power-on or device reset occurred
Sep 02 15:46:12 kafka3 kernel: sd 0:0:0:0: Power-on or device reset occurred

I'll run some fio commands to try and get the disk to croak again.

olasd changed the task status from Open to Work in Progress. Sep 4 2020, 2:53 PM
olasd claimed this task.

I've re-added the disk to the zpool: sudo zpool replace data wwn-0x50000399f8982a7d wwn-0x50000399f8982a7d

I've then run fio on all machines to load test the disks:

fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=1M --numjobs=8 --size=1T --runtime=0 --group_reporting

This creates eight 1 TB files filled with random data, then reads them sequentially.

After this operation, the disk hasn't reported any issue whatsoever.

smartctl output as follows:

root@kafka3:~# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.7.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG06SCA800EY
Revision:             EH08
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000399f8982a7d
Serial number:        20Q0A07LF1GF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Sep  4 12:50:59 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        65 C

Manufactured in week 09 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  18
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1634      1635      1634       4897      64233.876           1
write:         0       12        12        12        204       4111.787           0
verify:        0        0         0         0          0          5.897           0

Non-medium error count:       41

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     865                 - [-   -    -]
# 2  Foreground long   Aborted (device reset ?)    -     853                 - [-   -    -]
# 3  Background long   Aborted (by user command)   -     853                 - [-   -    -]
# 4  Background short  Completed                   -     847                 - [-   -    -]
# 5  Reserved(7)       Completed                  64       4                 - [-   -    -]
# 6  Background short  Completed                   -       3                 - [-   -    -]

Long (extended) Self Test duration: 45122 seconds [752.0 minutes]

Looks like the number of corrected read errors has increased a little (50 more, from 1584 to 1634, with 150 more correction algorithm invocations).
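To compare the two error counter logs without eyeballing the columns, the rows can be parsed and subtracted. This is a rough sketch: the column labels in `COLUMNS` are my own shorthand for the smartctl header, and the two text blocks are copied from the Sep 3 and Sep 4 outputs above.

```python
# Shorthand labels for the seven numeric columns of the smartctl
# "Error counter log" table (names are illustrative, not smartctl's).
COLUMNS = ["ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
           "algorithm_invocations", "gigabytes_processed", "total_uncorrected"]

BEFORE = """\
read:          0     1584      1585      1584       4747      60556.729           1
write:         0       12        12        12        204        238.624           0
verify:        0        0         0         0          0          5.897           0
"""

AFTER = """\
read:          0     1634      1635      1634       4897      64233.876           1
write:         0       12        12        12        204       4111.787           0
verify:        0        0         0         0          0          5.897           0
"""


def parse_counters(text):
    """Map row name -> list of the seven numeric columns."""
    rows = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == len(COLUMNS):
            rows[name.strip()] = [float(f) for f in fields]
    return rows


def counter_deltas(before, after):
    """Return, per row, only the columns whose value changed."""
    old, new = parse_counters(before), parse_counters(after)
    return {
        row: {col: round(b - a, 3)
              for col, a, b in zip(COLUMNS, old[row], new[row]) if b != a}
        for row in new
    }


deltas = counter_deltas(BEFORE, AFTER)
for row, changed in deltas.items():
    print(row, changed)
```

For these two snapshots the read row gains 50 corrected errors and 150 correction algorithm invocations, the uncorrected count stays at 1, and the write and verify rows change only in gigabytes processed.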

I've now started a zpool scrub:

scan: scrub in progress since Fri Sep  4 11:58:40 2020
      8.00T scanned at 2.53G/s, 2.41T issued at 781M/s, 8.00T total
      0B repaired, 30.13% done, 0 days 02:05:01 to go

looks good so far...
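The time-to-go figure in the scrub status can be sanity-checked from the other numbers on the same lines. The sketch below assumes the K/M/G/T suffixes are powers of 1024 and that the estimate is simply the remaining bytes divided by the issue rate; the displayed values are rounded, so only an approximate match is expected.

```python
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value):
    """Convert a zpool-style size such as '2.41T' to bytes (binary units assumed)."""
    return float(value[:-1]) * UNITS[value[-1]]


def remaining_seconds(issued, total, rate_per_s):
    """Estimate seconds left as (total - issued) / issue rate."""
    return (to_bytes(total) - to_bytes(issued)) / to_bytes(rate_per_s)


eta = remaining_seconds("2.41T", "8.00T", "781M")
done = 100 * to_bytes("2.41T") / to_bytes("8.00T")
print(f"{eta / 3600:.2f} hours left, {done:.1f}% done")
```

This gives roughly two hours and five minutes, in line with the `0 days 02:05:01 to go` shown above.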

I think we're okay to start prod workloads on the machine, and to keep an eye on whether the disk really is bad.

Well, the disk really is bad: it got kicked out of the pool again last night.

Ticket submitted to Dell (# 1035505066) with the following attachment:

Dell diagnostics package sent to Dell customer services.

Trying to arrange the shipment of a replacement disk now.

Enabled the disk location blinkenlichten:

olasd@kafka3:~$ sudo megacli -Pdlocate -start -physdrv[32:0] -a0
                                     
Adapter: 0: Device at EnclId-32 SlotId-0  -- PD Locate Start Command was successfully sent to Firmware 

Exit Code: 0x00

Handled the replacement drive (coordinating with Christophe at DSI-SP)

Stopped the blinkenlights:

olasd@kafka3:~$ sudo megacli -Pdlocate -stop -physdrv[32:0] -a0
                                     
Adapter: 0: Device at EnclId-32 SlotId-0  -- PD Locate Stop Command was successfully sent to Firmware 

Exit Code: 0x00

Made the disk JBOD:

olasd@kafka3:~$ sudo megacli -Pdmakejbod -physdrv[32:0] -a0
                                     
Adapter: 0: EnclId-32 SlotId-0 state changed to JBOD.

Exit Code: 0x00
olasd@kafka3:~$ journalctl -kf
-- Logs begin at Mon 2020-09-07 20:17:01 UTC. --
Sep 10 12:22:57 kafka3 kernel: megaraid_sas 0000:18:00.0: scanning for scsi0...
Sep 10 12:22:57 kafka3 kernel: scsi 0:0:0:0: Direct-Access     TOSHIBA  MG06SCA800EY     EH07 PQ: 0 ANSI: 6
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] Disabling DIF Type 2 protection
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] 4096-byte physical blocks
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] Write Protect is off
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] Mode Sense: d3 00 10 08
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Sep 10 12:22:57 kafka3 kernel: sd 0:0:0:0: [sda] Attached SCSI disk
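The block count the kernel reports for the replacement disk can be checked against the capacity smartctl printed for the old one. A small worked example, using only numbers from the log lines above:

```python
blocks = 15628053168        # "512-byte logical blocks" from the kernel log
size = blocks * 512         # capacity in bytes

print(size)                          # 8001563222016
print(f"{size / 10**12:.2f} TB")     # decimal terabytes
print(f"{size / 2**40:.2f} TiB")     # binary tebibytes
```

The byte count equals the `User Capacity` of 8,001,563,222,016 bytes reported for the failed disk, so the replacement has the same size, and the two rounded figures match the `8.00 TB/7.28 TiB` in the kernel message.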

Replaced the disk in the ZFS pool, after finding the new disk's WWN in /dev/disk/by-id:

olasd@kafka3:~$ sudo zpool replace data wwn-0x50000399f8982a7d wwn-0x5000039a483b7b41
olasd@kafka3:~$ sudo zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Sep 10 12:28:47 2020
	8.00T scanned at 1.14T/s, 5.98T issued at 875G/s, 8.00T total
	16.7M resilvered, 74.80% done, 0 days 00:00:02 to go
config:

	NAME                          STATE     READ WRITE CKSUM
	data                          DEGRADED     0     0     0
	  mirror-0                    DEGRADED     0     0     0
	    replacing-0               DEGRADED     0     0     0
	      wwn-0x50000399f8982a7d  FAULTED      0    31     0  too many errors
	      wwn-0x5000039a483b7b41  ONLINE       0     0     0  (resilvering)
	    wwn-0x50000399f898ec5d    ONLINE       0     0     0
	  mirror-1                    ONLINE       0     0     0
	    wwn-0x50000399f8982d29    ONLINE       0     0     0
	    wwn-0x50000399f898ec2d    ONLINE       0     0     0
	  mirror-2                    ONLINE       0     0     0
	    wwn-0x50000399f898ec45    ONLINE       0     0     0
	    wwn-0x50000399f8982b61    ONLINE       0     0     0
	  mirror-3                    ONLINE       0     0     0
	    wwn-0x50000399f898ec4d    ONLINE       0     0     0
	    wwn-0x50000399f898ec41    ONLINE       0     0     0
	cache
	  wwn-0x58ce38ee20d2d135      ONLINE       0     0     0
	  wwn-0x58ce38ee20d2d10d      ONLINE       0     0     0

errors: No known data errors
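For keeping an eye on a pool during a resilver, the config table of `zpool status` can be filtered down to the rows that need attention. This is only a sketch: `unhealthy_devices` is a hypothetical helper, it relies on the column layout shown above, and the sample text is an excerpt of that output.

```python
def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) for config rows that are
    not ONLINE or whose error counters are not all "0"."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True      # header row of the config table
            continue
        if line.startswith("errors:"):
            in_config = False     # the table ends before the errors line
        if not in_config or len(fields) < 5:
            continue              # skips blank lines and "cache"-style labels
        name, state, read, write, cksum = fields[:5]
        # Counters are compared as strings so abbreviated values still count.
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, state, read, write, cksum))
    return rows


SAMPLE = """\
config:

    NAME                          STATE     READ WRITE CKSUM
    data                          DEGRADED     0     0     0
      mirror-0                    DEGRADED     0     0     0
        replacing-0               DEGRADED     0     0     0
          wwn-0x50000399f8982a7d  FAULTED      0    31     0  too many errors
          wwn-0x5000039a483b7b41  ONLINE       0     0     0  (resilvering)
        wwn-0x50000399f898ec5d    ONLINE       0     0     0
    cache
      wwn-0x58ce38ee20d2d135      ONLINE       0     0     0

errors: No known data errors
"""

for row in unhealthy_devices(SAMPLE):
    print(*row)
```

On this excerpt it lists the pool, `mirror-0`, `replacing-0` and the faulted disk with its 31 write errors; once the resilver has finished and everything is ONLINE with zero counters, the list is empty.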

All fine and dandy now:

olasd@kafka3:~$ sudo zpool status
  pool: data
 state: ONLINE
  scan: resilvered 2.02T in 0 days 03:23:20 with 0 errors on Thu Sep 10 19:15:30 2020
config:

	NAME                        STATE     READ WRITE CKSUM
	data                        ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x5000039a483b7b41  ONLINE       0     0     0
	    wwn-0x50000399f898ec5d  ONLINE       0     0     0
	  mirror-1                  ONLINE       0     0     0
	    wwn-0x50000399f8982d29  ONLINE       0     0     0
	    wwn-0x50000399f898ec2d  ONLINE       0     0     0
	  mirror-2                  ONLINE       0     0     0
	    wwn-0x50000399f898ec45  ONLINE       0     0     0
	    wwn-0x50000399f8982b61  ONLINE       0     0     0
	  mirror-3                  ONLINE       0     0     0
	    wwn-0x50000399f898ec4d  ONLINE       0     0     0
	    wwn-0x50000399f898ec41  ONLINE       0     0     0
	cache
	  wwn-0x58ce38ee20d2d135    ONLINE       0     0     0
	  wwn-0x58ce38ee20d2d10d    ONLINE       0     0     0

errors: No known data errors