
ZFS failures detected on belvedere
Closed, Migrated

Description

Several faults were detected on the ZFS pool on belvedere during the night (2021-03-03) (P965).

This is the current pool status:

root@belvedere:~# zpool status -v data
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 03:14:42 with 33 errors on Wed Mar  3 03:38:44 2021
remove: Removal of vdev 3 copied 1.23T in 5h35m, completed on Tue Dec 15 19:28:52 2020
    88.3M memory used for removed device mappings
config:

	NAME                        STATE     READ WRITE CKSUM
	data                        DEGRADED     0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x50000397bc8969c1  ONLINE       0     0     0
	    wwn-0x50000396dc8a2191  ONLINE       0     0     0
	  mirror-1                  DEGRADED     0     0     0
	    wwn-0x500003965c898904  DEGRADED     0     0    12  too many errors
	    wwn-0x500003965c898910  DEGRADED     0     0    12  too many errors
	  mirror-5                  DEGRADED     0     0     0
	    wwn-0x50000396dc8a2161  DEGRADED     0     0    46  too many errors
	    wwn-0x500003972c8a44d1  DEGRADED     0     0    46  too many errors
	  mirror-6                  ONLINE       0     0     0
	    wwn-0x50000396dc8a2185  ONLINE       0     0     8
	    wwn-0x50000397bc8969b1  ONLINE       0     0     8
	  mirror-7                  ONLINE       0     0     0
	    wwn-0x5002538c409a8258  ONLINE       0     0     0
	    wwn-0x5002538c40ba6713  ONLINE       0     0     0
	  mirror-8                  ONLINE       0     0     0
	    wwn-0x5002538c409a825a  ONLINE       0     0     0
	    wwn-0x5002538c40ba66ab  ONLINE       0     0     0
	  mirror-9                  ONLINE       0     0     0
	    wwn-0x5002538c409a8252  ONLINE       0     0     0
	    wwn-0x5002538c409a8256  ONLINE       0     0     0
	  mirror-10                 ONLINE       0     0     0
	    wwn-0x5002538c40ba6716  ONLINE       0     0     0
	    wwn-0x5002538c409a8253  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.317
        /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.180
        /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.110
        /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.316

(more information on a previous identical issue in T2892)

Event Timeline

vsellier renamed this task from ZFS failure detected on belvedere to ZFS failures detected on belvedere. Mar 3 2021, 11:01 AM
vsellier changed the task status from Open to Work in Progress.
vsellier triaged this task as High priority.
vsellier created this task.
vsellier updated the task description. (Show Details)

It seems there have been recurring alerts in the system journal about several disks for some time:

Feb 24 01:33:36 belvedere.internal.softwareheritage.org kernel: sd 0:0:14:0: [sdi] tag#808 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:36 belvedere.internal.softwareheritage.org kernel: sd 0:0:14:0: [sdi] tag#808 Add. Sense: Defect list not found
Feb 24 01:33:39 belvedere.internal.softwareheritage.org kernel: sd 0:0:16:0: [sdk] tag#650 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:39 belvedere.internal.softwareheritage.org kernel: sd 0:0:16:0: [sdk] tag#650 Add. Sense: Defect list not found
Feb 24 01:33:41 belvedere.internal.softwareheritage.org kernel: sd 0:0:17:0: [sdl] tag#669 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:41 belvedere.internal.softwareheritage.org kernel: sd 0:0:17:0: [sdl] tag#669 Add. Sense: Defect list not found
Feb 24 01:33:43 belvedere.internal.softwareheritage.org kernel: sd 0:0:18:0: [sdm] tag#682 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:43 belvedere.internal.softwareheritage.org kernel: sd 0:0:18:0: [sdm] tag#682 Add. Sense: Defect list not found
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:21:0: [sdo] tag#668 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:21:0: [sdo] tag#668 Add. Sense: Defect list not found
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:22:0: [sdp] tag#682 Sense Key : Recovered Error [current] [descriptor] 
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:22:0: [sdp] tag#682 Add. Sense: Defect list not found
...
root@belvedere:/var/log# journalctl -k --since=yesterday | awk '{print $8}' | sort | uniq -c
    274 [sdi]
    274 [sdk]
    274 [sdl]
    274 [sdm]
    274 [sdo]
    274 [sdp]
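
To confirm which pool members these kernel names correspond to, the wwn-* ids from the zpool config can be cross-checked against the kernel device names, for instance (a quick sketch; the exact symlink layout depends on the udev rules in place):

# Match the wwn-* symlinks against the kernel names seen in the journal
ls -l /dev/disk/by-id/ | grep -E 'wwn-.*sd(i|k|l|m|o|p)$'

# Alternative: WWN and serial number per block device
lsblk -o NAME,WWN,SERIAL,MODEL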

There are no errors in the PostgreSQL logs related to the files listed in the zpool status, but I'm not sure the indexer database is actually being read.
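
One way to check whether the indexer database gets read at all is the cumulative statistics view; a minimal sketch, assuming local access as the postgres superuser (the database name is the one visible in the psql session further down):

# Read/fetch counters for the indexer database (cumulative since the last stats reset)
sudo -u postgres psql -c "
  SELECT datname, blks_read, blks_hit, tup_returned, tup_fetched, stats_reset
    FROM pg_stat_database
   WHERE datname = 'softwareheritage-indexer';"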

(not directly related to the issue) Looking at potential disk I/O issues, I discovered a weird behavior change in the I/O on belvedere after 2020-12-31:

Not sure what is responsible for that.

Link to the grafana dashboard: https://grafana.softwareheritage.org/goto/ZPZqaqyMk

No problems are detected on the iDRAC, and smartctl on the drives looks OK.
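
For reference, the SMART check on the drives flagged in the journal can be scripted along these lines (a sketch; the device list is taken from the journal excerpt above):

# Health status and error counter log for each drive reporting "Defect list not found"
for dev in sdi sdk sdl sdm sdo sdp; do
    echo "=== /dev/$dev ==="
    smartctl -H -l error "/dev/$dev"
done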

I'm not sure of the next step to fix the problem; we will check with @olasd tomorrow.

For the record, it seems the 4 impacted files are related to the primary key of the softwareheritage-indexer.content_mimetype table:

softwareheritage-indexer=> select oid, relname from pg_class where oid in (774467031);
    oid    |        relname        
-----------+-----------------------
 774467031 | content_mimetype_pkey
(1 row)
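
Strictly speaking, the on-disk file name is the relation's relfilenode rather than its oid (they usually coincide, so the query above will normally find the right relation). A sketch to double-check the mapping both ways, assuming local access as the postgres superuser:

# Map the filenode back to a relation; must be run while connected to the database
# that owns base/16406 (the indexer database); 0 means its default tablespace.
sudo -u postgres psql -d softwareheritage-indexer -c \
  "SELECT pg_filenode_relation(0, 774467031);"

# And the reverse: where does the index live on disk?
sudo -u postgres psql -d softwareheritage-indexer -c \
  "SELECT pg_relation_filepath('content_mimetype_pkey');"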

Isn't this I/O change around when we restarted production after expanding the storage pool?

The loaders were restarted in late November, but perhaps more of them were launched at that moment.

In T2892, to solve the problem with the primary key's files, the indexes were recreated. @olasd: is that enough to solve the issue?

If I summarize what I have understood, I think these actions have to be executed:

  • Stop the content_mimetype indexers
  • drop and recreate the primary key on the content_mimetype table:
alter table content_mimetype drop CONSTRAINT content_mimetype_pkey;

-- concurrently or not?
create unique index concurrently content_mimetype_pkey on content_mimetype using btree (id, indexer_configuration_id);

alter table content_mimetype add constraint content_mimetype_pkey primary key using index content_mimetype_pkey;
  • restart the mimetype indexers
  • launch a new zpool scrub data
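
For the last step, that would boil down to the following (sketch; zpool clear only if we also want to reset the error counters before re-scrubbing):

# Optionally reset the error counters, then start a new scrub and watch the result
zpool clear data
zpool scrub data
zpool status -v data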

I don't think anything needs doing. The recreation of indexes in T2892 had to be done because the data was *actually* corrupted because of a disk failure. The current issue looks like the recurrent zpool scrub race condition we've been noticing after that issue was resolved.

You should check that the files are readable. If they are, then they're not actually corrupted.

I think one thing that would be worth doing would be upgrading to the most recent backport of zfs, rebooting on the new module, and running scrub again on that new version, to see if what looks like a race condition under high load is fixed...
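
For the upgrade part, something like the following should be enough (sketch; the backports suite name and package set are assumptions, to be adapted to the host's actual apt configuration):

# Pull the newer zfs packages from backports, then reboot on the new module
apt update
apt install -t buster-backports zfs-dkms zfsutils-linux

# After the reboot, check that userland and kernel module agree on the version
zfs version
cat /sys/module/zfs/version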

The 4 files seem to be accessible without errors, which looks like good news ;):

root@belvedere:~# time cp /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.317 /dev/null

real	0m6.343s
user	0m0.017s
sys	0m1.934s
root@belvedere:~# time cp /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.180 /dev/null

real	0m5.486s
user	0m0.012s
sys	0m1.872s
root@belvedere:~# time cp /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.110 /dev/null

real	0m4.909s
user	0m0.004s
sys	0m1.872s
root@belvedere:~# time cp /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.316 /dev/null

real	0m4.508s
user	0m0.017s
sys	0m1.871s
vsellier claimed this task.

The ZFS status was reset and a scrub restarted after the upgrade of the zfs packages.
No more errors are detected.

If such errors come back, @olasd suggests flashing/configuring(?) the RAID controller to force it to completely bypass the controller cache.
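
If it comes to that, a first step could be to inspect the current cache policy on the controller, e.g. with perccli/storcli (sketch; the tool name, controller index and target policy are assumptions, not verified on belvedere):

# Show the current cache settings of all virtual drives on controller 0
perccli64 /c0/vall show all | grep -i cache

# Possible way to force write-through, no read-ahead and direct I/O on all virtual drives
perccli64 /c0/vall set wrcache=wt rdcache=nora iopolicy=direct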