I inherited a ZFS box that is having a lot of problems. Checking the pool status, I see several drives with issues:
ganymede $ zpool status -x
  pool: dpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 15 00:51:49 2024
        88.1M scanned out of 36.2T at 6.77M/s, (scan is slow, no estimated time)
        25.3M resilvered, 0.00% done
config:

        NAME                                      STATE     READ WRITE CKSUM
        dpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            12151399272057691850                  UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST8000NM0055-1RM112_ZA11E6HJ-part1
            ata-ST8000NM0055-1RM112_ZA158JRW      ONLINE       0     0     0
          mirror-1                                DEGRADED     0     0     0
            ata-ST8000NM0055-1RM112_ZA15FG7E      ONLINE       0     0     0  (resilvering)
            ata-ST8000NM0055-1RM112_ZA15FGCM      DEGRADED    22     0    12  too many errors
          mirror-2                                ONLINE       0     0     0
            ata-ST8000NM0055-1RM112_ZA164M9J      ONLINE       0     0     0  (resilvering)
            ata-ST8000NM0055-1RM112_ZA164QKP      ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5J1K05JFE6C  ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5J9K004FE6C  ONLINE       0     0     0
          mirror-4                                ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5J9K005FE6C  ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5LEK019FE6C  ONLINE       0     0     0
          mirror-5                                ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5J9K007FE6C  ONLINE       0     0     0
            ata-TOSHIBA_MC04ACA600A_X5JFK001FE6C  ONLINE       0     0     0

errors: No known data errors
I am trying to pull the data off this system (back it up to S3) before replacing the disks. However, the drive in mirror-1 (ata-ST8000NM0055-1RM112_ZA15FGCM) is having issues, and I believe it is slowing down all I/O on the pool: if I let the resilver run, throughput dwindles to KB/s, and a week later it is still going.
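For reference, I have been watching per-device throughput during the resilver with something along these lines (the 5-second interval is arbitrary):

# per-vdev I/O statistics for the pool, refreshed every 5 seconds
zpool iostat -v dpool 5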
Looking at dmesg output I see a ton of these errors:
[  464.866611] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[  464.866635] sd 1:0:27:0: [sdaa] tag#0 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[  464.866637] sd 1:0:27:0: [sdaa] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  464.866653] sd 1:0:27:0: [sdaa] tag#2 Sense Key : Medium Error [current] [descriptor]
[  464.866658] sd 1:0:27:0: [sdaa] tag#0 CDB: Read(16) 88 00 00 00 00 02 78 25 d7 38 00 00 00 08 00 00
[  464.866666] sd 1:0:27:0: [sdaa] tag#2 Add. Sense: Unrecovered read error
[  464.866670] print_req_error: I/O error, dev sdaa, sector 10605680440
[  464.866677] sd 1:0:27:0: [sdaa] tag#2 CDB: Read(16) 88 00 00 00 00 02 78 25 d5 68 00 00 00 f0 00 00
[  464.866767] print_req_error: critical medium error, dev sdaa, sector 10605680096
Considering that every mirror in the pool still has at least one good drive, is there a way I can simply remove the disk that is causing the issues (I don't have physical access to the machine) so that I can get the data off the server?
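What I had in mind was something along the lines of the following, though I'm not sure whether offlining or detaching is the right move here, or whether either is safe while a resilver is in progress:

# stop ZFS from issuing I/O to the flaky disk, but keep it in the pool config
zpool offline dpool ata-ST8000NM0055-1RM112_ZA15FGCM

# or remove it from the mirror entirely
zpool detach dpool ata-ST8000NM0055-1RM112_ZA15FGCM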
I tried disabling the disk at the kernel level:
sync
echo 1 > /sys/block/sdaa/device/delete
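(My understanding is that if I later need the device back without physical access, a SCSI host rescan should rediscover it, roughly like the following; host1 is a guess based on the 1:0:27:0 address in the dmesg output above:)

# ask the SCSI host to rescan all channels/targets/LUNs
echo "- - -" > /sys/class/scsi_host/host1/scan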
But access to the data on ZFS remained extremely slow (e.g., 10 minutes to copy a 93 MB file to AWS S3 using awscli).
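For reference, the copy itself is nothing fancy, roughly this (the bucket name and file path are placeholders):

# plain single-file upload with awscli
aws s3 cp /dpool/some/dataset/file.bin s3://my-backup-bucket/file.bin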
Just trying to figure out the best path forward when a system is in this state.