New ZFS pool degraded, need advice

SirMaster
2[H]4U
Joined: Nov 8, 2010
Messages: 2,168
Hey guys, I recently finished setting up my new ZFS pool and have a question about the best course of action to take.

Here is my pool output.

Code:
root@nick-server:~# zpool status
  pool: nickarray
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub canceled on Mon Oct 14 18:08:08 2013
config:

        NAME                                         STATE     READ WRITE CKSUM
        nickarray                                    DEGRADED     0     0     0
          raidz2-0                                   DEGRADED     0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317581       UNAVAIL      4     1     0  corrupted data
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317592       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202646       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202647       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJX0B900055       ONLINE       0     0     0
            ata-WDC_WD20EVDS-63T3B0_WD-WCAVY6715008  ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F299TA          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F29PE7          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F10V30          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2LWDL          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2PL9K          ONLINE       0     0     0
            ata-ST3000DM001-9YN166_W1F0SQF6          ONLINE       0     0     0

errors: No known data errors

Now the first thing that seems strange to me is that the UNAVAIL disk currently appears to be working: Linux can detect it and read its SMART status, etc.
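
For example, I can still pull SMART data from it with something like this (assuming sdl is still the device node it's sitting at):

Code:
root@nick-server:~# smartctl -a /dev/sdl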

In dmesg, though, I can see that the disk seems to have disconnected during a write. sdl is the disk that is UNAVAIL.

Code:
[47114.259711] sd 0:0:5:0: [sdl] Synchronizing SCSI cache
[47114.259771] sd 0:0:5:0: [sdl]
[47114.259772] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[47114.259872] mpt2sas0: removing handle(0x000e), sas_addr(0x4433221107000000)
[52393.618422] scsi 0:0:11:0: Direct-Access     ATA      SAMSUNG HD203WI  0002 PQ: 0 ANSI: 6
[52393.618430] scsi 0:0:11:0: SATA: handle(0x000e), sas_addr(0x4433221107000000), phy(7), device_name(0x0000000000000000)
[52393.618432] scsi 0:0:11:0: SATA: enclosure_logical_id(0x500304801183c700), slot(4)
[52393.618577] scsi 0:0:11:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[52393.618581] scsi 0:0:11:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
[52393.618791] sd 0:0:11:0: Attached scsi generic sg11 type 0
[52393.624699] sd 0:0:11:0: [sdl] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[52393.922016] sd 0:0:11:0: [sdl] Write Protect is off
[52393.922019] sd 0:0:11:0: [sdl] Mode Sense: 7f 00 10 08
[52393.933867] sd 0:0:11:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
[52394.301663]  sdl: sdl1 sdl9
[52394.658032] sd 0:0:11:0: [sdl] Attached SCSI disk
[52505.851877] sd 0:0:11:0: [sdl] Synchronizing SCSI cache
[52505.852129] sd 0:0:11:0: [sdl]
[52505.852138] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[52505.852709] mpt2sas0: removing handle(0x000e), sas_addr(0x4433221107000000)
[52932.819529] sd 0:0:10:0: [sdo] Synchronizing SCSI cache
[52932.819749] sd 0:0:10:0: [sdo]
[52932.819757] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[52932.820298] mpt2sas0: removing handle(0x0010), sas_addr(0x4433221100000000)
[57924.699312] scsi 0:0:12:0: Direct-Access     ATA      SAMSUNG HD203WI  0002 PQ: 0 ANSI: 6
[57924.699320] scsi 0:0:12:0: SATA: handle(0x000e), sas_addr(0x4433221107000000), phy(7), device_name(0x0000000000000000)
[57924.699321] scsi 0:0:12:0: SATA: enclosure_logical_id(0x500304801183c700), slot(4)
[57924.699471] scsi 0:0:12:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[57924.699474] scsi 0:0:12:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
[57924.699633] sd 0:0:12:0: Attached scsi generic sg11 type 0
[57924.705544] sd 0:0:12:0: [sdl] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[57925.002807] sd 0:0:12:0: [sdl] Write Protect is off
[57925.002816] sd 0:0:12:0: [sdl] Mode Sense: 7f 00 10 08
[57925.014669] sd 0:0:12:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
[57925.382416]  sdl: sdl1 sdl9
[57925.738668] sd 0:0:12:0: [sdl] Attached SCSI disk

I really think this is probably just a cable issue or something. I have replaced the cable.

What's the best course of action to correct this?

Since there are READ and WRITE errors on my pool, do I need to do a zpool replace?

I'm assuming I can't just do a zpool online because the disk is out of sync due to the errors, right?

Anyway, I really think this disk is fine. What's the proper course of action to try using it again?

Thanks.
 
Well, either the path to it or the disk itself clearly isn't 'fine'. None of your other disks are doing this, so it's pretty obvious something is or was wrong. :)

I'd start with a 'zpool clear <pool> <dev>' or just a 'zpool clear <pool>'. If the disk is again operating properly, that's going to kick off a resilver (if you're lucky, a quick resilver, but unless it's only been a short time since it disappeared, it is more likely to be a complete resilver). If the disk is still having issues, it may end up back in a bad state, though.

Since you've got raidz2 on that vdev, you should be fairly safe trying zpool clear, zpool online, and/or zpool replace without being overly cavalier about it, since you'd still have to lose two more disks to experience data loss.
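
Roughly something like this, using your pool name and the UNAVAIL disk's id (substitute a real replacement disk id in the last step if it comes to that):

Code:
# clear the errors and let ZFS resilver the disk back in
zpool clear nickarray ata-SAMSUNG_HD203WI_S1UYJ1CZ317581

# or explicitly bring it back online if it still shows as offline/unavail
zpool online nickarray ata-SAMSUNG_HD203WI_S1UYJ1CZ317581

# watch the resilver progress
zpool status nickarray

# last resort: swap in a replacement disk
zpool replace nickarray ata-SAMSUNG_HD203WI_S1UYJ1CZ317581 <new-disk-id>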
 
Thanks. I wanted to give this disk a second chance with a new cable, but if it fails again I will of course replace it.

I cannot do this until tonight after work though, so I won't have any updates until then.
 
So I underestimated ZFS haha.

All I did when I got home was replace the cable for that disk and reboot the server.

When I went to check the zpool status, this is what I got:

Code:
root@nick-server:~# zpool status
  pool: nickarray
 state: ONLINE
  scan: resilvered 4.41M in 0h0m with 0 errors on Wed Oct 16 17:39:30 2013
config:

        NAME                                         STATE     READ WRITE CKSUM
        nickarray                                    ONLINE       0     0     0
          raidz2-0                                   ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317581       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1CZ317592       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202646       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJ1LZ202647       ONLINE       0     0     0
            ata-SAMSUNG_HD203WI_S1UYJX0B900055       ONLINE       0     0     0
            ata-WDC_WD20EVDS-63T3B0_WD-WCAVY6715008  ONLINE       0     0     0
          raidz2-1                                   ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F299TA          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F29PE7          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F10V30          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2LWDL          ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F2PL9K          ONLINE       0     0     0
            ata-ST3000DM001-9YN166_W1F0SQF6          ONLINE       0     0     0

errors: No known data errors

So it looks like ZFS was able to reattach the drive and perform a quick resilver automatically.

I will still run a scrub and keep an eye on the drive, though.
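
For my own reference, that's just something like:

Code:
root@nick-server:~# zpool scrub nickarray
root@nick-server:~# zpool status nickarray

with zpool status showing the scrub progress and any new errors afterwards.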
 
Cool. How many other storage solutions would have caught problems with cables? They would probably have continued to read/write erroneous data.
 
Yeah, I still don't know for sure if it was a bad cable or if it's something worse, like the controller on the drive.

I do know that the drive dropped from the system at least once on its own, and hardware RAID and mdadm should handle that as well.

mdadm, as far as I know, would automatically trigger a full disk rebuild once it saw the disk reconnected to the system successfully.
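
For comparison, the mdadm version of this would be something like the following done by hand (the md device and partition names here are just hypothetical examples):

Code:
# hypothetical mdadm equivalent: re-add the dropped member and watch the rebuild
mdadm /dev/md0 --re-add /dev/sdl1
cat /proc/mdstat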
 