ZFS Pool Demolished, Hard Drives or Controller?

jonnyjl

So I have a ZFS (OpenIndiana/ESXi all-in-one) pool with two 8x2TB raidz2 vdevs. The vdev in question hangs off an LSI 1068-based controller (I think running FW 1.30) whose cables go into two of the "backplanes" on a Norco 4220, and it's running all Hitachi Deskstar 7K2000 drives.

I had two disks fault yesterday in one vdev (connected to one controller), a few hours apart. Spares kicked in and a resilver was running with no errors detected (none in iostat either), but this morning the vdev is producing tons of unrecoverable data errors and all the disks in the vdev are showing DEGRADED (I don't remember that from yesterday... the disks should still say ONLINE, right, with only the vdev degraded?).

I scrub every 1st and 15th, so I'm skeptical that the data is actually bad. I'm kind of screwed (just a home system... but damn, tons of data) if I can't recover this data, but that's what you risk with no real backups.

I'm going to pick up a couple of SAS2008 (M1015) controllers to try out, but I'm kind of bummed today. I may try flashing the controller with newer firmware and booting back up (I shut the VM down... sigh).
 
This does not sound normal. What are the error messages? Have you done anything weird lately?

As a last resort, you can try to import the zpool with the -F option, but try that last.
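From memory (so double-check the zpool man page on your build), the recovery import would look roughly like this; the pool name is a placeholder, and -n only does a dry run:

Code:
# dry run: report whether a rewind/recovery import would succeed, without doing it
zpool import -F -n yourpool
# actual recovery import; may discard the last few seconds of writes
zpool import -F yourpool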
 
This does not sound normal. What are the error messages? Have you done anything weird lately?

As a last resort, you can try to import the zpool with the -F option, but try that last.
Just read/write errors in zpool status. I'm trying to remember what errors showed in iostat; I think hard errors.
 
So, how is it going?
I think I'm going to wait for the SAS2008 controllers. I haven't bothered powering the VM back on. I got two off eBay; would it be better (or at least not worse) to split the vdev across the two controllers? I suppose it could help isolate issues... My board has like seven PCIe x8 slots (Supermicro X8DTH-6F), so I'm not too worried about running out.

I don't know, I'm a little scared, but I guess it can't really get any worse lol.

I think I have a spare 1068 controller; maybe I'll try that :\
 
This does not sound normal. What are the error messages? Have you done anything weird lately?

As a last resort, you can try to import the zpool with the -F option, but try that last.
By the way, if I were going to try the import as a last resort, I would have to export it first, no? Since the system still sees it? Or can I import it in place?
 
I believe you can force import. Or do 'zpool import' with no arguments, and import the numeric ID it shows you.
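Roughly like this, if memory serves (pool name and ID are just examples):

Code:
# with no arguments, lists exported/visible pools along with their numeric IDs
zpool import
# import by numeric ID; add -f if it complains the pool may be in use elsewhere
zpool import -f 1234567890123456789 yourpool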
 
You can mount the zpool as read only, I think.
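If your build is new enough to support it, that would be something like this (pool name is a placeholder; untested):

Code:
# detach the pool cleanly first, then bring it back read-only
zpool export yourpool
zpool import -o readonly=on yourpool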

I think it is good that you are trying another controller. It is not normal to get massive errors on several disks at the same time; the probability that several disks crash simultaneously is very small. It sounds as though there is a problem with the controller, the RAM, or somewhere else. Also, run a RAM check. The problem is not in the disks, I suspect.

As a last resort, you can roll back in time to when the zpool was error-free. But in that case, you will lose the most recent writes.

Keep us informed.
 
You can mount the zpool as read only, I think.

I think it is good that you are trying another controller. It is not normal to get massive errors on several disks at the same time; the probability that several disks crash simultaneously is very small. It sounds as though there is a problem with the controller, the RAM, or somewhere else. Also, run a RAM check. The problem is not in the disks, I suspect.

As a last resort, you can roll back in time to when the zpool was error-free. But in that case, you will lose the most recent writes.

Keep us informed.
Good idea; I should do a memory check too. Memtest should be sufficient, no?

Meh, hopefully it's not the CPU.
 
Starting memtest now. Odd, it's showing ECC is off, but that may be a bug. I have both scrub types set to on :\ I remember checking in OpenSolaris to make sure it was recognizing ECC too, so I'm going to assume... bug.
 
Four passes in and no errors. I'll let it go for a couple more, then shut it down.

I'm hoping the dude from eBay ships soon.
 
So I got a couple of M1015s, flashed one of them, and replaced the controller for the vdev that crapped out.

Here's the output. Anyone want to comment? The device IDs are not "accurate", since they're showing the WWNs (...um, how do I map those to physical slots?), but you get the idea.

The error count isn't rising. Even if the resilver fixes everything, that count doesn't go away until you run zpool clear, right? (See the note after the status output.)

Code:
  pool: svpool2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Sep 29 07:52:42 2011
    2.88T scanned out of 22.0T at 296M/s, 18h47m to go
    440G resilvered, 13.10% done
config:

        NAME           STATE     READ WRITE CKSUM
        svpool2        DEGRADED     0     0     0
          raidz2-0     DEGRADED     0     0     0
            c12t0d0    DEGRADED     0     0     0  too many errors
            c10t5d0    DEGRADED     0     0     0  too many errors
            spare-2    DEGRADED     0     0     4
              c12t2d0  FAULTED      0     0     0  too many errors
              c10t7d0  ONLINE       0     0     0  (resilvering)
            c12t3d0    DEGRADED     0     0     0  too many errors
            c12t4d0    DEGRADED     0     0     0  too many errors
            spare-5    DEGRADED     0     0     4
              c12t5d0  FAULTED      0     0     0  too many errors
              c10t6d0  ONLINE       0     0     0  (resilvering)
            c12t6d0    DEGRADED     0     0     0  too many errors
            c12t7d0    DEGRADED     0     0     0  too many errors
          raidz2-1     ONLINE       0     0     0
            c9t4d0     ONLINE       0     0     0
            c9t5d0     ONLINE       0     0     0
            c9t6d0     ONLINE       0     0     0
            c9t7d0     ONLINE       0     0     0
            c10t0d0    ONLINE       0     0     0
            c10t1d0    ONLINE       0     0     0
            c10t2d0    ONLINE       0     0     0
            c10t3d0    ONLINE       0     0     0
        cache
          c10t4d0      ONLINE       0     0     0
        spares
          c10t6d0      INUSE     currently in use
          c10t7d0      INUSE     currently in use

errors: 1012418 data errors, use '-v' for a list
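(Regarding the error-count question above: as far as I know, the READ/WRITE/CKSUM counters and the "too many errors" flags stick around until you clear them explicitly once the resilver is done, along these lines; pool and device names are from the status output above.)

Code:
# reset error counters and DEGRADED/FAULTED flags for the whole pool
zpool clear svpool2
# or just for one device
zpool clear svpool2 c12t2d0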
 
Uhh... so it finished. It's still showing that there are permanent errors. I did a devfsadm -Cv to remove stale links and rebooted.

It's now doing this :| Um, I'm guessing it's checking the other disk(s) that did produce errors on the other controller.

Sigh. Good sign?

PS: I don't think I can ever recommend Hitachi drives. Not that I blame the drives; I blame the fact that I still don't have the replacement for the drive they received over a week ago. I'm only buying Western Digital from now on. Advance replacement is fantastic (yes, I called Hitachi and asked if that was an option).
Code:
  pool: svpool2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Oct  9 09:30:36 2011
    50.1G scanned out of 22.0T at 177M/s, 36h4m to go
    12.4G resilvered, 0.22% done
config:

        NAME           STATE     READ WRITE CKSUM
        svpool2        DEGRADED     0     0     0
          raidz2-0     DEGRADED     0     0     0
            c12t0d0    DEGRADED     0     0     0  too many errors
            c10t5d0    DEGRADED     0     0     0  too many errors
            spare-2    DEGRADED     0     0     0
              c12t2d0  FAULTED      0     0     0  too many errors
              c10t7d0  ONLINE       0     0     0  (resilvering)
            c12t3d0    DEGRADED     0     0     0  too many errors
            c12t4d0    DEGRADED     0     0     0  too many errors
            spare-5    DEGRADED     0     0     0
              c12t5d0  FAULTED      0     0     0  too many errors
              c10t6d0  ONLINE       0     0     0  (resilvering)
            c12t6d0    DEGRADED     0     0     0  too many errors  (resilvering)
            c12t7d0    DEGRADED     0     0     0  too many errors
          raidz2-1     ONLINE       0     0     0
            c9t4d0     ONLINE       0     0     0
            c9t5d0     ONLINE       0     0     0
            c9t6d0     ONLINE       0     0     0
            c9t7d0     ONLINE       0     0     0
            c10t0d0    ONLINE       0     0     0
            c10t1d0    ONLINE       0     0     0
            c10t2d0    ONLINE       0     0     0
            c10t3d0    ONLINE       0     0     0
        cache
          c10t4d0      ONLINE       0     0     0
        spares
          c10t6d0      INUSE     currently in use
          c10t7d0      INUSE     currently in use

errors: No known data errors
 
It's still resilvering after the reboot. It seems usable... but I'm keeping anything that would write to the zpool offline (SMB/NFS... COMSTAR/iSCSI is still enabled, but that was only used by DCs for backups and I'm keeping those powered down for now). Even if everything's fine, I'm still thinking of keeping the zpool out of use until I get my cold spare back, then starting to repair/RMA one disk at a time. Having no spares makes me nervous.

It's now showing cksum errors on the two spare entries. I would have expected them on the drives themselves. No data errors yet, and no other errors on any other disks (I was getting read/write errors at the time I originally posted).

I have to head back to work tomorrow, and I doubt this will be finished before then, but I have my SSH tunnel up. I'm hoping this will be fine; then I'll export and import (if only to "fix" the device IDs). Then maybe upgrade the zpool (it's still on version 22, and I don't see myself going back to OpenSolaris, so I might as well finally do this), then scrub? Or should I scrub before the zpool upgrade?
Code:
  pool: svpool2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Oct  9 09:30:36 2011
    4.10T scanned out of 22.0T at 346M/s, 15h1m to go
    558G resilvered, 18.66% done
config:

        NAME           STATE     READ WRITE CKSUM
        svpool2        DEGRADED     0     0     0
          raidz2-0     DEGRADED     0     0     0
            c12t0d0    DEGRADED     0     0     0  too many errors
            c10t5d0    DEGRADED     0     0     0  too many errors
            spare-2    DEGRADED     0     0 76.3K
              c12t2d0  FAULTED      0     0     0  too many errors
              c10t7d0  ONLINE       0     0     0  (resilvering)
            c12t3d0    DEGRADED     0     0     0  too many errors
            c12t4d0    DEGRADED     0     0     0  too many errors
            spare-5    DEGRADED     0     0 76.3K
              c12t5d0  FAULTED      0     0     0  too many errors
              c10t6d0  ONLINE       0     0     0  (resilvering)
            c12t6d0    DEGRADED     0     0     0  too many errors  (resilvering)
            c12t7d0    DEGRADED     0     0     0  too many errors
          raidz2-1     ONLINE       0     0     0
            c9t4d0     ONLINE       0     0     0
            c9t5d0     ONLINE       0     0     0
            c9t6d0     ONLINE       0     0     0
            c9t7d0     ONLINE       0     0     0
            c10t0d0    ONLINE       0     0     0
            c10t1d0    ONLINE       0     0     0
            c10t2d0    ONLINE       0     0     0
            c10t3d0    ONLINE       0     0     0
        cache
          c10t4d0      ONLINE       0     0     0
        spares
          c10t6d0      INUSE     currently in use
          c10t7d0      INUSE     currently in use

errors: No known data errors
 
You did not mix 512-byte-sector disks with 4K-sector disks, did you? There is another thread here where someone got problems with 512-byte disks, I think. Search for a thread called "iostat error solaris" or something like that.

I would do a scrub before doing anything else; a scrub cleans up all the errors.
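Roughly the order I would do it in (untested; pool name from the status output above):

Code:
# 1. scrub and let it run to completion
zpool scrub svpool2
zpool status svpool2      # watch the scan: line until it reports the scrub finished
# 2. only then bump the on-disk version; note this is one-way, older software can't import it afterwards
zpool upgrade svpool2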
 
You did not mix 512-byte-sector disks with 4K-sector disks, did you? There is another thread here where someone got problems with 512-byte disks, I think. Search for a thread called "iostat error solaris" or something like that.

I would do a scrub before doing anything else; a scrub cleans up all the errors.
Nope, they are all 512-byte disks. I haven't RMA'd any of them yet, so they're the original disks (7K2000s).

The other vdev is made up of RE4-GPs, so those are also 512-byte disks.

That makes sense; I'll kick off a scrub. Sigh, that's another 24 hours, haha.

Maybe Hitachi will finally deliver my cold spare tomorrow.
 
Errr, so I did do a zpool clear, and the two faulted disks resilvered in two hours and came back online. The spares dropped back out.

Scrubbing now. Man, scrub sure is fast when you don't have any other I/O :D
 
I knew there was a reason why I love ZFS.

No errors during the scrub; everything looks good.

I have to find a better way to map the device IDs to physical slots now (the mailing list has some suggestions). I think I'll just inventory my drives from now on.
 
I knew there was a reason why I love ZFS.

No errors during the scrub; everything looks good.

I have to find a better way to map the device IDs to physical slots now (the mailing list has some suggestions). I think I'll just inventory my drives from now on.

The Norco backplanes with the SAS cables will map them 0/1/2/3 in order on almost every LSI controller. What I do is get those printable labels and label every row, and use a dd read from c1t1d0 to /dev/null to make the access light stay a solid blue. After you spot-check every drive to make sure you know which one is which, it's pretty easy to just check the labels on the edge of the case to see which drive to yank.
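On OpenIndiana/Solaris that dd trick is roughly this; the device path is just an example and the p0/s2 naming can differ, so check with format first:

Code:
# stream reads from one drive so its activity LED stays lit; ctrl-c once you've spotted the bay
dd if=/dev/rdsk/c1t1d0p0 of=/dev/null bs=1M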
 
What I do is get those printable labels and label every row, and use a dd read from c1t1d0 to /dev/null to make the access light stay a solid blue. After you spot-check every drive to make sure you know which one is which, it's pretty easy to just check the labels on the edge of the case to see which drive to yank.

Agreed. I actually label each of my disks using the pretty WWN IDs that are created for SAS-controller-attached disks. That ID never changes on the same system, regardless of which port the disk is plugged into (admittedly I don't know if they remain the same if you move disks to a new system).

Code:
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c0t50024E900431D9ACd0  ONLINE       0     0     0
            c0t50024E900430BBF2d0  ONLINE       0     0     0
            c0t50024E900430BC2Ad0  ONLINE       0     0     0
            c0t50024E900431D9B0d0  ONLINE       0     0     0
          raidz1-1                 ONLINE       0     0     0
            c0t5000CCA368C083D4d0  ONLINE       0     0     0
            c0t5000CCA368C15222d0  ONLINE       0     0     0
            c0t5000CCA368C0C707d0  ONLINE       0     0     0
            c0t5000CCA368C087C2d0  ONLINE       0     0     0
          raidz1-2                 ONLINE       0     0     0
            c0t50024E900430BBEEd0  ONLINE       0     0     0
            c0t5000CCA368C153B9d0  ONLINE       0     0     0
            c0t5000CCA368C170F8d0  ONLINE       0     0     0
            c0t5000CCA368C146A1d0  ONLINE       0     0     0
          raidz1-3                 ONLINE       0     0     0
            c0t5000CCA369CD0693d0  ONLINE       0     0     0
            c0t5000CCA369CDA05Fd0  ONLINE       0     0     0
            c0t5000CCA369CE4495d0  ONLINE       0     0     0
            c0t5000CCA369CD23DAd0  ONLINE       0     0     0
        cache
          c15t3d0                  ONLINE       0     0     0

Whilst I keep a copy of the WWNs and port locations, I still prefer to power down the box in the event of a failure and manually check the disk IDs. It's only a home SAN for my lab and files, so downtime's not an issue.
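If you ever want the WWN-to-serial mapping without pulling drives, something like this should do it on a Solaris-family box, as far as I remember:

Code:
# prints vendor, product and serial number for every disk it knows about
iostat -En
# 'format' also lists each cXt<WWN>d0 device it can see (quit out without selecting a disk)
format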
 
To the OP: I had this situation a couple of years ago. The fix? I replaced the power supply.

You should take out the PSU in your system and replace with a known good PSU. Hell, use your desktop PC's PSU if you have to.

You have to go back to first principles when you have a failure like this. Start with the PSU.
 
Good idea. Since no disks outside that vdev were faulty, I hadn't considered anything other than the controller. I think the two backplanes in question are on the same rail/cable, so it's definitely a possibility. No issues/faults/events with voltages, though.

If I see it again I'll consider replacing it. Hurrah for ZFS' hardiness.

PS: I'm going to go old school, pull the drives, and copy the WWNs (IIRC, they're labeled). It does seem as though it's keeping them in order (zpool status shows 0-7 in order), from what I can tell by looking at the serials. Honestly, I'm just going to do what I should have done from the beginning and keep an inventory (a spreadsheet on Google Docs :)) of all the relevant disk information, so I can handle warranties without access to my OI box, or if a disk fails completely and I can't query the serial number... which happens at times. I should combine this with a more robust script that e-mails me the device ID of any faulted device (something like the sketch below); right now it just notifies me that a pool is DEGRADED/FAULTED.
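A rough sketch of that notifier, run from cron (mailx, the address, and the exact "healthy" string are assumptions; adjust for your box):

Code:
#!/bin/sh
# e-mail the full status of any unhealthy pool, which includes the faulted device lines
STATUS=`zpool status -x`
if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mailx -s "ZFS pool problem on `hostname`" admin@example.com
fi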
 
You have two identical arrays except for the disks: same controller, system, PSU, software. The only difference is the disks, right? So you should be worried about the disks, dog.

PS: I don't use all the same disks in one array.
 
You won't get any events/faults reported on the rails in situations like this - the voltage polling rate is too slow for it to get flagged. I would replace the PSU and see if the problem goes away. Check the connections to the backplane. Check the connections on the controller. Replace the controller<->backplane cable. Replace the controller. Put the controller in a different slot, etc.

If it were bad RAM, you would probably not get any CRC errors, as the damage to the data was likely done before the CRC was calculated and written to disk. We can probably rule that out for now, as you're getting errors right away, heh.

You've got lots of things to try, and replacing the PSU is the fastest and easiest, which is why I always do that check first (and a lot of the time the PSU is to blame! BAD PSU, NO ELECTRON!). :)
 
You have two identical arrays except for the disks: same controller, system, PSU, software. The only difference is the disks, right? So you should be worried about the disks, dog.

PS: I don't use all the same disks in one array.
I'm worried about the disks, but with no SMART data or current errors to back it up, it's kind of hard to say for sure.

After the controller was replaced, I went through two resilvers against just the two drives that originally faulted out (different backplanes); the third one never faulted, but started showing errors. And now a scrub. Those are pretty intensive I/O operations, so I would think that if something were still faulty after the controller replacement, it would show up somewhere (in zpool status, iostat, or SMART).

I also ran Memtest to rule out the CPU/memory.

I'm still planning on replacing the disks that originally faulted, but meh, that will take a few weeks; I'm going to do it one at a time since it takes so goddamn long to get drives back from Hitachi.

I also have two more HBAs (I'm saving the expander for the expansion case) that I'll swap in. I'll do the HBA holding my spares first and run through some replace scenarios.

Can anyone recommend a good PSU? I might as well keep one on standby. I have a Corsair 850HX (Norco 4220); I guess I should look into some redundant power supplies too...
 
Just about anything Corsair seems good - I use the Corsair 620HX in my server. That one is out of production, but the 650HX would work.

I have 24 drives in the 4224 with the Corsair and it works fine.

So you didn't replace the PSU yet? What is your hardware, anyway?
 
Just about anything Corsair seems good - I use the Corsair 620HX in my server. That one is out of production, but the 650HX would work.

I have 24 drives in the 4224 with the Corsair and it works fine.

So you didn't replace the PSU yet? What is your hardware, anyway?
Motherboard: Supermicro X8DTH-6F
CPU: Intel Xeon E5520
Memory: 6x 2GB Patriot ECC modules (runs @ 800 since the slots are populated)
Controllers: Used to run 3x USAS-L8i (LSI 1068E); now running 2x IBM M1015 (LSI SAS2008) and 1x USAS-L8i
Case: Norco 4220, rackmounted in a cabinet
UPS: Some CyberPower one :\ (before the cabinet)
Power supply: Corsair 850HX
HD:
2x raidz2 vdevs (8x Hitachi 7K2000 and 8x Western Digital RE4-GP)
3x hot spares (Hitachi 7K2000s)
1x Fujitsu 15K 73GB SAS drive (L2ARC; hey, I got it cheap, why not)
Running ESXi 4.1
OpenIndiana b151a; the controllers are of course passed through. OI gets 2 vCPUs and 6GB of RAM (reserved).
 
Neat. I didn't know Patriot made ECC RAM!

What a coincidence - I just got four M1015s myself, but haven't tried them yet; still using BR10i cards for now until I get my Supermicro X9SCM-F.
 
I'm worried about the disks, but with no SMART data or current errors to back it up, it's kind of hard to say for sure.
I wouldn't trust SMART too much. Sometimes disks crash without any warning, and sometimes SMART warns but the disk works fine for several more years. ZFS is much more sensitive to errors and notices the slightest problem. It could be a problem with the PSU, which SMART will never notice, because SMART only knows about the disk. ZFS, on the other hand, sees the whole chain from RAM down to disk and reports the slightest problem anywhere along it; SMART monitors only the disk, and even that monitoring is not very good. ZFS does a much better job of this.

I trust ZFS more than SMART.
 
I wouldn't trust SMART too much. Sometimes disks crash without any warning, and sometimes SMART warns but the disk works fine for several more years. ZFS is much more sensitive to errors and notices the slightest problem. It could be a problem with the PSU, which SMART will never notice, because SMART only knows about the disk. ZFS, on the other hand, sees the whole chain from RAM down to disk and reports the slightest problem anywhere along it; SMART monitors only the disk, and even that monitoring is not very good. ZFS does a much better job of this.

I trust ZFS more than SMART.
For sure; that's why I went the ECC route, to make sure everything in the chain had some kind of protection. SMART's just one source (yeah, I've never had a failing disk give me SMART errors), but considering the errors haven't been consistent since the controller swap, at best it's a transient issue that will be a PITA. Since I can RMA the disks, I will. I can RMA the PSU too, I suppose, but then the whole thing has to go down :(

I've been looking at redundant power supplies, but compatibility is questionable. Man, why is it so hard to win the lotto? :) A Supermicro or Chenbro case with redundant PSUs... sweet!
 