Hard errors on ZFS RAIDZ2 drives

farscapesg1

So, I'm finally getting around to building a RAIDZ2 array with 8 drives: 5 WD20EADS and 3 Hitachi 72K drives. Got the array built, but I'm getting hard errors listed on the drives as soon as I start writing to them. One or two I could believe, but I'm talking about pretty much all of them. On top of that, the system won't pull any SMART temperature for these devices. I thought maybe it was the SAS3081E-R HBA card I'm using, so I moved all of them to the onboard 1068e controller (which has the 4 1TB drives) and the M1015 card that has my 4 750GB drives, with the same issues.

Anytime I start transferring data to the drives, I can watch the hard errors count on these drives go up. Being ZFS, I wouldn't think that the problem would be TLER, and since these are mixed drives (which I've been told is okay to do with ZFS), I'm not sure what exactly to check.

 
Is it running Solaris/OpenSolaris/FreeBSD natively or is it virtualized?
What actual errors are you getting?
Have you tried running memtest86+ for 24h?
//Danne
 
Running OpenIndiana in a virtual machine with 10GB RAM and 2 vCPUs. The HBA cards are passed through to the VM. I'm not really seeing any actual errors, it's just my paranoid nature at seeing errors listed in the napp-it interface and no SMART data being reported.

The ESXi box has been running for over a year and at this point it would be very difficult to run Memtest on it...
 
Unless you can get the actual errors generated, there's no way anyone can give you advice on what's wrong and how to troubleshoot.
//Danne
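One way to get at the raw numbers behind the counters: on Solaris-family systems such as OpenIndiana, `iostat -En` prints soft, hard, and transport error counts per device. The following is only a minimal sketch in Python, assuming the usual first-line layout of each device block and assuming (not confirmed in this thread) that the counts napp-it shows correspond to these. The device names and sample text are made up for illustration. It parses two captures of the output and reports which devices had their hard error count climb in between, for example before and after a large copy.

```python
import re

# Matches the first line of each device block in `iostat -En` output, e.g.
# "c2t0d0   Soft Errors: 0 Hard Errors: 12 Transport Errors: 3"
# (layout assumed; adjust the pattern if your output differs)
DEVICE_LINE = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def parse_iostat_errors(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    counts = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            counts[match.group("dev")] = {
                key: int(match.group(key))
                for key in ("soft", "hard", "transport")
            }
    return counts


def rising_hard_errors(before, after):
    """Return {device: increase} for devices whose hard error count grew."""
    return {
        dev: after[dev]["hard"] - before[dev]["hard"]
        for dev in after
        if dev in before and after[dev]["hard"] > before[dev]["hard"]
    }


# Hypothetical captures, NOT output from the system in this thread.
SAMPLE_BEFORE = """\
c2t0d0           Soft Errors: 0 Hard Errors: 4 Transport Errors: 1
Vendor: ATA      Product: EXAMPLE DISK A   Revision: 0000 Serial No: EXAMPLE1
c2t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: EXAMPLE DISK B   Revision: 0000 Serial No: EXAMPLE2
"""

SAMPLE_AFTER = """\
c2t0d0           Soft Errors: 0 Hard Errors: 9 Transport Errors: 3
Vendor: ATA      Product: EXAMPLE DISK A   Revision: 0000 Serial No: EXAMPLE1
c2t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: EXAMPLE DISK B   Revision: 0000 Serial No: EXAMPLE2
"""

if __name__ == "__main__":
    before = parse_iostat_errors(SAMPLE_BEFORE)
    after = parse_iostat_errors(SAMPLE_AFTER)
    for dev, delta in sorted(rising_hard_errors(before, after).items()):
        print(f"{dev}: hard errors up by {delta}")
```

To use it on a real box, save the output of `iostat -En` to a file before and after a transfer and feed both files to `parse_iostat_errors`. Posting that output, along with `zpool status -v`, would give people something concrete to work from.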
 
You have moved them to another controller....

Have you moved them to a new breakout cable?
See if you get the same errors swapping the cable from the 750GB drives to the EADS drives ;)
 
I'll try that tonight I guess :) I do have one more spare cable I can swap out if it is a bad cable (always try to keep a spare around).
 
Take one of the problem drives and put it in another PC running non-virtualized windows. Run a diagnostic program to see if there are actual errors on the drive. Also look at the SMART info on this other PC.

(with napp-it you can sometimes see SMART info - I guess in your case you can't though :( )
 
Well, this is interesting. I swapped 4 of the WD EADS drives with the 750GB drives and I'm not experiencing the hard errors anymore, on either set of disks. Of course I also replaced the cable to that backplane, so now I'm not sure if it is a cable issue, power issue, or backplane issue.

I'll try swapping the 750GB drives to the next backplane down (which had the other 4 2TB drives that were reporting issues, and which still has the original cable) and see if I get the same problems there.
 
After all that moving I guess the drives just needed a new home to be happy :) 7 of the 8 drives were happy in their new locations in the case, with only one drive throwing hard errors and refusing to show any SMART temperature data. So I pulled it out, dropped it in another box and ran the Hitachi Drive Fitness Test to do a complete scan. It came back with no errors. I put it in a Windows box and checked the SMART status, and everything shows good; Active@ Disk Monitor reports it at 95% health with only 6 months of active runtime. So I put it back in the caddy, slid it back into the Norco, ran a zpool clear to remove the Faulted status... and now I get SMART temp readings in napp-it :confused: Guess the drive just needed to be kicked a little :p

I haven't gotten around to testing my RAID10 array in the last backplane slots, mainly because I've moved about 300 GB worth of VMs to it without a single error being reported, and I'm of a mind to leave a working system alone so I can get on with other things besides "tinkering" with the server for a while...
 
Dicky backplanes it sounds like.

Drives are not fully seating home against the SAS connectors.

Keep your eye on them to see if they wiggle their way out with a few months more vibration.;)
 
Could be. The Norco cases are known to have backplane issues a lot of times, especially the older models. Just find it odd since I've been running my WHS box in it for the last year and never had any problems with the drives... at least that I knew of ;) I'll definitely be watching it... and saving up for a replacement case :p
 
I love that ZFS is sensitive enough to pick up the slightest problem, even ones that SMART cannot detect. ZFS warns you long before SMART does.
 