Best way to troubleshoot disk or adapter issues?

rthorntn

What a nightmare, luckily I have no data on there yet!

ESXi 5.5 on a Supermicro X9SRH-7TF with 24GB ECC RAM, the onboard LSI 2308 IT in passthrough mode.

32GB and 64GB SLC SSDs

Six Hitachi HDS5C302-A580 2TB drives (new disks in vacuum-sealed bags dated Jan 2011 that have been lying around gathering dust for the best part of two years, shame on me), connected with whatever SATA cables I could find (I did suspect dodgy cables and swapped them around a bit and added a few new ones, but no luck there).

So I set up Solaris 11.1 following the instructions at http://www.abisen.com/lsi2308-with-solaris11.1.html. Everything was working until I created a pool from the CLI, added napp-it and ran filebench, then all hell broke loose... Probably important: I stupidly accepted the ESXi default of 3GB RAM for the 11.1 VM!

I set up the pool like this:

zpool create raidarray mirror c0t5000CCA369C5BAA6d0 c0t5000CCA369C66F1Ed0 mirror c0t5000CCA369C70ED5d0 c0t5000CCA369C728F2d0 mirror c0t5000CCA369C72952d0 c0t5000CCA369C72955d0

It worked fine at first. Then, while running filebench, the console and UI slowed down (probably the beginnings of memory exhaustion) and the zfs status was showing the first vdev as degraded, so I cleared the fault and scrubbed it. Very bad idea: the console and the UI froze completely and I had to power cycle. While it was down I cranked the RAM to 8GB. After that, Solaris wouldn't boot unless I used the pre-napp-it GRUB entry, and when it did load the raidarray was missing; no matter what I did I couldn't get the same six-disk striped mirror back again. At some point before the power cycle I caught a glimpse of read and write errors on both disks in the first vdev mirror, c0t5000CCA369C5BAA6d0 and c0t5000CCA369C66F1Ed0. Bugger!
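
For reference, the sequence I ran at that point was roughly this - from memory, so treat it as a sketch rather than an exact transcript:

zpool status raidarray   # first vdev showing DEGRADED with read/write errors
zpool clear raidarray    # cleared the fault
zpool scrub raidarray    # the scrub that froze the console and UI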

So I tried Ubuntu, and it wasn't happy with two of the six disks. The naming and positioning were different, so I can only assume they're the same two; SMART checked out on all but one disk.

FreeNAS: same story, not happy with two of the six disks. The naming is different again and so is the positioning.

I have tried fdisk, parted, gparted, format... most of the tools just froze on the suspect disks and I could see errors in dmesg. I have hit a brick wall for today.

What are the chances of two new HDDs out of six being DOA? What should I do to troubleshoot from here? I will take all the disks out, hook them up to a standard SATA port one by one and run some tests on them - but with what OS and what tools?

To come totally clean, the disks aren't properly physically mounted yet: I built a basic aluminium frame out of MakerBeam and slotted the six disks into it, with gaps in between them and a 140mm fan blowing air through the gaps. Death by combined vibration did pop into my mind, but surely that wouldn't happen after a lot less than 24 hours of total run time?

Please help.

Cheers
Richard
 
It sounds like you're doing a lot of jumping around in a way that doesn't really isolate variables, which makes troubleshooting difficult.

Can you boot a basic instance of Ubuntu off a live disc and run a SMART scan on a single disk? If it fails, try swapping out the SATA cable. If that works, try moving the disk across all your SATA ports - that would tell you which ports or controllers might be bad. It could also be a weird cable and physical port interaction.
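
If it helps, on an Ubuntu live disc that's just smartmontools - something along these lines, where sdX is whichever device the suspect disk shows up as:

sudo apt-get install smartmontools
sudo smartctl -i /dev/sdX        # confirm the drive is detected and shows the right model/serial
sudo smartctl -a /dev/sdX        # full SMART health and attribute dump
sudo smartctl -t short /dev/sdX  # kick off a short self-test; results land in the self-test log a few minutes later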

Check all your firmware versions too. That might help you identify why specific controllers aren't working like you'd expect.
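
For the HBA side, LSI's sas2flash utility will report the firmware on a 2308 if you have it handy (it's a separate download from LSI, so take this as a pointer rather than a recipe):

sudo sas2flash -listall    # lists SAS2 adapters with firmware and BIOS versions

and smartctl -i will show each drive's firmware revision.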
 
I have had a lot of problems with ESXi 5.5 and OmniOS - up to data corruption and OS freezes. I found that the problem was ESXi 5.5 with the e1000 vnic; it went away once I disabled TCP segmentation offload and LSO.

To verify, you can either use ESXi 5.1, try the vmxnet3 vnic (needs VMware tools installed), or modify the e1000 settings - see http://www.napp-it.org/downloads/index.html
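
For the e1000 route, the switches live in /kernel/drv/e1000g.conf on OmniOS/Solaris. From memory the relevant lines are the ones below, but double-check them against the napp-it notes at the link above before relying on them:

# /kernel/drv/e1000g.conf
tx_hcksum_enable=0;   # disable TX checksum offload
lso_enable=0;         # disable large segment offload

then reboot the VM so the driver picks the settings up.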

You can also download my ready-to-use ESXi appliance (OmniOS + napp-it with ESXi tools installed), which is prepared to run on ESXi 5.5 (with vmxnet3 and modified e1000 settings), to verify whether this is your problem too.
 
Thanks!

raiderj - totally agree, I was clutching at straws hoping the disks would come good under another OS. I will do as you say under Linux.

_Gea - I didn't even notice there was a napp-it ESXi VM, that's awesome. Once I have isolated any faulty disks I will probably go with that option.
 
Unbelievable: on regular SATA with bare-metal Ubuntu 12.04 server, the two bad drives basically stop Ubuntu from booting with the same "buffer I/O error". Oddly, they were mounted in the "cage" right next to each other. It does mean I didn't get to run SMART on them; the other four disks passed.
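
For anyone searching later, pinning errors like that to specific drives from a live environment is just a matter of something like this (sdX is a placeholder):

dmesg | grep -i "buffer i/o error"   # shows which device nodes are throwing the errors
ls -l /dev/disk/by-id/ | grep sdX    # maps the device node back to a model/serial so you know which physical drive it is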

Brilliant (sarcasm). I just checked and I bought these drives in May 2011 from the USA via eBay (I'm in AU), so I will have to send them to the seller in the US to claim warranty...

Because they are basically new, is there a risk I spend the dollars to send them back and HGST says they were dropped in transit years ago?

UPDATE: it appears to be fixed - using parted to put a GPT label on both disks got rid of the errors, and SMART gives PASSED on all six disks connected to the LSI.
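
For the record, the relabelling was nothing more than this per disk (sdX being each of the two suspect drives; mklabel wipes the partition table, so only do it on empty disks):

sudo parted /dev/sdX mklabel gpt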

Question 1: parted was showing 512-byte sectors, but I thought the 2TB Hitachis I have supported 4k - do I need to do anything?
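
In case it matters, parted's print output should list both logical and physical sector sizes, so maybe I was only looking at the logical figure; something like this would confirm (sdX is a placeholder):

sudo parted /dev/sdX print                    # the "Sector size (logical/physical):" line shows both
cat /sys/block/sdX/queue/logical_block_size   # 512 on just about everything
cat /sys/block/sdX/queue/physical_block_size  # 4096 on advanced-format drives, 512 on native 512 drives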

Question 2: should I run any other scans on these disks in Ubuntu?

Thanks for the help!

UPDATE 2: back to square one - kicking off a short smartctl run on the troubled two fails on bare-metal Ubuntu :(
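
For completeness, that's just the standard short test followed by the self-test log, nothing exotic (sdX being each of the two drives):

sudo smartctl -t short /dev/sdX      # start the short self-test
sudo smartctl -l selftest /dev/sdX   # read the self-test log - this is where the failures show up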
 
You might check out this thread on testing hard drives that I created not long ago: http://hardforum.com/showthread.php?t=1804983

Basically I picked up 8x 4TB drives for my file server and am currently finishing up a badblocks scan on all 8 drives. So far all have come up without errors or related SMART flags, which tells me the drives should be ready to go into my file server as a RAIDZ2 array. The scan takes a LONG time to complete - 3TB drives took ~72 hours, and my 4TB drives are coming in longer than that, as expected.

The way I look at it, I might as well make sure the drives can handle a basic write/read test before I use them. If any of the drives fail, I'll save myself time and energy down the road by replacing them now rather than waiting until other errors crop up. Of course, they could fail the second I get them running with actual data, but that's why I have redundancy in place and, of course, backups.
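
The scan itself is nothing fancy - per drive it's roughly the following (badblocks -w is a destructive write test, so only run it on drives with no data, and sdX is a placeholder):

sudo badblocks -wsv -b 4096 /dev/sdX   # four-pass destructive write/read test over the whole drive
sudo smartctl -a /dev/sdX              # afterwards, check reallocated / pending sector counts haven't moved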

To your question about sector size, chances are your drives emulate 512-byte sectors but actually have 4096-byte physical sectors. It's nothing for you to worry about unless you create a ZFS pool - then just set ashift=12. If you get it wrong it's only a performance hit that you may never notice in practice, but it's still good form to match your physical sector size (again, if you use ZFS).
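
Just to be clear on the mechanics: -o ashift=12 at pool creation time is the ZFS-on-Linux way of forcing it; FreeNAS and Solaris/OmniOS handle it differently, so treat this as a Linux-flavoured sketch with placeholder disk names:

zpool create -o ashift=12 raidarray mirror disk1 disk2 mirror disk3 disk4 mirror disk5 disk6
zdb -C raidarray | grep ashift   # should report ashift: 12 for each vdev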
 
Thanks raiderj.

Does anyone have any opinions on whether I should put the troubled two into a Windows machine and run WinDFT?

UPDATE: WinDFT fails. Fingers crossed HGST will accept two US disks in AU.
 
Bummer. At least you know where the problem is.

My next build is going to mirror very closely what you're doing. I plan to set up an ESXi + ZFS all-in-one server and use my just-replaced 6x 2TB drives in a triple mirrored-pair setup with some SSDs. Update the post with how it goes for you - it'd be good to pick up on any lessons you learn.
 