Data corruption on Perc 6/i RAID array

I'm trying to figure out why I keep getting data corruption on this T310 server with a Perc 6/i RAID controller. These are brand new disks, and all firmware on the server is updated (EFI, Lifecycle Controller, RAID controller, etc.).

4 x 2TB Seagate ES SATA disks purchased June 2015

No SMART errors, no error logs in iDRAC, nothing of note anywhere other than repeated boot issues and data corruption.

Basically, if I had to reboot because of updates, drivers, etc., the BCD would get corrupted. I can rebuild the BCD and get it to start booting, but there is always some file corrupted. NDIS.sys last time, for example, but other files have been corrupted before and caused the system to fail to boot.
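
For reference, by "rebuild the BCD" I mean roughly this kind of sequence from the WinRE command prompt (just a sketch; drive letters vary, and the offline SFC pass is for whichever system file got corrupted that time):

Code:
rem scan for Windows installs and rebuild the boot configuration data
bootrec /scanos
bootrec /rebuildbcd
rem recreate the boot files pointing at the Windows volume (C: here is an assumption)
bcdboot C:\Windows
rem offline integrity check for corrupted system files like NDIS.sys
sfc /scannow /offbootdir=C:\ /offwindir=C:\Windows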

I can't find any errors, logs of note, or failing hardware. Drives show SMART status as good and the RAID 10 array is healthy... not sure where to go at this point other than replacing parts willy-nilly.
 
What OS?

I had a customer with a Dell server running 2003 that started getting random corrupted files. It was about the worst possible failure to have. It started with a single Excel spreadsheet and snowballed to losing critical files like large Quickbooks company files and large proposals. Customer was not happy.

We had good backups, but it was really hectic keeping up with file restores and trying to figure out the problem.

I went through everything... possible controller problems, BIOS updates, replacing the RAID-1 array with new drives, 2003 patches, etc. The problem persisted.

I finally ended up replacing the six year old server with a new HP server running 2012. After that everything was fine.

It appeared the failure may have been in the 2003 OS itself under a specific set of circumstances. There were patches for 2003 relating to file corruption; I applied them, but they didn't help. It may also just have been the mobo or disk controller getting old from running 24/7 for six years. One odd thing is that the smaller OS partition never got corrupted, only the larger 400GB partition.

I've also had a couple of Dell disk controllers with bad/leaking caps on them fail. Pull the controller and examine the caps.

Good luck, random file corruption is the worst possible failure on a server. :(

Edit:

Check the Event logs; I had "Volume is corrupted" errors telling me to run chkdsk. I did run chkdsk many times, but the fix was only temporary (see the quick query sketch below).
There is also the remote possibility of one of those hostage-ware viruses that encrypts data, those are nasty.
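
If you want to pull the relevant entries quickly, something along these lines from an elevated command prompt should surface the disk/NTFS errors (a sketch from memory; adjust the count and providers to taste):

Code:
rem 20 most recent NTFS events from the System log, newest first
wevtutil qe System /q:"*[System[Provider[@Name='Ntfs']]]" /c:20 /rd:true /f:text
rem same idea for the disk driver itself
wevtutil qe System /q:"*[System[Provider[@Name='disk']]]" /c:20 /rd:true /f:text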

Edit#2:

Looking back at the issue I had, I should have swapped out the disk controller. Logically it makes the most sense as the point of failure (after hard drives).
I don't have a warm fuzzy feeling about Dell disk controllers anyway. The customer was due for a new server regardless, though.
 
You might try toggling the write cache policy on the individual drives and/or setting no read-ahead, and perhaps flashing the BMC to the latest f/w.
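
If OpenManage isn't handy, the Perc 6/i is LSI-based, so MegaCli will usually talk to it; roughly something like this (just a sketch, verify your adapter/LD numbering and note the current settings first so you can put them back):

Code:
rem show the current cache policy for all logical drives
MegaCli -LDGetProp -Cache -LAll -aAll
rem write-through, no read-ahead, and disable the physical drive caches
MegaCli -LDSetProp WT -LAll -aAll
MegaCli -LDSetProp NORA -LAll -aAll
MegaCli -LDSetProp -DisDskCache -LAll -aAll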
 
Have you tested individual drives with different tools?

Verified all cables are good?

Tested memory?

On that note, make sure backups are being done, perhaps twice a day, so the data is safe.
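
If you're using Windows Server Backup, one rough way to get that second daily run is a scheduled task on top of the normal schedule; a sketch (the task name, time, and E: target are placeholders):

Code:
rem add a midday run of Windows Server Backup to whatever target drive you use
schtasks /create /tn "Midday Backup" /sc daily /st 12:00 /ru SYSTEM /tr "wbadmin start backup -backupTarget:E: -allCritical -quiet"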
 
^ What he said, but make sure they are real backups, not just Shadow copies.
 
Yep, what these guys said about backups.

I found that it was critical to manage backups properly during that catastrophe.

I set a specific backup RD1000 cartridge as read-only for a while because it was our latest backup before any signs of data corruption. We went back to that cartridge many times for restores. I did keep running new backups but found I was sometimes restoring corrupted files. Then I had to drop back one backup at a time until I could get a good file to restore.

The customer was not happy when she did payroll for about 60 employees and then the server immediately corrupted Quickbooks before a backup could be made. They had to call the bank and undo all of the direct deposits and then re-do the payroll in Quickbooks running on a workstation. That was before we realized how serious the problem was.

That day was the deciding factor to replace the server. They just had to stay out of their data for two days while I located a suitable new server and installed it. Seemed like it was impossible to find new servers that week too, it was crazy, everybody was out of stock. Finally got the HP from a big reseller but had to drive 100 miles to go get it.

Dell said wait time was six weeks on a new server. :rolleyes:
 
Have you tested individual drives with different tools?

I'll preface my reply by saying that I have a Perc 5/i and WD drives, but:

Test each drive off the raid controller and make sure the drive is not having issues.

Does the data corruption happen if you boot, write data and reboot within 5 minutes? Maybe some sort of spin down is happening.

WD has TLER issues sometimes. Not sure about this brand?
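
One way to do the individual-drive testing off the controller is to boot the test machine from a Linux live USB and run a long SMART self-test plus a read-only surface scan per drive; a rough sketch (sdX is whichever drive is under test):

Code:
smartctl -t long /dev/sdX     # start the extended SMART self-test
smartctl -a /dev/sdX          # check attributes and self-test results when it finishes
badblocks -sv /dev/sdX        # read-only surface scan of the whole drive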
 
Thanks for all the suggestions.

I had already started running a loop of Memtest86 after running through the gamut of built-in Dell HW diagnostics; so far, with 13 passes, I have no errors. The next step is probably individual drive testing, which I was hoping to avoid, but that's going to happen after I get through a full 24hrs of memtest.

What do you guys recommend for thorough hard drive testing? I've noticed the data corruption generally tends to be towards the beginning of the volume, so I was considering running HDD Regenerator on each drive individually.

I bought enterprise drives, so TLER (CCTL in Seagate's case) shouldn't be an issue. I haven't seen any issues with the array health, no drives dropping from the array, etc.

This, incidentally enough, was my backup server. Luckily, I have backups of backups, tape backups offsite, and multiple snapshots and copies of critical data, so there literally hasn't been a day without backups given the level of redundancy I built into the system. Roughly 80% of all the storage available to me is involved with backups, copies, off-site storage, etc., with replication and backup copy jobs. I honestly couldn't be happier with the overall setup; PM me if you have any questions on how it's set up.
 
Interesting that you mention no problems with the array, it was the same with the case I mentioned.
The corruption had more to do with logical errors in the 2003 OS I think, not really a hardware failure.

I hope you get 'er figured out, please let us know what you find.

PM sent on backup strategy, I always like hearing about new/better ways to do stuff.
 

Not to derail, but how do you guys deal with MF users writing novels in path names screwing up backups?
 
I would disable all write caching 1) on the controller, 2) in the OS. Then, re-test.
 
On the controller it's already disabled, and if I reinstall the OS I will double-check that setting, but I believe oftentimes that's not a configurable option in the OS. Usually it says something along the lines of "this option is not configurable." Correct me if I'm wrong.
 
Sorry, I had a day off yesterday, but I'm running 2008 R2 SP1 currently. I ran Seagate SeaTools on all the drives (well, my colleague did), and they all tested good. I found a few references to issues with using large GPT volumes as the Server 2008 R2 boot/Windows volume. So I did a sliced RAID array, temporarily, and am going to write/reboot/test, rinse and repeat, for a week or so and see how it goes.

I am reinstalling Veeam backup, and this is just going to be for secondary replication, and to dump to tape while I do more testing.


Edit for clarification: I did a sliced RAID array, made a VD for the OS with basic MBR partitioning, left the server BIOS on legacy boot, and then initialized the other VD as GPT to get the whole space. So it's 150GB for the OS and the other 3.5TB for the data volume (rough diskpart sketch below).
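
For the data VD that works out to roughly this in diskpart (just a sketch; the disk number and drive letter are whatever your layout uses, and clean wipes the disk, so only run it against the empty VD):

Code:
diskpart
rem select the data VD (number from "list disk"), wipe it, and make it GPT
select disk 1
clean
convert gpt
rem single partition across the whole ~3.5TB with a quick NTFS format
create partition primary
format fs=ntfs quick label=Data
assign letter=D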
 
>>Edit for clarification: I did a sliced RAID array.....

So this is a change you just made to try to remedy the problem, correct?

The boot volume was a 4TB GPT volume before?

Hopefully that fixes it for you.
 

That's exactly right. I've seen a couple of reports of people having issues with large GPT volumes on Win 7/Server 2008 R2.

The only problem now is that, because I haven't found a root cause, I'm really just making an educated guess at this point and testing; I'm not sure I can trust this server in a full production environment again. I'm just going to keep testing and figure out what I want to do with it.
 
Just an update: after extensive copy/read/write operations on both volumes, I have not seen any more data corruption or any errors in the event log. I'm going to continue testing through the holidays (at least through Christmas).
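
If anyone wants to reproduce the copy testing, a crude batch loop like this does the job: copy a large reference file repeatedly and binary-compare each copy so silent corruption shows up right away (the paths and pass count here are just placeholders):

Code:
@echo off
rem copy a large reference file 50 times and flag any copy that doesn't match bit for bit
set SRC=C:\test\reference.bin
for /l %%i in (1,1,50) do (
    copy /y "%SRC%" "D:\test\pass%%i.bin" >nul
    fc /b "%SRC%" "D:\test\pass%%i.bin" >nul || echo MISMATCH on pass %%i
)
echo done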

I'm debating whether I want to get a couple of 120GB SSDs in RAID 1, but I don't really have a good place to mount them, other than buying a 5.25" bay that has 2.5" slots (like the Icy Dock ones). I could buy and install a SATA expander for the Perc 6/i controller and put a couple of SSDs in an Icy Dock. The OS sees very few transactions; it's all primarily on the spinning drives, and the Veeam backup is installed on the spinning drives as well.

Any other suggestions?
 