Trying to read a specific file crashes my RAID card

I have an LSI MegaRAID 9260-8i RAID card. It was originally an IBM ServeRAID
M5014 card, but since those are just re-branded 9260-8i cards I
reflashed it, and it worked fine for a few years. The system runs Windows 7 Professional 64-bit.

It has four 3TB Western Digital Red drives connected to it with a SAS-to-SATA breakout cable in a RAID 5 configuration.

Lately I have been having a lot of issues with the RAID controller itself crashing; the error logs keep saying that the firmware "detected a possible hang", or that it crashed and rebooted. Originally I thought this was a firmware issue, since a recent update came with a warning about backplanes causing problems (and unless the card sees the SAS-to-SATA cable as one, I am not using a backplane).

I had posted about it previously here: http://hardforum.com/showthread.php?t=1840958

However, after much trial and error while attempting to back up my data, I found the source of the crash... but I have no idea why it could make the controller crash, or what to do to fix it.

Out of the hundreds of folders and hundreds of thousands of files spread across the 8TB array, a single file is causing this. I can access the entire rest of the RAID 5 array indefinitely with no problems, but attempting to read past roughly the 80% point of that one file crashes the card itself!

This makes no sense to me. Isn't the whole point of a redundant disk setup and a dedicated controller card that it can cope with even an entire drive failing, and warn you so you can replace it (assuming you aren't running RAID 0)? So why would not even a bad disk, but a single FILE, crash the card itself? If the filesystem has corruption, that should produce a read error in Windows, or maybe crash Windows, not the card, right? And if it's a hardware problem with a physical drive, the RAID card should notice the read error and report it, not crash, shouldn't it? I know the issue isn't limited to Windows, either, since attempting to create a backup image with an Acronis boot disk also crashed the card when it reached the same point.

I have no idea what to do. I really don't care if I have to delete the file, it's nothing important, but right now I'm worried that even deleting it would crash the card again. Or, if it's not the file itself but that particular area of one disk, deleting it just means the same problem comes back when a new file gets written there. I'm also not sure it's wise to run a chkdsk on the array: the card might still crash when chkdsk reaches that area of the RAID 5, and if the controller goes down mid-scan, chkdsk could decide it found a million errors and try to "fix" them, corrupting tons of stuff in the process. That is, if it even is the physical location of that file and not somehow the file itself.

Any suggestions? Does the card itself have any kind of diagnostic or self-checking tools for this? Any idea how I can figure out why this is happening, or try to fix it?
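In case it helps with diagnosing: from a Linux live disk I can at least narrow down the exact offset that triggers the hang by reading the file in chunks and watching where it stalls. Roughly like this (the path and skip count are just placeholders for my setup):

    # read the file in 1 MiB chunks; watch how far it gets before the card hangs
    dd if=/mnt/array/badfile.bin of=/dev/null bs=1M conv=noerror
    # or jump near the suspect ~80% region (skip counts bs-sized blocks; 8000 is an example)
    dd if=/mnt/array/badfile.bin of=/dev/null bs=1M skip=8000 conv=noerror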
 
Can you run chkdsk on the array? Maybe you just have a bad sector on one of the drives.
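A read-only pass from an elevated command prompt would be the safe first step (E: standing in for the array's drive letter):

    rem read-only scan; reports problems without changing anything on the volume
    chkdsk E: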
 
Is indexing enabled on that array?

What about the fragmentation?

It could very well be that some file system corruption is triggering a very rare buffer overflow on the RAID card, causing it to crash.

Can you try cutting and pasting that file to a different drive, maybe with a Linux boot disk?
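If the NTFS volume mounts under a Linux live disk (via ntfs-3g), GNU ddrescue is made for pulling data off media that errors out. Something like this, with example paths:

    # copy what it can, skip unreadable areas, keep a map file so it can resume
    ddrescue /mnt/array/badfile.bin /mnt/backup/badfile.bin rescue.map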

Do you have all your data backed up somewhere?

I bet running chkdsk /f and then defragging (and maybe disabling indexing) will fix your problem. But I would either move or get rid of that file first.
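Concretely, assuming E: is the array, that sequence would look something like:

    rem repair pass; only once backups are safe, since it locks the volume
    chkdsk E: /f
    rem Windows 7 defrag: /U shows progress, /V gives verbose output
    defrag E: /U /V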

If it were a drive going bad, I would expect the RAID controller to pick it up.
 
Any clue what drive it is? Can you access the SMART info on any of the drives? If so, whichever one is different from the others needs to be replaced.

Ideally, your card shouldn't hang. But if there is no firmware update for it, then your next option is to fix the underlying problem.
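For the SMART check: if you can boot Linux, smartmontools can usually query drives behind a MegaRAID controller directly. Something like the following, where megaraid device IDs 0 through 3 should cover your four drives (the device name is an example):

    # -d megaraid,N routes the query through the controller to physical drive N
    smartctl -a -d megaraid,0 /dev/sda
    smartctl -a -d megaraid,1 /dev/sda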
 
Failed drive, definitely.

Who knows what the drive itself could be doing; maybe some crazy drive firmware error is locking up your RAID card.

That's why everyone says that RAID isn't a backup: crap like this does happen occasionally.
 

Do you mean if indexing is enabled in Windows? Or are you talking about some kind of RAID function? Because no, I disabled indexing in Windows.

I haven't touched the default defragment settings however.

And I tried using an Acronis backup boot disk, but it also crashed the RAID; attempting to read the file crashes it at about the 80% point. I don't care about the file, though. I'm more worried about what could happen if I try to delete it or run a scandisk, as I mentioned.

And yeah, this is why I am confused: my card should be able to warn me about, or somehow handle, a failing drive, yet a single file is crashing it, and as long as I don't touch that file it runs fine. It makes no sense.


The RAID's health, as well as that of all the drives in it, is listed as fine in the card's management software.

The previous firmware update, which I assume caused this because of that backplane (whatever that is) issue, was released around September or so, I think. But there was also a new update about 2-3 weeks ago that I was hoping would fix it, back when I assumed it was a firmware defect.
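Next I'll try pulling the controller's own event log and the per-drive error counters with LSI's MegaCli, in case the management GUI is glossing over something. Assuming the 64-bit CLI and a single adapter, roughly:

    rem dump the controller's event log to a text file
    MegaCli64 -AdpEventLog -GetEvents -f events.txt -aALL
    rem list physical drives, including media/other error counts
    MegaCli64 -PDList -aALL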
 
Copy everything off of the array, rebuild the array from scratch, and then copy everything back.

It really sounds like one of these:
The array is corrupt
The RAID card has a bug
A drive is starting to fail
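A consistency check would test the first one: it reads every stripe and verifies the parity across all four drives, which exercises exactly the kind of reads that are hanging the card. With LSI's MegaCli it's roughly this (logical drive and adapter numbers are examples):

    rem start a consistency check on logical drive 0, adapter 0
    MegaCli64 -LDCC -Start -L0 -a0
    rem poll its progress
    MegaCli64 -LDCC -ShowProg -L0 -a0

Fair warning given the symptoms: if the bad region is physical, the check itself may trigger the same hang, so make sure the backup is current first.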
 
I already have a backup of the data just in case; that's actually how I found out it was a single file and not the entire drive. The backup kept failing at the same spot in the same file.

I eventually gave up trying to create a backup image and basically just dragged and dropped the entire array, sans that one file, onto a backup drive.

Though I do want to find out what is causing this, since if something is physically damaged on one of the drives, restoring a backup would just make the same thing happen again with a different file.
 
Copy the data back to the array once it is rebuilt on the same drives (provided your logs show that these drives have not reported SMART or timeout/read errors).

Once the data is copied back, defrag the volume, run a check disk on it, run an extended benchmark; go nuts. If it shows signs of failure or inconsistency (with no drives reported in the logs as failing/failed), then your card is likely the thing that's failing.
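For the extended read test, a full sequential pass over the whole array from a Linux live disk is a simple stand-in; any hang or error pins the problem to a specific offset (the device name is an example):

    # read every sector of the virtual drive; an error or hang gives you an LBA to chase
    dd if=/dev/sda of=/dev/null bs=1M

badblocks -sv /dev/sda does a similar read-only pass with per-block reporting, if you prefer that.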

Keep your backups through all of this.
 