Supermicro MegaRAID 3108 degraded, can't boot Windows or bootable USB sticks (blue screen)? ReFS volume failing

markm75

Gawd
Joined
Sep 12, 2008
Messages
856
I have a Server 2019 box: a Supermicro SAS chassis with an 8-drive RAID 6 array on a hardware MegaRAID 3108.
What happened originally is that I found Windows stuck at a black screen, so I reset the machine. Now it hangs at the Supermicro logo screen with the Windows spinning dotted circle, and Num Lock is unresponsive.

So I checked the array in the BIOS and found 2 drives missing. I moved those 2 (possibly bad) drives to another set of slots; they went to foreign mode and then started rebuilding.

First, I'm not sure why a degraded array would prevent Windows from booting. Even worse, I've tried 2 USB boot sticks that are known to work fine, and they have issues too.
I can get a USB stick to boot, but only to a blank blue screen. With Shift+F10 I can get to a command prompt; if I type setup and hit Enter, it says setup is starting but then sticks.

Back at the prompt, if I run diskpart (with or without the USB stick in), it hangs at the MINWINPC machine name (the booted WinPE environment from the stick) and never shows anything, though I can still open other prompts with Shift+F10.
I had already run a chkdsk on C: prior to all this.
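For reference, the checks I was attempting from the Shift+F10 prompt boiled down to roughly the following (a rough sketch; drive letters are whatever WinPE assigned, so treat them as placeholders):

rem Interactive diskpart hangs before the DISKPART> prompt ever appears:
diskpart

rem Scripted diskpart, to test whether disk enumeration itself is what hangs:
echo list disk > X:\dp.txt
echo list volume >> X:\dp.txt
echo exit >> X:\dp.txt
diskpart /s X:\dp.txt

rem What I had already run against the OS partition before all of this:
chkdsk C: /f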

I'm really at a loss here. (It's our DPM backup server with all data on the one RAID 6 array; the OS was carved out of it in the original setup.)
Not sure what to do. I do note that the 2nd "bad"/recovered drive is at 45% on its rebuild; the first drive fixed itself.

I'd think at least a boot stick should work, but I can't even get that far. Maybe one or both of these drives are causing some sort of cascading failure. (Last ditch will be to pull those two and try again, or disconnect the MegaRAID and try the USB stick with basically no drives.)

Any thoughts or suggestions?
 
Unplug all drives except what is needed to boot and see what happens.
 
If your boot drive is carved out of the RAID 6, and that's having issues, it's probably best to wait for the rebuild to finish.
Is that MegaRAID built into the motherboard?
 
If your boot drive is carved out of the RAID 6, and that's having issues, it's probably best to wait for the rebuild to finish.
Is that MegaRAID built into the motherboard?
The MegaRAID is its own card in a motherboard PCIe slot.
Yeah, I guess there's no choice but to wait and see.
 
Unplug all drives except what is needed to boot and see what happens.
Well, in this case the boot drive is part of the RAID 6 array on that add-on MegaRAID card, which is still rebuilding. But I've never had an OS fail to boot during a rebuild phase; to me it's strange.

I guess I could power it off, remove the card, and see if the USB stick then boots without sticking at the blue screen, then repeat the Shift+F10 / diskpart test. At the moment diskpart hangs and never does anything.
 
First, for the USB stick problem: if you just pull the MegaRAID card out and try to boot off the USB stick, does that work properly? Is your card an 8-port or a 16-port? If it's a 16, spin up a temporary drive, install Windows on it, and set the system to boot off that drive. That will give you a better overview of the card and its issues without the degraded array as a boot volume in the equation.
 
Update on what I did: the drive array finished rebuilding. I then tried to reboot, either to Windows 2019 or to the stick; either way, no go. It still got stuck at the spinning dotted circle (it keeps spinning, nothing happens, black BIOS screen). Booting the USB stick, it goes to the blue screen, not the setup window (I can do Shift+F10, but diskpart hangs).

Then I opted to pull out the 2 "bad" drives and try booting to the stick: same issue, it hangs and diskpart gets stuck.

So I then pulled all 8 SAS drives in the array and booted to the stick: no issue.

I then put a standard 7200 RPM SATA drive in the chassis and booted to the stick: no issue.

I put all 8 drives of the RAID 6 array back in the chassis for now. I had to import the foreign configuration, and now the 2 possibly "bad" drives are rebuilding yet again (it took over a day last time).
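For anyone following along, the foreign-config import can also be done from the OS side with LSI's StorCLI instead of the card BIOS. A minimal sketch, assuming the standard LSI tool talks to the Supermicro-branded 3108 and that it is controller 0:

rem list any foreign configuration the controller sees, then import it
storcli64 /c0/fall show
storcli64 /c0/fall import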

This is really strange. So a RAID 6 virtual drive could hang a boot stick, or Windows in general? If it's not one of the two drives hanging it, I guess it could be the other drives, or just the mere presence of that virtual volume, or perhaps corruption in it? I've never run into this before.
 
Welcome to hardware RAID in Windows! This is why I prefer the Areca cards to the LSI-branded models; the Areca has many more options for monitoring (Ethernet, so even if the machine is hung you can get to the card) and much better facilities for repair.
 
I've used Areca on many of our other servers in the past, yeah. Some regrets now.

Not sure what else to do to move forward. I thought maybe I could pull a few more drives from the array, in case one of them is causing the hang, but I'm afraid to pull any more while the array is already rebuilding onto 2 drives. I'm also not sure, if I shut down, yank 1 or 2 other drives, and try a boot, whether everything will just go back to its normal state when I put them back in later.
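Noting for later (once an OS is up again): rebuild progress can be checked from the command line with StorCLI rather than rebooting into the card BIOS. A sketch, assuming controller 0:

rem show rebuild progress for every slot on controller 0
storcli64 /c0/eall/sall show rebuild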
 
Actually, matters just got worse. I read online somewhere that someone suggested turning off the option ROM in the BIOS to at least be able to get to the USB stick and see if the volume shows up. However, I opted to turn it off for all 3 PCIe x16 slots, only 2 of which are in use (the video card and the RAID card). Now when I boot the machine it's stuck at a black screen; nothing comes on. I thought maybe I'd borked it, so I pulled the battery to reset the BIOS and left it unplugged for 10 seconds, but no difference. The motherboard is a Supermicro DP X10DAi-B.

I'm now scrambling to think of a way to quickly buy a NAS and attach it to a Hyper-V guest to get DPM going again; I think that should be possible.
 
Try pulling the card, drives and all, and see if the video comes back up. The video card shouldn't stall with the OROM turned off, and I expect you mean the Legacy OROM.
 
I was able to clear the BIOS; all OK BIOS-wise now.

I checked the SMART info: no errors on any drive, which is odd.
I would have thought that if those two drives were preventing the system from booting to a USB stick or to Windows on that virtual RAID 6 drive, then pulling them (the two I thought were bad) would have let it boot, but that wasn't the case.
At this point it's rebuilding again (the two new drives are due tomorrow), so I can't really pull random drives from the other 6 to figure out whether another one is causing the hangs.
I'm pretty sure it's going to boil down to getting those 2 new drives, and if that doesn't work, wiping out the virtual array and starting over (most likely losing at least the backed-up server data, though I may have a tape backup of the actual DPM server data).
 
Last edited:
Look into the log and see what the error was: SMART failure/prefailure, device timeout, etc. If the drive(s) tried to recover on their own from a simple CRC error and it took longer than the card allows, the card will drop the drive(s), with the expectation that the drive is failing and that one of the spares will take its place and begin a rebuild. That is why NAS/enterprise-rated drives are so important, especially in hardware RAID. WD calls this feature TLER, Seagate calls it ERC, and for hardware RAID it can make a big difference: problematic drives get dropped instantly so a rebuild onto a spare can start immediately.
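If you want to check whether ERC is actually enabled on those drives, smartmontools can usually query it straight through the MegaRAID (from what I recall the Windows build supports the same -d megaraid addressing as Linux; the ,N index and device name are per-drive guesses you would adjust, and the timeout values are in tenths of a second):

rem query the SCT error-recovery (ERC/TLER) timer on the first drive behind the 3108
smartctl -d megaraid,0 -l scterc /dev/sda
rem set a 7-second read/write ERC timeout (70 = 7.0 s) on drives that support it
smartctl -d megaraid,0 -l scterc,70,70 /dev/sda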
 
Look into the log and see what the error was: SMART failure/prefailure, device timeout, etc. If the drive(s) tried to recover on their own from a simple CRC error and it took longer than the card allows, the card will drop the drive(s), with the expectation that the drive is failing and that one of the spares will take its place and begin a rebuild. That is why NAS/enterprise-rated drives are so important, especially in hardware RAID. WD calls this feature TLER, Seagate calls it ERC, and for hardware RAID it can make a big difference: problematic drives get dropped instantly so a rebuild onto a spare can start immediately.
If by "log" you mean the LSI MegaRAID BIOS area, I'm not seeing any records there for some reason.

The drives are 8TB Seagate Exos enterprise drives in this case (ST8000NM001A).
 
OK, then the drive(s) didn't drop because they were holding up the card. There are a lot of other reasons a drive could time out.
 
Slight progress: at the Shift+F10 command prompt I can in fact see the C: and D: drives that are carved out of the RAID 6 array, but when I try to get to the ReFS E: drive it just sits at the prompt, much like it gets stuck when trying diskpart. I'm pretty sure the ReFS volume is corrupt in some way, which is somehow preventing the C: drive and the machine from booting. Without a way to boot into a USB stick that allows deleting/fixing partitions, I see no way around the situation but to redo the entire RAID 6 array.
 
The LSI shouldn't do that; even if you have a corrupted non-boot partition, as long as the boot partition is valid it should boot. I have had circumstances where this was the case, and as long as boot was valid it would come up. The LSI can also have issues where at least one of the volumes was expanded after the initial creation, due to the way it deals with the virtual disks. Do you have a backup of the E: drive or the entire volume? As flaky as this has been, the safest approach would be to move the data off to a temporary volume, blow away the current raidset, and recreate it. If not, you may want to try something like RaidReconstructor or GetDataBack from Runtime to see if it can salvage the data from the volume. You might also put in a ticket with LSI support and see if they can offer any ideas; they know their cards much better than I do, and they may have come across this particular problem and have a solution. I would do all of this before wiping and recreating the array, losing anything you don't have a backup of or couldn't copy/salvage from the volumes. I wish I had more to offer.
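Before spending on third-party tools, it may also be worth pointing refsutil at that volume; it ships with recent Windows builds and can salvage files off a damaged ReFS volume onto another disk. I'm quoting the flags from memory, so check refsutil salvage /? first; roughly:

rem quick automated salvage pass: source volume, a working directory for scan state, and a target directory on a different, healthy volume (all three paths here are placeholders)
refsutil salvage -QA E: C:\refs_work F:\recovered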
 
The LSI shouldn't do that; even if you have a corrupted non-boot partition, as long as the boot partition is valid it should boot. I have had circumstances where this was the case, and as long as boot was valid it would come up. The LSI can also have issues where at least one of the volumes was expanded after the initial creation, due to the way it deals with the virtual disks. Do you have a backup of the E: drive or the entire volume? As flaky as this has been, the safest approach would be to move the data off to a temporary volume, blow away the current raidset, and recreate it. If not, you may want to try something like RaidReconstructor or GetDataBack from Runtime to see if it can salvage the data from the volume. You might also put in a ticket with LSI support and see if they can offer any ideas; they know their cards much better than I do, and they may have come across this particular problem and have a solution. I would do all of this before wiping and recreating the array, losing anything you don't have a backup of or couldn't copy/salvage from the volumes. I wish I had more to offer.
Those are some good ideas to try at least. On the backup: well, technically the data in the partition that won't load is the backed-up data (DPM data) from every server, so losing that sucks (40TB), but I can redo that over time if I get this working.

But a slight update: I installed Windows 2022 to a new partition on the motherboard controller and set the MegaRAID's option ROM to off so it didn't appear as a boot option; hence I was able to install Windows.

In Windows, straight away there is a device error on the Avago RAID controller: error Code 10, could not start. If I try to update the driver to the 2017 driver version (the only one I see), it blows the server up into a BSOD/crash/restart.
Bug check was 0x0000007E on the BSOD.
 
Sort of off-topic, but I'm curious. Why would you carve the OS drive out of the RAID, as opposed to just putting it on its own dedicated drive? Seems like doing that would make you extra vulnerable to a situation like this.

The fact that two drives failed at the same time leads me to suspect the problem could be the controller. Do you have a spare controller you can swap in place of the old one that could then import the foreign RAID config?

I used to have a field workstation that had two great big monster RAID arrays, each with its own dedicated LSI controller. I always kept a spare controller card in the parts box, alongside a few fresh drives, just in case one of the controllers failed. I tested it once, and I was able to import the foreign RAID config on the new controller and get the volume back up and running that way.
 
Sort of off-topic, but I'm curious. Why would you carve the OS drive out of the RAID, as opposed to just putting it on its own dedicated drive? Seems like doing that would make you extra vulnerable to a situation like this.

The fact that two drives failed at the same time leads me to suspect the problem could be the controller. Do you have a spare controller you can swap in place of the old one that could then import the foreign RAID config?
There are a number of reasons. First, you may want redundancy on the OS drive without having to put another pair of drives in the box (or you may not have the drive slots available). It could be for regulatory compliance (backing up the single volume to a WORM LTO tape, for example, is required in some legal, medical, and government/contracting roles). And if your array is external and your server eats itself, it is much easier to just plug a SAS cable into a new host and reboot, versus restoring from a backup of the OS, which might not be up-to-the-minute complete and, depending on your environment and backup schedule, may be an issue as well. I am sure there is more, but that is what I had off the top of my head.
 
Bug check was 0x0000007E on the BSOD.
That is SYSTEM_THREAD_EXCEPTION_NOT_HANDLED, and it's a driver issue 99% of the time. Again, as I said earlier in the thread, it could be a bad card, or more likely the base Windows Server 2022 image has an outdated driver. See if you can successfully update the driver and whether it loads.
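If you want to confirm exactly which driver threw the 0x7E, the minidump will usually name the module. A non-interactive pass with the Windows debugger looks roughly like this (the install path is the usual Windows SDK/WDK debugger location and the dump filename is just a placeholder):

rem dump the automatic crash analysis from a minidump and quit
"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe" -z C:\Windows\Minidump\071122-1234-01.dmp -c "!analyze -v; q"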
 
Sort of off-topic, but I'm curious. Why would you carve the OS drive out of the RAID, as opposed to just putting it on its own dedicated drive? Seems like doing that would make you extra vulnerable to a situation like this.

The fact that two drives failed at the same time leads me to suspect the problem could be the controller. Do you have a spare controller you can swap in place of the old one that could then import the foreign RAID config?

I used to have a field workstation that had two great big monster RAID arrays, each with its own dedicated LSI controller. I always kept a spare controller card in the parts box, alongside a few fresh drives, just in case one of the controllers failed. I tested it once, and I was able to import the foreign RAID config on the new controller and get the volume back up and running that way.
Well, originally I carved it out because of the read speed of RAID 6 across 8 drives versus a single 12 Gbps SAS drive for the OS. In hindsight, I think I'd rather just drop in a PCIe x4 adapter card and go NVMe for the OS going forward.

I didn't have a spare LSI 3108 lying around, though now I'm going to order one.
 
That is SYSTEM_THREAD_EXCEPTION_NOT_HANDLED, and it's a driver issue 99% of the time. Again, as I said earlier in the thread, it could be a bad card, or more likely the base Windows Server 2022 image has an outdated driver. See if you can successfully update the driver and whether it loads.
I tried to update the driver manually; that would sometimes hang, even when picking the 2017 driver (which is basically the same driver) that I had access to. The only version offered is from 2017 as well. The Windows Update option on the driver says the best driver is already installed, in that situation.
 
So, another update. I'm really not sure what the situation was exactly.

I moved the card from one slot to another, then moved the drives around in the chassis, and suddenly I could boot to the original carved-out 2019 OS. (I booted into safe mode first, then rebooted; it then said some updates had failed to install correctly and was reverting them.) But the E: drive, the ReFS volume holding the roughly 43TB of backup data, was corrupt and showing as RAW.

Is it possible that a ReFS volume would cause things not to boot? Of course, before all this, turning the option ROM off originally allowed me to boot to a new OS on a new motherboard-attached drive, and I saw the card was in error: Code 10, exclamation mark, "hardware error" it said, which was strange in itself.

Or do I possibly have a latent failure in the card? Either way, maybe ordering a backup card is in order, or replacing it regardless? Or could it be something odd with the motherboard or chassis that led to corruption...

Those are my two leading theories, as mentioned before: a latent failing card that is now working again, or ReFS corruption that led to the whole thing not booting (odd).
 
Well, you can't trust Windows Update telling you that the best option is installed, because it is often dead wrong (especially for drivers where the manufacturer actually updates them). With all the issues and what you have tried already... I would replace the HBA.
 
Well, you can't trust Windows Update telling you that the best option is installed, because it is often dead wrong (especially for drivers where the manufacturer actually updates them). With all the issues and what you have tried already... I would replace the HBA.
Yeah, I never trust them, but I tried all the options; it's strange that 2017 is the most recent update, though. Edit: I did find 6.714.18.0 dated 2018, so either way I'm updating to that, and maybe I'll try that 2022 drive one more time to see if it cures 2022 (otherwise 2022 isn't compatible).
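The plan is to push the 6.714.18.0 package in with pnputil rather than Device Manager, since the manual update keeps hanging. A rough sketch; the extracted-driver path is a placeholder for wherever the package unpacks:

rem see which megasas/Avago driver packages are currently staged
pnputil /enum-drivers
rem stage and install the newer INF
pnputil /add-driver C:\temp\6.714.18.0\*.inf /install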

Swapping in the same model of card might be the ticket, but perhaps there's a deeper issue with MS DPM or even Veeam, as...

I'm seeing things like this, though, related to ReFS going RAW:

https://www.reddit.com/r/Veeam/comments/fwqeyy/refs_corrupt_again/
and “It's a RAID controller issue: it doesn't process flush command correctly, which causes file system metadata loss with bad reset/BSOD/network glitch timing.”


So perhaps an underlying issue is that the 3108 isn't on the HCL for DPM (or for Veeam, once we move to it) as well.

I'm not sure if this event caused the non-boot situation; it feels like it was still hardware somehow too, or both.
 
Well, we have Veeam Backup & Replication and MS Data Protection Manager running on a lot of boxes with Areca and LSI-based controllers (or their branded enterprise OEM cards) and have had no issues we can trace to the HBA. A lot of things not on the QVL or HCL lists can still work perfectly, but depending on the vendor, they won't escalate your problem to engineering if every part of the chain isn't HCL'd. The best thing you can do to alleviate HBA problems is use an "enterprise-rated" motherboard: Supermicro, Tyan, etc. More than 50% of the problems I try to help with here are consumer or prosumer boards which have a bit of PCIe voodoo going on, everything from complete black screens to the card panicking every 129,600 seconds.
 
Well, we have Veeam Backup & Replication and MS Data Protection Manager running on a lot of boxes with Areca and LSI-based controllers (or their branded enterprise OEM cards) and have had no issues we can trace to the HBA. A lot of things not on the QVL or HCL lists can still work perfectly, but depending on the vendor, they won't escalate your problem to engineering if every part of the chain isn't HCL'd. The best thing you can do to alleviate HBA problems is use an "enterprise-rated" motherboard: Supermicro, Tyan, etc. More than 50% of the problems I try to help with here are consumer or prosumer boards which have a bit of PCIe voodoo going on, everything from complete black screens to the card panicking every 129,600 seconds.
Ah, OK, strange then. All our servers have Supermicro motherboards, the X10DAi-B in this case I think.
 
So I decided to reboot to the 2022 drive. Strangely, there's no exclamation mark in Device Manager at all; all drives are showing up.

It makes me really wonder what's going on here; I guess either random hardware failures, or that ReFS volume somehow barfing the driver while it was corrupted (I doubt it).


Then, after rebooting back into the old 2019 OS, it's doing record/index corruption repairs in a check disk. It did warn before rebooting that there were "errors" to be corrected; those could be from all the attempts at booting.
Now I don't know how much to trust that the OS is intact; maybe running sfc /scannow is in order, if that's reliable.
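If sfc comes back with complaints, the usual recommendation I've seen is to let DISM repair the component store first and then re-run the scan; roughly:

rem repair the component store (pulls known-good files from Windows Update), then re-verify
DISM /Online /Cleanup-Image /RestoreHealth
sfc /scannow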
 
Also, in the Windows manager for the controller card I found a ton of "informational" errors going back even 3 weeks; not sure where/why they weren't in the BIOS version of the card's log.

Many are about "unexpected sense" on a PD, "corrected medium error during recovery", PD 2, 4, etc.
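In case it's useful later, the same event history can be dumped from the command line with StorCLI, along with per-drive media/other error counters. A sketch, assuming controller 0 and that the standard LSI tool works on the Supermicro card:

rem dump the controller event log to a file
storcli64 /c0 show events file=C:\temp\c0_events.txt
rem per-slot state plus media error / other error / predictive failure counters
storcli64 /c0/eall/sall show all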
 
So I tried updating to the 2020 firmware, since its zip wasn't corrupt (the 2022 Supermicro LSI package was corrupt). I ran it, then rebooted (did it using the GUI). On reboot I got hit with a ton of errors.
Maybe these are normal after a firmware update? But then again, the ones that say "missing" don't look good at all. I assume "PD removed: 21" means slot 21?

Slots 23, 21, 20, 4, and 3 all seem to have issues? The slots, the drives, the card? Unsure.
Edit: actually I have slots 0, 1, 2, 3, 4, 5, 18, 19, 20, 21 and no others; 20 and 21 might be the only real issues here. (Full log below; see the StorCLI note after it.)



What a mess:

Consistency check started on VD

Image Details ---
BIOS Version: 6.36.00.3_4.19.08.00_0x06180203
Firmware Package Version: 24.21.0-0126
Firmware Version: 4.680.00-8519

Controller ID: 0 PD Reset: PD = -:-:21, Critical = 3, Path = 0x5000C50086859CA5
Event ID:268


Controller ID: 0 Fatal firmware error:
Line 315 in ../../raid/mon.c
Event ID:15


Controller ID: 0 Fatal firmware error:
Line 245 in ../../raid/1078int.c
Event ID:15

Controller ID: 0 Fatal firmware error:
Driver detected possible FW hang, halting FW.
Event ID:15

This one in particular is concerning: PD 21 removed?

Controller ID: 0 PD removed:
-:-:21
Event ID:112
Generated On: Mon Jul 11 15:13:20 EDT 2022



Controller ID: 0 PD Reset: PD = -:-:4, Critical = 3, Path = 0x5000C5009472E3C5
Event ID:268


Controller ID: 0 Patrol Read aborted on PD: -:-:21
Event ID:445
Generated On: Mon Jul 11 15:13:21 EDT 2022

Controller ID: 0 PD removed:
-:-:4
Event ID:112

Generated On: Mon Jul 11 15:13:40 EDT 2022
Controller ID: 0 VD is now DEGRADED: VD 0
Event ID:251

Generated On: Mon Jul 11 15:13:40 EDT 2022
Controller ID: 0 Patrol Read aborted on PD: -:-:4
Event ID:445

Generated On: Mon Jul 11 15:13:41 EDT 2022
Controller ID: 0 PD Reset: PD = -:-:4, Critical = 3, Path = 0x5000C5009472E3C5
Event ID:268

Controller ID: 0 PD removed:
-:-:3
Event ID:112
Generated On: Mon Jul 11 16:11:14 EDT 2022

Controller ID: 0 PD removed:
-:-:23
Event ID:112
Generated On: Mon Jul 11 16:28:51 EDT 2022


Controller ID: 0 PD missing
SasAddr=0x5000c50086859ca5, ArrayRef=0, RowIndex=0x3, EnclPd=0x00, Slot=20.
Event ID:257
Generated On: Sun Jun 20 19:40:18 EDT 1999


Controller ID: 0 PD missing
SasAddr=0x5000c5009472e3c5, ArrayRef=0, RowIndex=0x4, EnclPd=0x00, Slot=21.
Event ID:257
Generated On: Sun Jun 20 19:40:18 EDT 1999
Controller ID: 0 PDs missing from configuration at boot
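A note for later: the firmware's own terminal log and the current per-slot view can be pulled with StorCLI as well, which should help map these -:-:NN device IDs to physical slots. A sketch, again assuming controller 0:

rem firmware terminal log (firmware-level messages, useful after "Fatal firmware error" events)
storcli64 /c0 show termlog file=C:\temp\c0_termlog.txt
rem which enclosure/slot positions are populated right now, and their state
storcli64 /c0/eall/sall show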
 
Last edited:
On running sfc /scannow, it got to 67% and then said Windows Resource Protection could not perform the requested operation.
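The detail behind that failure is supposed to land in CBS.log; pulling out just the SFC entries is the usual next step, roughly:

rem extract only the SFC ([SR]) lines from CBS.log for review
findstr /c:"[SR]" %windir%\Logs\CBS\CBS.log > "%userprofile%\Desktop\sfcdetails.txt"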
 
If I were in your shoes I'd be contacting a company called DriveSavers. I've used them four times and they came through three of the four times. One was a RAID 5 set. You only pay if they can recover the data. It has cost me between $200 and $4,000 per recovery. The one drive they couldn't recover had a head crash, and the head had scraped all the magnetic material off the platter. For the $4,000 recovery they had to take the platters out in a clean room and move them into another HDD or machine.
 
At this point I'd be building a new server and using a software RAID solution rather than a card. Also, there are some basics missing here:

I'm not sure of the size of your company, but in IT we always kept the following:
All critical servers have a spare hot-swap PSU on the shelf, enough RAM to boot in an emergency, and at least 2 cold spare disks.

I also don't recommend moving multiple drives at once. Sometimes you can get away with it, but safe practice is 1 drive at a time.

You could have a bad RAID card, or a board with flaky/dying PCIe ports. I would suggest in the future keeping a "spare" of the same type, with no RAM/disks in it, something you can swap parts over to for testing.

Oh, also: RAID IS NOT A BACKUP!
 
I've read through what happened; just some notes from me:
1. When a RAID array is flaking out, do not just go yanking drives out of cages and putting them in others, or removing ALL the drives, even good ones. I've seen this tank entire arrays, because the RAID card decides it recognizes nothing and then you have nothing to work with. Always have spares on hand; remove 1 drive, replace it, see what happens, and work through it in that manner. Let rebuilds finish before moving on.

2. Your backup server should also be backed up to something else, somewhere else. ALWAYS.

A single backup server is not a backup server; it is a single point of failure, and RAID is not a backup. I presume this is for a company or your work? This experience is how you get them to invest in a better backup strategy.
 
Also, in the Windows manager for the controller card I found a ton of "informational" errors going back even 3 weeks; not sure where/why they weren't in the BIOS version of the card's log.

Many are about "unexpected sense" on a PD, "corrected medium error during recovery", PD 2, 4, etc.
So your card was reporting errors, but do you have any notifications set up to email you those errors? Do you have any health monitoring on that server to report when drives die? Any system made in the last 20 years has monitoring options, usually built into either the card or the software that installs into the OS alongside it.
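Even a dumb scheduled task that greps StorCLI output and drops an entry in the Windows event log (or calls your mailer of choice) is better than nothing. A rough sketch, assuming storcli64.exe is in the PATH, the array is on controller 0, and C:\raidmon exists:

@echo off
rem raidcheck.bat -- scheduled RAID health check (sketch; paths and controller ID are assumptions)
set LOG=C:\raidmon\vd_state.txt
storcli64 /c0/vall show > "%LOG%"

rem flag any virtual drive state that is not Optimal (Dgrd = degraded, Pdgd = partially degraded, OfLn = offline)
findstr /i "Dgrd Pdgd OfLn" "%LOG%" >nul
if %errorlevel%==0 (
    eventcreate /T ERROR /ID 900 /L APPLICATION /SO RAIDMon /D "MegaRAID virtual drive is not Optimal - check StorCLI output" >nul
    rem hook your email or paging tool of choice in here as well
)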
 