Help with 4p opteron shutting down

terrabyte89 (Weaksauce, joined Nov 6, 2004, 120 messages)
Hi everyone! Hoping I might be able to get some help from the knowledge vault that is horde :D.

I have a 4P Opteron build here that I bought second hand a little while ago. It runs a Supermicro H8QGi+-F board and four Opteron 6170 chips. The board had one CPU socket with a snapped pin (a memory channel pin). I tried repairing it but didn't really get anywhere.

Anywho, I'm finding the whole system powers down after around 18 hours running at full load (BOINC). Sometimes it'll last 24 hours, other times it will do it in less than 12. So instinctively I pulled the CPU from the borked socket, but it still did it running 3P :/. I knocked it down to 2P and it took nearly 48 hours, but it still powered down.

Checked the BIOS; there are no alarms set or anything like that. IPMI is disabled. She passes memtest on all sticks of RAM, and the Windows logs show nothing except an unexpected shutdown when I power it back on. Board logs are also empty. I'm just about out of ideas!

I've tried a 1200W SilverStone Strider and also a 1200W Antec Quattro PSU, both of which should easily power this. I was running the OCNG5 BIOS but have since flashed back to factory, and it did it on both.

I can't really check the temps unfortunately because all it shows is Low/Medium/high instead of the actual temp :confused:. The only temperature value I can see is the system temp which hovers around 30c. This I also find odd as my other H8QGI showed actual temperature values!

I've just booted up Linux with BOINC to see if it does it in that too (I have been running Server 2008 R2 until now).

Any help with this is much appreciated!! Thanks :)
 
My first thought goes to proper cooling. Can you describe how you are currently cooling the board? I would recommend blowing air over the entire board. Both top and bottom.
 
Hi Gilthanis, I'm currently running 4xCM 212's for CPU cooling, the board was kept open air (not in a case) in an air conditioned office, so I assumed that was enough. But you could definitely be right, maybe it needs more airflow across the board. I will add a couple of 120mm fans on the bench blowing across it and report back! Thanks for your help :)
 
Yeah.. the mosfets and other components can get really hot. Especially if OC'd. I would raise the board off the table with some kind of standoffs and blow the air past both portions.
 
You've ruled out PSUs, I'd think. My next guess would be motherboard component(s) overheating. Have you observed the same problem if the system is just idling?
 
Also a great idea Linden thanks. I haven't actually tested it idling.

I have added some 120mm fans blowing cross ways over the board. Not sure if it's quite enough airflow. Will test like this, then I'll try the idling. And if it doesn't do it I might try and find a desk fan to blow over the entire board.

Thanks for the ideas guys, just frustrating testing when it's sometimes 24+ hours each time testing something new!

edit: Should also mention, chips are @ stock speed, and tested in linux and it does the same thing, so definitely hardware related and not OS.
 
frustrating testing when it's sometimes 24+ hours each time testing something new!
I don't envy you, that. :(

The computer's configuration: Do you have it sitting flat? If it's a naked system, do you have the motherboard elevated from the resting surface? Airflow is critical.
 
I'm guessing you are running with all the memory slots populated. Do you know which memory slot is associated with the broken CPU pin? If you do, try removing that stick of memory. Also, I would recommend you flash back to OCNG and use the tools that come along with it. There are several tools in there that are well worth it when it comes to diagnosing problems, and you will be able to see your temps again :)
 
Hi Grandpa, I actually pulled that CPU and all memory and it still did it, so I don't think it's an issue with that socket.

I lifted the board off the desk with some CD cases and added 3x 120mm fans blowing around the board. Still no dice :(. Shut down overnight. Next step I'll try Linden's suggestion and just leave it idling and hope it doesn't shut down. If that's the case I might try a pedestal fan blowing across the whole thing!

Thanks for the help everyone!

edit: Also does the OCNG bios turn on detailed CPU temps? I thought when I was running it, it still showed the system temp instead of individual CPU's?
 
Once you install it and run the tools, sudo tpc -temp will give you the CPU temps, sudo tpc -dram will give you the memory speed, size and timings of the individual sticks, sudo clockspeed will give you your current CPU clock speed, and sudo voltcheck will give you your CPU voltage.
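Since the shutdowns take anywhere from 12 to 48 hours to show up, it may help to log the output of one of these tools on a timer so the last reading before the power-off is still on disk afterwards. Below is a rough Python sketch of that idea, not something that ships with OCNG. The function name log_command_output, the log file name, and the 60 second interval are all made up for illustration, and the sketch does not assume anything about what tpc prints; it just stores the raw text.

```python
import os
import subprocess
import time
from datetime import datetime


def log_command_output(cmd, log_path, interval_s=60.0, iterations=None):
    """Run cmd repeatedly and append its timestamped output to log_path.

    iterations=None loops until interrupted; a number limits the sample count.
    Returns the number of samples written.
    """
    count = 0
    while iterations is None or count < iterations:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            output = (result.stdout + result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool missing or hung: record the error instead of stopping the logger.
            output = f"ERROR: {exc}"
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(log_path, "a", encoding="utf-8") as fh:
            # One line per sample; multi-line tool output is flattened with " | ".
            fh.write(f"{stamp} {' | '.join(output.splitlines())}\n")
            fh.flush()
            os.fsync(fh.fileno())  # force to disk so the last sample survives a power cut
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)
    return count


# Example call, using the command from the post above (runs until interrupted):
# log_command_output(["sudo", "tpc", "-temp"], "tpc_temp.log", interval_s=60.0)
```

After an unexpected shutdown, the last few lines of the log show what the tool reported shortly before the power cut, which should help tell a thermal trip from something else.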
 
Board's dead :(.

Won't boot at all now. It powers up for half a second and switches off with a CPU installed :(. Tried both PSUs, same thing. Well, I guess I'm on the lookout for a new Opteron board :(
 
Umm there is a good chance SM will repair it for a reasonable price, they replaced a cpu socket on one of my H8QGi+-f for $50 so you may want to give them a call.
 
Thanks for the suggestion grandpa, but unfortunately now that the whole board is dead I have no idea what else would need replacing as well as the CPU socket. Also the shipping costs would probably kill the value :(

New board arrived yesterday! Another H8QGi+-f. Currently load testing it! Fingers crossed this one is fine. I'm running on the assumption the last board had a fault though.

Hopefully will get back to some regular DC production now :D

The only odd thing is I'm still not getting proper CPU temps. Even with the OCNG5 BIOS loaded it's still showing around 30C per CPU, which definitely isn't right when it's at 100% load! I think it's just using ambient temp sensors on the board or something.
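For cross-checking those readings under Linux, one option is to read whatever temperature sensors the kernel exposes through the hwmon sysfs tree. This is only a sketch under assumptions: it assumes a hwmon driver for the CPUs (k10temp is the usual one for AMD parts of this era) is loaded on your install, which nobody in this thread has confirmed for this board, and the function name read_hwmon_temps is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {(chip_label, sensor_file): degrees_C} for each readable hwmon temperature."""
    temps = {}
    for chip in sorted(Path(base).glob("hwmon*")):
        try:
            chip_name = (chip / "name").read_text().strip()
        except OSError:
            chip_name = "unknown"
        # Include the hwmonN directory so several chips with the same driver stay distinct.
        label = f"{chip.name}:{chip_name}"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[(label, sensor.name)] = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor file present but unreadable; skip it
    return temps
```

Printing the returned dictionary at idle and again at full load shows whether any sensor actually moves with load. If every value stays flat near 30C, that would fit the ambient-sensor theory.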

edit: Oh yeah I still have these to go on at some point hehe :p
 
The screws that came with my water blocks do not fit the Supermicro board. I searched high and low for compatible screws to no avail, and ended up finding some obscure old screws for a ghetto solution.
 
Sorry for the late reply! That's disappointing if that's the case with mine too. But luckily I have all the studs from the current 212 heatsinks that I can reuse if need be :)
 
Board's dead :(.

Won't boot at all now. It powers up for half a second and switches off with a CPU installed :(. Tried both PSUs, same thing. Well, I guess I'm on the lookout for a new Opteron board :(

Hi, any chance of an RMA?
I've never had a duallie, so imagine how I'd like to have a quad.
The best I had were 7x dual 1366 servers (in a datacenter) with a 1 Gbit shared connection (each) for a project :p

@Grandpa_01 your signature is awesome.
 