Hung work units - 60XX

may i be worthy

Project: 6068 (Run 0, Clone 182, Gen 117) hung for 6 hours overnight. I restarted the client and on it went, but it hung again and I had to restart twice more before it finished.

Next unit: Project: 6053 (Run 0, Clone 85, Gen 88) hung twice by 35%. I have shut down to run some Linpack and some memtest. I have never had any EUEs.

Am I looking at a machine problem or bad units? The fact that it has happened so much in a 7 hour period across 2 units makes me look first at my rig. We had some warmer weather, so 5 days ago I downclocked SR2#1 to 4.15 GHz and dropped vcore, to an overclock that I have tested as 2 hours Linpack stable, but not for my normal 8-12 hours. (hey, don't give me that look :rolleyes: we were in a race to the death - I did what it took :p )

This OC has been folding fulltime fine for 5 days until this morning. Your advice? :confused:
 
Same machine, I assume? Try turning NUMA back on and see if it does the same thing. If that fixes it, you don't want to know what your problem is... :/
 
My home machine just hung on a 6701 and a GPU WU... wtf? It's been stuck at 98% on both for about 3 hours now. I restarted and they both finished.
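
For anyone tired of finding a hung client in the morning, a rough stall check can be sketched in Python. This is only an illustration, not a feature of the folding client: it assumes the client keeps appending progress lines to a log file, and the file name and one-hour threshold are placeholders you would need to adjust (the threshold must be longer than your slowest frame time).

```python
import os
import time


def is_stalled(log_path, max_idle_seconds, now=None):
    """Return True if log_path has not been modified for more than max_idle_seconds."""
    if now is None:
        now = time.time()
    idle_seconds = now - os.path.getmtime(log_path)
    return idle_seconds > max_idle_seconds


if __name__ == "__main__":
    LOG_PATH = "FAHlog.txt"   # assumed log location; adjust for your install
    MAX_IDLE = 60 * 60        # hypothetical threshold: one hour without a log write

    if os.path.exists(LOG_PATH) and is_stalled(LOG_PATH, MAX_IDLE):
        print("Log has gone quiet; the client may be hung.")
```

Run from a scheduled task, this only tells you the log has gone quiet. Restarting the client is still up to you.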
 
Shoot, I didn't think of that.

Possibly because I would rather gouge out both my eyes with a rusty spoon than enable NUMA again. :eek:
 
Same thing happened to me last night. I woke up this morning and had to restart both the CPU and GPU clients. It happened 2 nights ago as well.

What is NUMA?
 
The NUMA recommendation is just for MIBW. It would only be available for dual processor boards.

One thing I have noticed is a couple of regular SMP units recently that have crashed previously stable machines. The last two 6701s my 970 got crashed it until I cranked the Vcore to the roof. Tobit also had a couple on his stock dually that caused issues. I do not know if this is related to the hanging problem. The only time I had one "hang" was a bigadv unit with NUMA disabled on an SR-2 and a bad memory module.
 
Followup - passed memtest, passed an hour of Linpack, went back to folding the end of the unit - and I got a STOP error #124!

My fault - I was running a barely tested overclock at 4.15 GHz. I interpolated the voltage needed between my well tested settings for 4.0 and 4.2 and used that, but I stupidly left the VTT on the lower setting. Interesting, though, that it is the hotter running SMPs that trap you compared to the cooler bigadvs.
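
To make the interpolation step concrete, here is a minimal Python sketch. The voltages below are made-up placeholders (the real values were never posted), and a straight-line guess between two tested points is only a starting point that still needs a proper stress test. The point is that every rail gets interpolated, vcore and VTT alike.

```python
def interpolate_voltage(f_lo, v_lo, f_hi, v_hi, f_target):
    """Linearly interpolate a voltage for f_target between two tested clock points."""
    if not f_lo < f_target < f_hi:
        raise ValueError("target clock must lie between the two tested clocks")
    fraction = (f_target - f_lo) / (f_hi - f_lo)
    return v_lo + fraction * (v_hi - v_lo)


# Hypothetical tested settings: (clock in GHz, vcore in V, VTT in V).
# These numbers are placeholders for illustration only.
tested_lo = (4.0, 1.300, 1.250)
tested_hi = (4.2, 1.350, 1.300)
target_ghz = 4.15

vcore = interpolate_voltage(tested_lo[0], tested_lo[1], tested_hi[0], tested_hi[1], target_ghz)
vtt = interpolate_voltage(tested_lo[0], tested_lo[2], tested_hi[0], tested_hi[2], target_ghz)

print(f"{target_ghz} GHz: vcore ~{vcore:.4f} V, VTT ~{vtt:.4f} V")
```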

So it looks like the hanging/infinite frame time issue was caused by accidentally leaving vcore/VTT too low.

Since today was cooler I simply went back to my tried and true 4.2 GHz and finished the unit fine. Now rendering 3D frames, will chase this up later.

@musky - they crashed your clients, or gave you STOP errors?

@tobit - I suggest a VIP lounge for bigadv - I will make a donation to Stanford in return for getting just sweet bigadv. ;)
 
They completely crashed my system - reboots, OS hangs, and BSODs. My 970 is on the ragged edge at 4.389 GHz. I'll be tweaking it a little more once I get a break in bigadv on it.
 
You know, for a moment the suggestion "drop your overclock" sprang to mind, but you will be proud to know that I killed that thought the moment it was formed.

(Learning the ways to be [H]ard)
 
Funny enough, I just experienced this earlier today. I haven't had a chance to reboot because I'm at work, but the SR-2 hung on a 60xx (6071) unit at 3.96 GHz at around 9:15 AM EST.

Back to 3.9 GHz for me. It's been stable and fast, and the .5% difference in folding performance isn't worth losing half a day's work.

3.9 GHz without using the Back2Back CAS setting was rock-solid stable for over a week.
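
The trade-off can be put in numbers with a quick Python sketch. It uses the figures from this post (0.5% extra performance, half a day lost to one hang) and assumes, as a simplification, that folding output scales directly with uptime.

```python
def breakeven_days(gain_fraction, days_lost_per_hang):
    """Days of hang-free running needed for the extra speed to repay one hang."""
    return days_lost_per_hang / gain_fraction


# Figures from the post: 0.5% performance gain, half a day lost per hang.
days = breakeven_days(gain_fraction=0.005, days_lost_per_hang=0.5)
print(f"Break-even: about {days:.0f} days without a hang")
```

Under those assumptions the faster clock has to run hang-free for roughly 100 days just to make up for a single lost half day.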

And no, I'm not turning on NUMA! :)

Good thing the Opteron 6168 duallie box is still crunching away. And no it does not have NUMA support enabled in BIOS either ;)

BTW APOLLO, you were right before. NUMA support did first appear on AMD dual and higher CPU boxes. Intel didn't fully implement it until Nehalem based servers came around.
 