FX-8320 proc suddenly 10°C hotter, duplicate system not

BloodyIron

Hi Folks,

I'm hoping to see a seriously [H]ARD answer here, because I'm running out of ideas.

I'm talking about two identical systems that are used for hosting VMs, which are based around the following parts:

  1. AMD's FX-8320 (stock HSF)
  2. Asus Motherboard model M5A78L-M/USB3

I'm monitoring the health of these two systems so I get alerted when problems arise (like this one). Suddenly I'm finding that one of the nodes is running 10°C hotter all the time (and under heavy load the temps spike up to 70°C). I have dusted it out several times, replaced the thermal compound between the HSF and the CPU several times, and I know for a fact the fan on the HSF is spinning because I get RPM readings from it.

The load between the two systems is effectively equal; I balance it as evenly as I can between them. Yet after a full day of "burning in", this node is still 10°C hotter than my other one.
  • Good one "idle" temp : 49°C , max temp during backups : ~55°C
  • Sad one "idle" temp : 59°C , max temp during backups : ~68°C

I am really at a loss as to what is going on here. Can anyone advise? :/
 
Cases? Ambient temperature at each case's air inlet? What are the exhaust outlet temperatures? I'd rule out ambient temperature differences to begin with.

Next would be a detailed analysis of each system's wattage. I'd expect the +10°C node to show a higher wattage pull (the point is to verify whether the 10°C increase comes from increased power consumption rather than poor thermal transfer, even though you've already re-done the thermal interface).

Then it's a matter of locating that increased power consumption. Is the OS kernel loading both systems identically? Is one system only loading half the CPU, handling thread management differently, etc.?

It's where I'd start, at least :)
 
what coolers are you using? the idle temps are quite high for both. 55 is fine, 68 is pretty hot; they start throttling around 70. have you tried swapping HSFs between units? what TIM are you using? also, those systems are pretty old so I would recommend testing/replacing the CMOS batteries.
 
He said stock HSF. Are the fan RPMs the same, and are the ambient or motherboard temps the same?
 
oops missed that in the brackets. yeah I'd say check the RPMs too. maybe just set them to full blast for a bit and compare speeds.
 
Gat dang you guys are awesome! See this is why I post here, because this forum is rad. Okay, so here's some more info:
  1. Both systems have identical cases and PSUs
  2. Both systems have Proxmox VE (Linux KVM) on them, with latest updates
  3. Both of them have 32GB of RAM
  4. Both of them were having similar temperature patterns until about one week ago (checked graphed records just now), then Node 2 started going hotter for some reason
  5. Nothing hardware wise has changed in the last bunch of months within these systems
  6. There is plenty of gap between the two for air flow (this also hasn't changed in over a year or so)
  7. Both have their fan set to "standard" (NOT silent) in the BIOS so the fan should speed up if it gets hot, and slow down when it gets cool
  8. When I say "idle" I mean they have about 5-10% steady CPU usage from various VMs that are in their own right idling
  9. I have replaced the thermal compound on both systems within the last few days

Okay, so some current stats about each system (as of this writing):

Node 1 (Good guy):
  1. CPU temp : 49°C
  2. CPU Fan Speed : 3391 RPM
  3. Vcore : 1.356 V
  4. +3.3v : 3.284 V
  5. +5v : 5.04 V
  6. +12v : 11.874 V
Node 2 (Sad guy):
  1. CPU temp : 58°C
  2. CPU Fan Speed : 5443 RPM
  3. Vcore : 1.284 V
  4. +3.3v : 3.324 V
  5. +5v : 5.1 V
  6. +12v : 12.028 V

I'm not sure what other useful info to state just yet.
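
For anyone wanting to grab the same set of readings from both nodes in one pass, here is a minimal sketch that walks the Linux hwmon sysfs tree (Proxmox VE is Linux underneath). It assumes the board's sensor driver is loaded and exposes the usual `tempN_input`, `fanN_input` and `inN_input` files. Which channel corresponds to "CPU temp" or "Vcore" depends on the sensor chip and driver, so this just prints whatever the kernel reports. The function name `read_hwmon` is made up for illustration.

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return {"hwmonN:chipname": {channel: value}} from the hwmon sysfs tree."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        name = name_file.read_text().strip() if name_file.exists() else "unknown"
        values = {}
        for f in sorted(chip.glob("*_input")):
            try:
                raw = int(f.read_text().strip())
            except (OSError, ValueError):
                continue  # skip channels that cannot be read or parsed
            channel = f.name[: -len("_input")]
            if channel.startswith(("temp", "in")):
                # sysfs reports temperatures in millidegrees C, voltages in millivolts
                values[channel] = raw / 1000.0
            else:
                # fan speeds are in RPM; any other channel type is left as the raw integer
                values[channel] = raw
        readings[f"{chip.name}:{name}"] = values
    return readings


if __name__ == "__main__":
    for chip, values in read_hwmon().items():
        for channel, value in values.items():
            print(f"{chip:24s} {channel:8s} {value}")
```

Running it on each node at the same moment and diffing the two outputs gives a like-for-like comparison, including motherboard and other temperature channels if the driver exposes them.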
 

5400rpm? Does it look and sound like it's actually turning that fast? Are these the coolers with heat pipes?
 
yup I would suspect a failed heat pipe too. swap the HSFs and see if you get the same issues on the opposite systems.
 
It's possible the HSF might have failed. The challenge is that I have to turn all the VMs off and that's... less than ideal right now since there is an impact from doing that. Clearly I need to do it anyways, but I'm going to have to ponder how I'm going to do that while mitigating outages. Obviously both systems have to be _OFF_ for that to happen (swapping HSFs). Not a scenario I had planned for.

Any other thoughts for me to explore while I try to plan that out?
 
if you can, hook up another fan to blow at the side of the HSF to help cool it until you can shut down. I'd pick up another HSF just in case that's the issue. that way you can install it asap and be back up and running.
 
Lots of things you can do if you need both up. You can downclock the bad system and wait for low load times to take the one down.
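
One way to downclock without a reboot on a Linux host is to cap the cpufreq ceiling through sysfs. The sketch below assumes a cpufreq driver is active and exposes `scaling_max_freq` for each core, and that it runs as root. The helper name `cap_cpu_frequency` and the 2,000,000 kHz figure in the usage comment are illustrative only; pick a value the driver actually supports (some drivers list them in `scaling_available_frequencies`).

```python
from pathlib import Path


def cap_cpu_frequency(max_khz, root="/sys/devices/system/cpu"):
    """Write max_khz (in kHz) to every core's scaling_max_freq; return the cores touched."""
    capped = []
    for f in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
        f.write_text(f"{max_khz}\n")
        capped.append(f.parent.parent.name)  # e.g. "cpu0"
    return capped


# Usage (as root), with a hypothetical 2.0 GHz ceiling:
#   cap_cpu_frequency(2_000_000)
```

The cap does not persist across reboots, so it suits a temporary measure until the hot node can be taken down.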
 
Node 2 is exhibiting higher temperatures with a lower VCore. Sure sign of less than ideal thermal transfer. Concur with possible heat pipe failure :)
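
To put a rough number on that reasoning: dynamic CPU power scales approximately with C·V²·f, so at the same clock and load a lower Vcore should mean less heat, not more. The sketch below applies that rule of thumb to the two Vcore readings posted above. It assumes both nodes run the same clocks and load, ignores leakage, and treats the board's Vcore snapshots as accurate, none of which is guaranteed.

```python
def dynamic_power_ratio(v_a, v_b):
    """Approximate dynamic power at voltage v_a relative to v_b (same clock and load)."""
    return (v_a / v_b) ** 2


# Vcore readings from the thread: Node 2 = 1.284 V, Node 1 = 1.356 V
ratio = dynamic_power_ratio(1.284, 1.356)
print(f"Node 2 / Node 1 dynamic power estimate: {ratio:.2f}")  # prints 0.90
```

So on paper Node 2's CPU should be dissipating roughly 10% less, which is why a hotter reading points toward how heat is leaving the system rather than how much the CPU is making.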
 
I have one in a large case running all the firewalls and intrusion stuff on Hyper-V VMs. I gave up on the stock HSF in exactly 3 days and stuck an AIO on it that I picked up for 60 bucks. It's in its fourth year now. I keep meaning to move this service onto one of the Dell 620s I'm taking out of service, but it just works and replicates so well that I keep pushing it to the back burner. It was only supposed to be a proof of concept project and here it is four years later....

It's not unusual to have a damaged cooler or have a leak in a heat pipe render it ineffective. AMD will replace it for you under warranty if you want to stick with stock. Since you have two, swapping them between systems and testing is a reasonable way to validate this.
 
Have you verified temps with an external probe, maybe failing thermal sensor?
 
I'm in the process of acquiring a replacement stock HSF. While it seems very evident a heat pipe has failed here, I've found the stock HSFs to be very good otherwise. They have been in 24/7 operations now for 3+ years. It's also worth noting the case they are in does not generally allow for a larger HSF. But rest assured, the case has plenty of air flow despite that, as the CPU HSF has intake right above it.

Thanks for your input folks, I think I would have been stuck without your thoughts here. I haven't really seen a heatpipe fail before!
 
I don't believe a failing thermal sensor is the case. The case itself is very warm to the touch, so I am more inclined to believe inefficient heat transfer is happening, hence the speculation of a failing heatpipe.


 
I didn't read all the replies, but what I would do is swap the processors and see if the higher temp goes with the CPU.
 
Replaced the HSF with another stock one. The stock one is brand new, bought it from a guy who just got it and is using an after-market instead.

Except the temps are still high :(

But it might just be the thermal compound settling too, so I'll let it go for a few more days.

I'm worried this might be a silicon failure in the CPU or mobo themselves, I hope this improves after the thermal compound settles :(

Any other ideas peeps?
 
I would swap the CPUs and see if the temp stays the same, to see if there's a temp sensor issue.
 
The case feels really hot though. It might be the CPU itself though.

I'm gonna give it a few days for the thermal to settle, if it keeps being too hot I'll do the CPU swap and see the results.

 
I would swap PSUs. Those voltages are different across the board. Maybe something is rotten in Denmark.
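
As a quick sanity check on those rail readings, the ATX spec allows ±5% on the +3.3 V, +5 V and +12 V rails. The sketch below computes each posted reading's deviation from nominal. Bear in mind that motherboard sensor readings are not lab-grade, so this is only a rough yardstick; the helper names are made up for illustration.

```python
ATX_TOLERANCE = 0.05  # +/-5% on the positive rails

# (nominal, measured) pairs taken from the readings posted in the thread
nodes = {
    "Node 1": {"+3.3V": (3.3, 3.284), "+5V": (5.0, 5.04), "+12V": (12.0, 11.874)},
    "Node 2": {"+3.3V": (3.3, 3.324), "+5V": (5.0, 5.1), "+12V": (12.0, 12.028)},
}


def rail_deviation(nominal, measured):
    """Fractional deviation of a measured rail from its nominal voltage."""
    return (measured - nominal) / nominal


def out_of_spec(rails, tolerance=ATX_TOLERANCE):
    """Return the labels of rails whose deviation exceeds the tolerance."""
    return [
        label
        for label, (nominal, measured) in rails.items()
        if abs(rail_deviation(nominal, measured)) > tolerance
    ]


for node, rails in nodes.items():
    for label, (nominal, measured) in rails.items():
        print(f"{node} {label}: {measured} V ({rail_deviation(nominal, measured):+.1%})")
    print(f"{node} out of spec: {out_of_spec(rails)}")
```

By this yardstick every posted rail sits within tolerance on both nodes, so the node-to-node difference alone does not prove a PSU fault. Swapping the PSUs remains the more direct test.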
 
That certainly is something I can try! If that were the case that wouldn't be too bad of a fix IMO. It also seems kinda likely too. I'll try that after a few days of "burn-in", but maybe you're onto something here ;)

Also, I have a PSU tester that shows voltage levels, so at some point I should try that too after the burn-in, see what it spits out.

 
If it isn't the PSU then probably the motherboard is beginning to flake out. CPUs don't usually go bad unless you have them overclocked to the wall.
 
Ya man, this is so weird. No OCing on these CPUs ever! This shouldn't be happening :S

 
Well, looks like the fan in the PSU seized. Sure would have been nice if I noticed that earlier, lol! Go me.

Looks like the HSF was working just fine, same with thermal, just needed to replace the PSU.

So far so good, if this doesn't solve it I'll probably pop back in and say something.
 
I didn't see bad voltages, not sure what you're talking about :rolleyes:

lol either way it seems problem solved! No more email notification spam of overheating during backups, lol!

Overheating power supply outputting bad voltages is all about a 'voltage issue'. :p:p
 