Damaged VRM Capacitor on H8QGL?

plext0r

I was removing one of the stock aluminum VRM heatsinks and I think I damaged/destroyed a capacitor on the bottom of the motherboard. Can someone confirm?

All four CPUs are still folding at 3.4GHz with 1.1875V with no problems (about 590K PPD). Is this single VRM out of commission and putting extra load on the others, or is something else bound to happen over time? Thanks for any help.

 
A pic of the cap would help.

EDIT: your embed looks broken.
 
I don't have the cap. :) I just heard something hit the table where I was working, but I never found anything. Looks like my pliers slipped, damaged the corner of that component and popped the cap off, but I'm not sure. Maybe all I heard was the corner piece of plastic falling down and hitting the static bag. :confused:
 
Be happy it still POSTs; it's going to be hard to tell you what effect it will have. Not much you can do except run it as normal and see if it fails.

What I do to remove the stock heatsinks is cut them off at the top.
 
Just zooming in and looking at the traces, that appears to come off pin 16, which is VCC (the 12V input) of the phase IC. Looking at the example schematics in the IR datasheet it is difficult to say for sure, but I believe this is "just" a decoupling capacitor for that device.
If so, it should function OK. It could be a little noisy, but there are a lot of other decoupling caps on that same 12V rail to help out.
Looks like you were lucky since the corner of the chip is also missing.
I would definitely monitor vcore of that CPU for a few days to make sure there is no drift or anomalies.
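A minimal way to keep an eye on it -- just a sketch, assuming ipmitool is installed and the BMC exposes per-CPU "Vcore" sensors (names vary by board/firmware, so adjust the grep):

Code:
#!/bin/bash
# Log the BMC's Vcore readings every 5 minutes so any drift shows up
# when you look back over a few days of data.
LOG=/var/log/vcore-watch.log
while true; do
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ===" >> "$LOG"
    ipmitool sensor | grep -i vcore >> "$LOG"
    sleep 300
done

Grep the log afterwards and eyeball whether the damaged socket's reading wanders more than the others.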
 
Which CPU is Node 6/7 on the GL board? It was hotter than the rest before and after the VRM modification but it looks like I need to get some extra air in that vicinity.

Code:
Processor temperature slew rate:9.0°C

Temperature table:
Node 0	C0:37	C1:37	C2:37	C3:37	C4:37	C5:37	C6:37	C7:37
Node 1	C0:37	C1:37	C2:37	C3:37	C4:37	C5:37	C6:37	C7:37
Node 2	C0:36	C1:36	C2:36	C3:36	C4:36	C5:36	C6:36	C7:36
Node 3	C0:36	C1:36	C2:36	C3:36	C4:36	C5:36	C6:36	C7:36
Node 4	C0:48	C1:48	C2:48	C3:48	C4:48	C5:48	C6:48	C7:48
Node 5	C0:48	C1:48	C2:48	C3:48	C4:48	C5:48	C6:48	C7:48
Node 6	C0:51	C1:51	C2:51	C3:51	C4:51	C5:51	C6:51	C7:51
Node 7	C0:51	C1:51	C2:51	C3:51	C4:51	C5:51	C6:51	C7:51
 
Looks like a cap. Check other phases out; they all look pretty much alike.

E.g. note C/R/C sequences: CD251/RD157/CR277, CD252/RD158/CR273 and CR253/RD159/CR270.

It seems you are missing CD267.
 
Core32 said:
I believe this is "just" a decoupling capacitor for that device.
^^ +1 (CVCCL on app diagram)

Core32 said:
I would definitely monitor vcore of that CPU for a few days to make sure there is no drift or anomalies.
And that, too.

Pull fresh voltcheck.sh; it now displays min and max values in addition to average.
http://hardforum.com/showpost.php?p=1039265047&postcount=665

EDIT: and avoid using any sort of power-saving features on that board; especially, do not run with the ondemand scaling governor, and disable higher-order C-states if possible.
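For reference, a quick sketch of forcing the performance governor from userspace via the cpufreq sysfs interface (assuming the cpufreq driver is loaded at all; the C-states themselves are usually disabled in the BIOS or with the processor.max_cstate= kernel parameter):

Code:
# See which governors this kernel/driver offers
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Force every core to the performance governor (run as root)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done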

EDIT2: Node/CPU mapping on SM boards is straightforward: CPU1 == Nodes 0/1, CPU2 == Nodes 2/3 and so on..
 
Last edited:
You could RMA the board if you did not stick on any coolers. Just a thought, thinking more for the long haul.
 
Core32 said:
Looking at the example schematics in the IR datasheet it is difficult to say for sure, but I believe this is "just" a decoupling capacitor for that device.
{snip}

Where did you find that datasheet? :)
 
It's quite possible it's never been soldered (I'm not seeing any solder traces) and is tagged as DNF (do not fit)
in the schematics. Same goes for RD155.

Not 100% sure.
 
DNF is a no-stuff component for sure.
Places I have worked use DNS (Do Not Stuff) and DNP (Do Not Populate) for the same thing.
 
Yup.

There's no DN* printed on the board. I'm just suspecting they may be DN* as there is no trace of damage...
 
I've had such small caps ripped off CPUs before, even a whole array on an E8600. That chip worked, though vcore was a bit more unstable than it used to be.
 
Not surprising it was unstable. The closer the caps are physically to the actual chip pins, the more effective they are at reducing ripple and noise.
And being on the CPU itself is pretty darn close. ;)
 
I've been unable to take downtime on this host and run voltcheck, but I looked at the IPMI readings. What's interesting to me is that I have Vcore set to 1.1875 via TPC, but IPMI sensors show the following:

Code:
CPU1 Vcore	Normal	1.2 Volts
CPU2 Vcore	Normal	1.184 Volts
CPU3 Vcore	Normal	1.152 Volts
CPU4 Vcore	Normal	1.176 Volts

I believe CPU4 is the CPU which is running the hottest (node 6/7) and it's the farthest from the 8-pin power connectors. It's also the one where I dinged the motherboard from underneath. :( I was surprised to see CPU1 is running 1.2 Volts.

I also noticed the CPU temps as follows which makes me think the nodes and CPU numbers don't match up on the GL board when running 6200 CPUs (as mentioned in another thread):

Code:
CPU1 Temp	Normal	Low
CPU2 Temp	Normal	Medium
CPU3 Temp	Normal	Low
CPU4 Temp	Normal	Medium
 
Not all sockets are created equal. Sockets 1 and 4 have "best" regulation on the GL, followed
by sockets 2 and 3.

At load, GL, MC, 3000 MHz:
Code:
CPU1 (min/avg/max): 1240/1242.6/1246 mV (nominal: 1.2375, delta: -5.1 mV)
CPU2 (min/avg/max): 1204/1208.0/1214 mV (nominal: 1.2125, delta: 4.5 mV)
CPU3 (min/avg/max): 1204/1205.8/1210 mV (nominal: 1.2250, delta: 19.2 mV)
CPU4 (min/avg/max): 1200/1201.8/1204 mV (nominal: 1.2000, delta: -1.8 mV)

So, as you can see, sockets 1 and 4 are actually overshot a bit and socket 3
significantly underregulated.

This pattern (1-4-2-3) is common to all GL boards I have worked with.

That said, given that your nominal voltage is the same across all sockets, your order
is 1-2-4-3 -- something happened there, methinks. Try running voltcheck whenever you
can to get a more, er, holistic view :)
 
Hmm, just read your comment about 6200 + GL. I'd find it weird but...
you could run an experiment to identify the mapping (a logging sketch follows the steps below).

1. Start tpc -mtemp
2. Impair cooling of CPU1 (disconnect the fan)
3. Identify node pair that warms up..
4. Restore cooling of CPU1
5. Impair cooling of CPU2
6. Identify node pair that warms up
7. Restore cooling of CPU2
8. and so on...

(make sure that FAH doesn't crash in the process, too)
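Something like this could do the bookkeeping while you pull fans -- a rough sketch, assuming tpc -mtemp keeps printing the per-node temperature table shown earlier (adjust the warm-up delay to taste):

Code:
#!/bin/bash
# Capture tpc -mtemp output to a log while fans are disconnected one CPU
# at a time, and mark in the same log when each fan was pulled/restored
# so the warming node pair can be matched up afterwards.
LOG=./node-map.log
./tpc -mtemp >> "$LOG" 2>&1 &
TPC_PID=$!

for cpu in 1 2 3 4; do
    read -p "Disconnect the fan on CPU${cpu}, then press Enter... "
    echo "### CPU${cpu} fan disconnected at $(date '+%H:%M:%S')" >> "$LOG"
    sleep 120   # give the affected node pair time to warm up
    read -p "Reconnect the fan on CPU${cpu}, then press Enter... "
    echo "### CPU${cpu} fan reconnected at $(date '+%H:%M:%S')" >> "$LOG"
done

kill "$TPC_PID"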


GL socket layout:
Code:
GL:
---I/O---
[3]   [4]
  [1|2]
---------
 
Hmm, just read your comment about 6200 + GL. I'd find it weird but...
{snip}
GL socket layout:
Code:
GL:
---I/O---
[3]   [4]
  [1|2]
---------

I have confirmed the other post. My IL ES chips in the GL board map as follows:

Code:
CPU1: node0/1
CPU2: node4/5
CPU3: node2/3
CPU4: node6/7

I've also touched the surface of the copper heatsinks with a thermocouple attached to a Fluke meter and it says all of the heatsinks are between 75 and 85C except one hit 103C! It was one of the ones near CPU3.

I disabled IPMI via jumper, but I'm running CentOS 6 (kernel 2.6.32) so voltcheck cannot find the right path under /sys/bus/i2c. I see the following modules loaded for lm_sensors:

Code:
ipmi_si                41659  0
ipmi_msghandler        34994  1 ipmi_si
w83627ehf              24196  0
hwmon_vid               3132  1 w83627ehf

Can voltcheck be tweaked to work with this older kernel or should I try something more modern?

Also, I tweaked the BIOS to enable Fan Performance Mode and my temps have dropped significantly. Here's a snapshot:

Code:
Node 0  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33
Node 1  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33   C6:33   C7:33
Node 2  C0:30   C1:30   C2:30   C3:30   C4:30   C5:30   C6:30   C7:30
Node 3  C0:29   C1:29   C2:29   C3:29   C4:29   C5:29   C6:29   C7:29
Node 4  C0:39   C1:39   C2:39   C3:39   C4:39   C5:39   C6:39   C7:39
Node 5  C0:39   C1:39   C2:39   C3:39   C4:39   C5:39   C6:39   C7:39
Node 6  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37   C6:37   C7:37
Node 7  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37   C6:37   C7:37

Now it makes sense why Node 4/5 is hottest, since its HSF is getting warm air from CPU1.
 
Yo bro, do you have any fans blowing over those copper heatsinks? I would highly suggest you have some fans set up to move air.

I am not getting any over 90F per IR thermometer.

Here:
(photo: 2013-01-26 11.09.33.jpg)


Note the fans. Note fan placement on the GL board between 212's. You NEED air moving across the board... Disregard the dirty basement.

Temps from GL, under load.
Code:
Temperature table:
Node 0  C0:34   C1:34   C2:34   C3:34   C4:34   C5:34
Node 1  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33
Node 2  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37
Node 3  C0:37   C1:37   C2:37   C3:37   C4:37   C5:37
Node 4  C0:36   C1:36   C2:36   C3:36   C4:36   C5:36
Node 5  C0:35   C1:35   C2:35   C3:35   C4:35   C5:35
Node 6  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33
Node 7  C0:33   C1:33   C2:33   C3:33   C4:33   C5:33
 
It would be useful to know if he has it in a case or not. If it is in a case then something is wrong, as there should be air going through the case; if it is bare then he needs more fans.

My MOSFET heatsinks are cool to the touch on my Gi.

Temps from 627x OC'd to 3GHz @ 1.175v on my Gi under load.
Code:
Node 0	C0:30	C1:30	C2:30	C3:30	C4:30	C5:30	C6:30	C7:30	
Node 1	C0:30	C1:30	C2:30	C3:30	C4:30	C5:30	C6:30	C7:30	
Node 2	C0:30	C1:30	C2:30	C3:30	C4:30	C5:30	C6:30	C7:30	
Node 3	C0:29	C1:29	C2:29	C3:29	C4:29	C5:29	C6:29	C7:29

Also, nice setup, scotty.
 
Thanks for the confirmation, brilong. Now we need to have firedfly make some updates...

brilong said:
I disabled IPMI via jumper, but I'm running CentOS 6 (kernel 2.6.32) so voltcheck cannot find the right path under /sys/bus/i2c. I see the following modules loaded for lm_sensors:

Code:
ipmi_si                41659  0
ipmi_msghandler        34994  1 ipmi_si
w83627ehf              24196  0
hwmon_vid               3132  1 w83627ehf
That (ipmi_si loaded) actually tells us that the BMC has _not_ been disabled. Did you
completely unplug AC before switching the jumper?

brilong said:
Can voltcheck be tweaked to work with this older kernel or should I try something more modern?
You can do this: pull http://darkswarm.org/w83795.tar.gz, traverse the directory structure to find
the source, then run make; sudo make install, load the module with sudo modprobe w83795,
and run sudo ./voltcheck.sh ...
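In other words, something along these lines (the directory name inside the tarball is a guess -- cd into wherever the Makefile actually lives):

Code:
wget http://darkswarm.org/w83795.tar.gz
tar xzf w83795.tar.gz
cd w83795/            # hypothetical path -- find the directory containing the Makefile
make
sudo make install
sudo modprobe w83795  # load the Nuvoton W83795 hwmon driver
sudo ./voltcheck.sh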
 
When I took the copper heatsink temps, I touched the thermocouple on the surface of the heatsink and the two 230mm case fans were not spinning at the time. I had the Blackhawk Ultra case cover off and I was systematically disconnecting each CPU fan while tpc -mtemp was running.

I have one 230mm fan which I cut down and fit behind the 5.25" bays of the case so it's blowing straight at CPU1/2 (and hitting the copper heatsinks on the front).

I moved one 230mm fan into the top left of the case cover so it's blowing down on top of CPU3/4 and hitting the copper heatsinks from above. It's only running when the case cover is in place.

Moving this second 230mm fan on top of CPU3/4 helped their temperature dramatically. I cannot test the VRM heatsink temp with all fans installed and running.

I downloaded the w83795 source, compiled and installed it. I then ran voltcheck a few times while folding (note my run-tpc script set voltage to 1.175; I had been previously running for a week at 1.1875).

Code:
CPU1 (min/avg/max): 1188/1191.2/1194 mV
CPU2 (min/avg/max): 1172/1175.6/1184 mV
CPU3 (min/avg/max): 1150/1151.6/1154 mV
CPU4 (min/avg/max): 1176/1177.6/1178 mV

I've noticed that running at 1.175 is not folding as quickly with the same freq. Running tear's special "ps" command, I've noticed certain cores are only 90% utilized instead of 99% (total is 63xx% in top). As soon as I alter voltage to 1.1875, total goes to 6397% in top and my PPD is back where it was before I added the copper heatsinks and possibly damaged the motherboard.
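For anyone without tear's script, a plain procps ps invocation shows roughly the same thing -- per-thread CPU use plus the core each thread last ran on (this is not his exact command, just a stock substitute):

Code:
# -L lists threads; psr = core the thread last ran on, pcpu = % CPU
ps -eLo pid,tid,psr,pcpu,comm --sort=-pcpu | head -70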

Now that I have the copper heatsinks in place, I'd like to try ramping things up a little more, but I'm worried about the one heatsink (covering 3 or 4 VRMs) which is reaching 103C (without air).

EDIT: When I put the voltage back to 1.1875, here's voltcheck.sh output:
Code:
CPU1 (min/avg/max): 1196/1198.0/1200 mV
CPU2 (min/avg/max): 1178/1180.0/1184 mV
CPU3 (min/avg/max): 1154/1159.6/1170 mV
CPU4 (min/avg/max): 1186/1192.8/1196 mV
 