Graphics Card Necromancy, Continued: Radeon R9 290X

RazorWind

Folks! Following up on my previous thread here, we've got a new pile of dead graphics cards on the bench, and we're going to attempt to get at least one of them working.

For those who prefer video, I made a video version.


The cards in this case are Radeon R9 290Xs - once again, we'll call them Card A, B and C.

IMG_4067.jpg

Card C has belonged to me for several years. I actually traded for it with a fellow [H]er, back when it was relatively current. It works perfectly fine, although as you can probably guess from the photo, fans are somehow a consumable for these things.

Cards A and B I bought on ebay for about $20 apiece. They're both "dead," although Card B did actually produce a picture when I tested it a couple of weeks ago. Subsequent tests result in no picture, though. All three cards are identical physically, but Card A has the reference BIOS, as opposed to the Sapphire overclocked version that B and C have.

Given the apparently intermittent nature of the problem with Card B, we're going to concentrate on Card A as our candidate for repair first. Card C will serve as a reference, since it's undamaged, and works properly.

I've already removed Card A's heatsink. Unsurprisingly, there's no sign of physical damage to the front of the PCB.

IMG_4070.jpg

But if we look at the back...


IMG_4760.jpg
What's this? A missing cap? Hmm...

I'd be sort of surprised if the missing cap is the reason our card doesn't work. Those tiny caps are generally there to help filter out noise in the power plane, but as someone commented in the GTX 690 thread, his card didn't work with one of the smaller ceramic SMD caps broken off, so some designs may be sensitive enough that most of the caps are actually critical.

Before we just replace that capacitor, though, we should do some additional testing, to make sure it's related to our apparent problem. First, we'll check resistance on each of our voltage rails to see if any are shorted or open. Remember, we're looking for between 1 and 1000 ohms. Anything in that range is probably OK.
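For anyone who'd rather see that rule of thumb written down, here's a minimal sketch of the check. The 1 to 1000 window is the one quoted above (assumed to be ohms). The rail names and readings in the example are made-up placeholders, not measurements from these cards.

```python
# Rule-of-thumb rail resistance check. The 1-1000 window comes from the
# post above (assumed to be ohms); everything else here is illustrative.
LOW_OHMS = 1.0
HIGH_OHMS = 1000.0


def classify_rail(ohms: float) -> str:
    """Return a rough verdict for a rail-to-ground resistance reading."""
    if ohms < LOW_OHMS:
        return "possible short"
    if ohms > HIGH_OHMS:
        return "possibly open"
    return "probably OK"


if __name__ == "__main__":
    # Hypothetical readings, NOT taken from the photos in this thread.
    example_readings = {"vcore": 2.4, "1.8v": 310.0, "mystery": 0.2}
    for rail, ohms in example_readings.items():
        print(f"{rail}: {ohms} ohms -> {classify_rail(ohms)}")
```

A verdict of "probably OK" only means the rail isn't obviously shorted or open. It says nothing about whether the regulator feeding it actually runs.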

Here's VCore. Looks OK.
resistance_vcore.jpg

VDDCI (AKA memory power). Also looks OK.
resistance_vddci.jpg

The Aux rail. Also looks sane.
resistance_aux.jpg

The 1.8 rail. I think this is related to the display ports... looks sane.
resistance_1_8.jpg

The .95 rail - I don't actually know what this does, but it's required for function. Also looks sane...
resistance_95.jpg

Unknown SOP-8 chip on the back of the card, pin 8, which is usually the phase pin on this type of regulator. This looks sane too, although I don't know for sure what this IC does.

resistance_unknown.jpg


Ok, that's all of our resistances. We didn't find any shorts, so that's good, and there's nothing with a huge resistance, which might indicate a totally open circuit. Now we need to power the card up and see which rails actually run.

VCore - this should be 1.0 - 1.2 volts. So we know this rail isn't working.
voltage_vcore.jpg

VDDCI - this should be about 1.5 volts, so we know that this rail isn't working either. Notably, this and VCore share a pretty complex controller.
voltage_vddci.jpg

.95 - Ok, this one is working.

voltage_95.jpg

1.8v Rail - This one is working too.
voltage_1_8.jpg

Aux - Not working. I have a feeling that this may be waiting for an enable signal from something else, maybe the memory rail. I think I mentioned in the GTX 690 thread that the output of one VRM is frequently wired up to the enable input on another, so that rails start up in a specific order.
voltage_aux.jpg

5V Rail - Also working. This is what powers the VRMs themselves. The controllers need power of their own, and in some cases it's also used for the gate drive of the MOSFETs.

voltage_5.jpg

Ok, so we've learned that something major isn't working at all. These symptoms lead me to suspect the problem lies in or around the control IC for our VCore VRM, which is shared with the VDDCI VRM. That's a pretty elaborate chip with 56 (!!!!) tiny pins. I think the next step is to look for anything simple, like power to it, or an enable signal that's missing, and for that, I need to consult the data sheet.
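The voltage pass can be sketched the same way. The VCore window and the nominal values for the other rails are the ones mentioned above. The +/-10% band around the nominal values is my own assumption, and the Aux rail is left out because no expected voltage is given for it. The readings in the example are hypothetical.

```python
# Expected rail voltages as (min, max) in volts. VCore's window is quoted in
# the post; the other windows are nominal values with an ASSUMED +/-10% band.
def window(nominal: float, tol: float = 0.10) -> tuple[float, float]:
    """Build a (min, max) window around a nominal voltage."""
    return (nominal * (1 - tol), nominal * (1 + tol))


EXPECTED = {
    "vcore": (1.0, 1.2),
    "vddci": window(1.5),
    "0.95v": window(0.95),
    "1.8v": window(1.8),
    "5v": window(5.0),
}


def dead_rails(measured: dict[str, float]) -> list[str]:
    """Return the rails whose measured voltage falls outside its window."""
    return [
        rail
        for rail, volts in measured.items()
        if not (EXPECTED[rail][0] <= volts <= EXPECTED[rail][1])
    ]


if __name__ == "__main__":
    # Hypothetical readings for a card with dead core and memory rails.
    readings = {"vcore": 0.0, "vddci": 0.0, "0.95v": 0.95, "1.8v": 1.8, "5v": 5.0}
    print("Rails out of range:", dead_rails(readings))
```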
 

Subscribed (again). I hope to learn more from your endeavor.
 
Was following your 690 thread and by coincidence, I just had a G92 card go dead short and blow up the mosfet in a PSU in a family computer.
 
You're the text version of Louis Rossmann

And here I was thinking he was the video version of me... :D

Also, I made a video of this one.

Was following your 690 thread and by coincidence, I just had a G92 card go dead short and blow up the mosfet in a PSU in a family computer.
G92 = 8800GT generation? You probably got your money's worth out of that.

A quick update:

Three dead rails, two controllers. One (ONSemi NCP5230) has a reasonably complete data sheet. The other (Infineon IR3567B) just has a pinout, with no explanation of what the pins do, I suspect because it's super complex and they actually program it at the factory for their customer's specific application, and it's designed to control two rails independently.

I did a lot of probing, and as best I can tell, I'm missing an enable signal for the memory rail. The trace has a test pad I can probe, but I think the other end of it is on the opposite side of the board, so I'm going to have to rig up a way to find it. I wanted to cover this in a video, so I didn't take any pictures. Without finding the other end of the trace, I can't be sure whether it's the memory waiting on the core/aux to power up or the other way around, but I don't think both of them are actually damaged. Another possibility is that the memory rail is starting up, but has a short I haven't found, and then aborts.
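To illustrate why one missing enable signal can take several rails down without all of them being damaged, here's a small sketch of a start-up chain. The wiring in the example (memory enables aux, aux enables core) is purely hypothetical. I haven't confirmed the actual order on this board, and the function name is made up for the illustration.

```python
# Model rails that wait on an enable signal from other rails.
# A rail starts only if it is not faulty and every rail it waits on is up.
def rails_that_start(enables: dict[str, list[str]], faulty: set[str]) -> set[str]:
    """Return the set of rails that come up, given the enable wiring."""
    up: set[str] = set()
    changed = True
    while changed:  # repeat until no further rail can start
        changed = False
        for rail, needs in enables.items():
            if rail in up or rail in faulty:
                continue
            if all(dep in up for dep in needs):
                up.add(rail)
                changed = True
    return up


if __name__ == "__main__":
    # HYPOTHETICAL enable chain, not traced from the real PCB.
    chain = {"5v": [], "memory": ["5v"], "aux": ["memory"], "vcore": ["aux"]}
    print(sorted(rails_that_start(chain, faulty={"memory"})))
    print(sorted(rails_that_start(chain, faulty={"aux"})))
```

With this made-up chain, a single fault on the memory rail leaves aux and core dead too, which is the same symptom as three separately broken rails. That's why finding the other end of the enable trace matters.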

Lastly, with great respect, I think Mr. Buildzoid may be incorrect in his 290X breakdown video. He claims that the memory and core VRMs are controlled by the IR3567B, and just kind of glosses over the Aux rail. This is untrue. In reality, it's the core and aux that are controlled by the IR3567B, and the memory is controlled by the NCP5230.

Oh yeah, also Card B just started working again, at least well enough to load Windows. Not sure what's up with that, but it's convenient for testing purposes.
 
Semi-bored at work, and browsing ebay, I found this:
https://www.ebay.com/itm/MSI-Lightn...739910?hash=item2acd954b86:g:rbQAAOSwdtBc-KJT

And this:
https://www.ebay.com/itm/Titan-Z-12...587991?hash=item287ed51e57:g:eCIAAOSwTPdc-vrH

I'm kind of tempted to buy that Titan Z...

Anyway, progress!


I tracked down the data sheets for our control and phase drive ICs, and tested the phase drive pins for each of the large power MOSFETs on the board. We need the data sheets because two of each MOSFET's three terminals are hidden under that huge drain terminal on the top, so we need to know which pins on the controller they connect to, and check there.

The VCore ones seem sane...
High side:
vcore_high.jpg

Low Side:
vcore_low.jpg

The memory rail has a different controller with integrated phase drivers. Its pins are SUPER tiny, so I'm not even sure I'm probing the right pins, but this all looks at least kind of sane. I don't see anything that's obviously a problem here, so I'm moving on. I'll come back to it if I can't find anything else wrong...
mem_high.jpg
mem_low.jpg


Finally, I looked at the Aux rail.

High side looks alright.
aux_hi.jpg

But the low side...
aux_low.jpg

That's, uh, not so good.


So, at this point, we know that some part of the low side gate drive on the Aux rail is basically shorted to ground. The problem could lie in either the phase drive IC or the MOSFETs themselves, but we can't tell which it is with both of them still on the board. So, let's remove the drive IC, since it's the easier of the two.

Flux on...
flux.jpg
Heat it up...
heat.jpg

And off it comes.
yank.jpg

Now, we test the resistance on the low side gate again.
better.jpg

That looks much better. We'll confirm our issue by testing the IC we removed.
yep.jpg

Same resistance value as when it was on the board. So, while the MOSFETs may also be hosed, we know this IC definitely is.
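The isolation logic boils down to comparing three readings: the node with the IC fitted, the node with the IC removed, and the IC by itself. Here's a rough sketch of that decision. The 10 ohm cutoff for "basically shorted" is an assumption of mine, and the function name is made up.

```python
# Locate a gate-drive short by comparing three resistance-to-ground readings.
SHORT_BELOW_OHMS = 10.0  # ASSUMED cutoff for "basically shorted"


def locate_gate_short(on_board: float, board_ic_removed: float, ic_alone: float) -> str:
    """Say which side of the gate-drive node still reads as shorted."""
    if on_board >= SHORT_BELOW_OHMS:
        return "no short found on this node"
    ic_bad = ic_alone < SHORT_BELOW_OHMS
    board_bad = board_ic_removed < SHORT_BELOW_OHMS
    if ic_bad and board_bad:
        return "driver IC and board side both read shorted"
    if ic_bad:
        return "driver IC is shorted (MOSFETs not proven good)"
    if board_bad:
        return "board side is shorted (suspect the MOSFETs)"
    return "short disappeared after rework, re-test"


if __name__ == "__main__":
    # Hypothetical readings shaped like the case in this post.
    print(locate_gate_short(on_board=0.5, board_ic_removed=4700.0, ic_alone=0.5))
```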

While I could cannibalize one of the working cards, I think I'll just source a new IC. I'll probably also get a couple of fresh MOSFETs just in case, too. So, we'll reconvene once we have our spares on hand and can solder them back on the board.
 
Aaaaand we're back. First, I'd like to apologize to those of you who were following along. I didn't mean to abandon this thread, but real life happened, and the card sat on my healing bench for two months, and I'd look mournfully at it every once in a while. Today, though, it's hot as balls outside, meaning it's a perfect day to get back after this.

I eventually got my hands on some replacement MOSFETs and a replacement phase drive IC, and swapped the old ones out with the new ones. Here it is with the flux residue still on it.
IMG_5126.jpg
IMG_5125.jpg

Unfortunately, the Aux rail, and thus the card, is still dead. At this point, I was pretty stumped. Clearly, there is something wrong with the circuit that creates this rail, but I've now replaced the three most complex and delicate components, and it's not fixed. Today, though, I took a closer look at one of the other 290Xs that I have, and I noticed that on my dead card, this little resistor here has no markings at all, which may not be indicative of damage, but struck me as odd, since they usually have at least some kind of marking. On my other cards, it's marked with a zero, which usually indicates that it's a zero ohm resistor.
suspect_resistor.jpg

A resistance check across it reveals an open circuit, though. I sanity checked this against one of my working cards, and sure enough, I get zero ohms on that card.

A little more probing reveals that it's connected to pins 2 and 4 on our dead phase driver IC. These pins accept the voltage that will be supplied to the high side and low side MOSFETs' gate drives, respectively. If no voltage is supplied here, the phase driver won't be able to turn the MOSFETs on at all, because the signal it sends to them is dead. Perhaps unsurprisingly, the other side of the dead resistor is connected to our 12V pin on the PCI-E edge connector. Per the data sheet for the CH8510/IR3537 phase drive IC, you're supposed to supply 12V to those pins.

So, we've found a burned out zero ohm resistor, which seems to be used as a fuse. Our next step is to replace that resistor with a good one.

Here's our patient, once again prepared for surgery.
IMG_5127.jpg

Bad resistor removed.
IMG_5129.png
IMG_5130.jpg

The pads cleaned up, and ready for the new resistor.
IMG_5131.jpg

New resistor installed. Note the "0" marking.
IMG_5132.jpg
 
Well, crap.

I changed out the failed resistor, checked resistances on the rail's various pins, and then, satisfied I hadn't obviously made it worse, plugged the card back into the test bench and switched it on. Sadly, the aux rail is STILL dead.

I triple checked that the proper voltage is being supplied at each of the phase driver's pins, and it is as far as I can tell. I've got 12V at VCC and both gate supplies, and ~11.5v at the bootstrap pin, which I think is appropriate if the phase isn't actually running. I then checked the numerous 0 ohm resistors on the board, and all of the minor rails one more time. None of the resistors are open, and all of the minor rails I tested in the original post are still working.

At this point, there are only so many components on the card that could cause this, with the number one suspect being our main voltage controller, the IR3567B.

As I mentioned before, this IC is super complex, and is also not very well documented. Furthermore, it's apparently programmed by the factory to the specs given by the customer. This is a problem in that even if I could remove it from the board and replace it (not a given, since it has a HUGE ground contact on the bottom), I would have to get one that's programmed for use on a reference 290X. I have two working boards that I could cannibalize, but that doesn't seem right, cannibalizing a working board to fix a dead one.

Anyway, to rule out something simple, I did at least check that the obvious required voltages are present at the right pins. AMD was kind enough to provide probe pads near this thing, so I at least didn't have to probe the QFN pins directly.

VCC for this IC is 3.3V. I've got 3.3 volts at the pad for pin 39 (blue), which is its main power, and at the enable pin (red). From what I can tell, if those are present, it should be attempting to run. I also checked the CFP pin (yellow), which is an output that goes high if the output of the VRM exceeds whatever its configured maximum voltage is. Zero volts there.

IMG_5135.jpg

And that's where I'm at. I suspect what may have happened is that one of the low side MOSFETs on the Aux rail shorted the gate pin to its drain. When this happened, it sent 12V back through the phase driver, and then back through its current sense output to our controller, which then lobotomized the controller.
 
Aww. That's too bad. I don't have your acumen for this stuff, but your MOSFET hypothesis sounds well-reasoned. Fare thee well, 290x.
 
Sad we won't get to see this thing resurrected but really impressed with how far you have made it. Way over my head and very interesting. Thanks for the thread!
 
Fare thee well, 290x.
Not so fast! You may remember from the initial post that we actually started with two "broken" cards, but we only looked at one.

I made another video that those who don't like reading my technobabble may prefer:


I was bored the other day, so I plugged in Card A to take another crack at it, and as I did, I heard one of those sickening, very faint, "pop" sounds that anyone familiar with repairing circuit boards is doubtless familiar with, and now I have a near-short on the 8 pin power connector. I suspect it's a capacitor somewhere, but given the other issues with that board, I don't think it's worth the hassle of trying to hunt it down. You may remember, though, that we also had a second 290X, Card B, which I mentioned previously seemed to work. I decided to have a closer look at that one, since it was sold to me as not-working, and a cursory look at it previously showed that it actually did kind of work, but that the heatsink would get REALLY hot after just a minute or two of idling.

The fact that it works means we don't need to do the steps to test resistance or voltage on its power rails. If any one of them were shorted or not working, the card wouldn't work at all. Nevertheless, there's clearly something wrong, as this card has a serious heatsink, and it's actually pretty efficient when it's not doing 3D tasks. It shouldn't run nearly as hot as it did, even with no fans on it.



One thing I noticed was that it was the core VRM area that would get hot. Now, a VRM gets hot because its switching transistors are conducting, but with resistance. The higher the resistance, the more energy they burn off as heat. If you read my thread about the GTX 690, you may recall that I had a similar problem with one of the MOSFETs on at least two different cards. I eventually figured out that this was because my power supply was unable to provide the full 12 volts under load. Because those MOSFETs expected straight 12V on their gate pins, they wouldn't quite switch on all the way, leading to unusually high resistance and an un-coolable amount of heat. In contrast, most VRM designs on modern graphics cards actually use a lower voltage - something like 5 or 7 volts - on the gate pins for most of the power transistors. This voltage is regulated down from the 12V supplied by the power supply, so that the main transistors aren't exposed to the power supply's drooping voltage. It's frequently also more efficient.
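To put rough numbers on "the higher the resistance, the more heat": conduction loss in a MOSFET is approximately the current squared times its on-resistance, scaled by the fraction of time it's conducting. The sketch below uses that formula only. It ignores switching losses, and the current and resistance figures are made-up examples, not data sheet values for the parts on this card.

```python
# Conduction loss in a switching transistor: P = I^2 * R_on * duty.
# Switching losses are ignored; this is only the resistive part.
def conduction_loss_watts(current_a: float, r_on_ohms: float, duty: float = 1.0) -> float:
    """Heat dissipated while conducting, averaged over the switching cycle."""
    return current_a ** 2 * r_on_ohms * duty


if __name__ == "__main__":
    # HYPOTHETICAL figures: 30 A through one phase, fully on vs. partly on.
    fully_on = conduction_loss_watts(30.0, 0.003)   # 3 milliohm, gate fully driven
    partly_on = conduction_loss_watts(30.0, 0.020)  # 20 milliohm, gate under-driven
    print(f"fully on:  {fully_on:.1f} W")
    print(f"partly on: {partly_on:.1f} W")
```

Because the loss scales linearly with on-resistance, a transistor that only partly turns on can dissipate several times more heat at the same current, which is the kind of thing a heatsink sized for normal operation won't keep up with.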

At first, I thought I had a similar problem, where the gate voltage for the MOSFETs was too low, causing them to fail to turn on all the way. Then, I noticed that it was only one spot on the board that was getting hot - this was a clue, because it meant that only one of the card's five Vcore phases was affected. I was puzzled by this until I turned the card over and looked more carefully at the back, and saw this:
IMG_4145.jpg

I know the image is super blurry, but what we're looking at there is a missing SMD cap right under the VRM phase that was overheating. This was a clue!

As it turns out, the three SMD components you see there are part of the IR3567B's per-phase current sensing circuit. This is used to make sure that each of the phases is providing roughly the same amount of current. With one of the caps missing, it throws off the calibration of this circuit, causing the affected phase to operate out of spec. Since we have Card A to get parts from, replacing the cap isn't a problem, but before I did, I wanted to be sure that the phase was actually operating differently from the other four. So, what I did was hook the card up to the test machine, and measure the duty cycle on the gate pins to each phase's MOSFETs. This is the fraction of time, as a percentage, that the gate pin is high, meaning it should be conducting.

This is the affected phase's low side gate:
low_side_bad.jpg

And its high side gate:
high_side_bad.jpg

And one of the undamaged low-side gates:
low_side_ok.jpg

And the corresponding high side gate:
high_side_ok.jpg

A few things to note here. Obviously, our damaged phase is spending more time with its low side MOSFET on than the others are, and less with its high side. Also of note is that the affected phase has at least one of its transistors on nearly 100% of the time, which is also not good. So, we've clearly got one phase out of whack, and an obviously missing component.
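For reference, here's a small sketch of what the duty cycle number means and roughly what to expect from a healthy phase. In an ideal buck converter the high-side duty cycle is about Vout / Vin, so assuming a 12V input to the phase and about 1.1V out, the high-side gate would sit somewhere around 9%, with the low side taking most of the rest. The 12V input and the sample data are assumptions for illustration, and real phases deviate from the ideal figure because of losses and dead time.

```python
# Duty cycle from sampled gate states, plus the ideal buck-converter estimate.
def duty_cycle_percent(samples: list[bool]) -> float:
    """Percentage of samples in which the gate was high."""
    if not samples:
        raise ValueError("need at least one sample")
    return 100.0 * sum(samples) / len(samples)


def ideal_high_side_duty_percent(v_out: float, v_in: float) -> float:
    """Ideal buck converter high-side duty cycle, ignoring all losses."""
    return 100.0 * v_out / v_in


if __name__ == "__main__":
    # Made-up capture: gate high for 1 sample out of every 10.
    capture = [True] + [False] * 9
    print(f"measured: {duty_cycle_percent(capture):.1f}%")
    # ASSUMED 12 V phase input and 1.1 V core voltage.
    print(f"ideal high side: {ideal_high_side_duty_percent(1.1, 12.0):.1f}%")
```

A phase whose duty cycles sit far from the other four is the one to look at.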

So, I cannibalized the corresponding cap from Card A, since it won't be needing it anymore, and soldered it on to Card B. This is a 250nF cap, for anyone needing to know.
IMG_4147.jpg

Next, I put the heatsink back on, plugged the card back in, and started the system up, and, lo and behold, we've now got even duty cycle across all five phases! Furthermore, the back of the card, which previously got very hot, is now cool enough to touch while the card is idling.

I also noticed, as I was cleaning the flux off the card, that two other caps were missing, so I cannibalized those from Card A as well. These are both related to the memory power circuit. The card seemed to work without them, but I figured I might as well. They're both on there kind of crooked because the card has so much copper in that area that it's hard to get it hot enough to get the solder to flow very well. It can be tricky to walk the line between overheating the PCB or damaging one of the big delicate ICs on the board and not getting the board hot enough to melt the solder. The area shown is behind the GPU die.
IMG_4152.jpg


IMG_4153.jpg

At this point, I was pretty sure I had fixed the problem that was crippling Card B, so I proceeded with putting it back together, using Card A's blower heatsink. Card B is actually a Sapphire Tri-X model, which is supposed to have a huge axial flow heatsink with a bunch of heat pipes, but I don't have the fan assembly for it, so I went with the blower. If anyone has a dead Tri-X card and wants to sell me the cooler, I'd be interested.

This is the crudest heatsink I've ever seen. The mating surface between it and the die is visibly wavy.
IMG_4150.jpg

I cut new thermal pads. I ended up making three sets before I finally got them thin enough to get the heatsink in contact with the die. With this Arctic brand pad material, the thickness you want is 0.5mm. The pad shown here is 1.5mm, which is WAAAAAAY too thick, although it's good for the MOSFETs. It squishes down onto them really well.
pads.jpg

At length, I had the card back together. I think the results speak for themselves. Previously, it would have overheated and shut down before I could even get Heaven running.
IMG_4160.jpg

It scored about 2600 in the Heaven benchmark, which I think is about right for a 290X. It may be a little low, but the CPU it's hooked up to is an FX-8350, which is slow. Now, it's on my to-do list to install the system in a case and play some actual games on it, before the card joins my repaired GTX 690 on the shelf of rest. :cool:
 

I refurbed about 10-15 of those back in the day when I was buying up half broken mining cards. The thermal pads you are using are way too thick and have to be putting pressure on the PCB and taking away mounting pressure on the GPU. I believe those pads are either .5mm or 1mm.. I wanna say .5


Major props on the soldering though!! That's my weakness.


EDIT: I missed one of your paragraphs. I stand corrected!
 
I think the original pads are supposed to be 1.0mm actually, but they're really squishy. The pad material I used is much harder, and 0.5mm seemed to be what actually fit. I considered just reusing the old ones, but they were gross when I got it, and then subsequently got dust and cat hair in them as they sat on my work bench for months. Bleh.
 
Very cool. I had a 290x that died due to a watercooling accident - a little coolant leaked onto the card while it was running and from then on out, it had a white vertical bar on the right side of the screen in text/BIOS mode, and would BSOD Windows on startup. :( Thinking it might be left over moisture, I tried baking it @ 175 for a few hours to no effect, so I sold it for parts to someone that wanted the blower cooler from it (which had only ever been used to test the card since it was under water when I actually used it).

I always suspected it might be a RAM chip that died on it, but who knows...
 
Baking it would only make it worse. You need to put it in an ultrasonic cleaner.
 
Cool you got the cards working again.

I was just wondering, I have my friend's old HD6950 that stopped working last year. One day it was working fine, next day it stopped working, PC just shut off and now causes the PC to not even turn on.
I tried it in 3 other machines and the same thing happens, hit the power switch and nothing happens. Pull the card out and the PCs turn on.
Have any idea what to look for?
 
The most likely cause is a short to ground on one of the 12V power rails. The reason the system won't power up is that it's tripping the over-current protection in the power supply. If you have a multimeter, check the resistance between the +12V pins at the 6 or 8 pin connector and the ground plane. What you should have is a few thousand ohms resistance. My guess is that you'll find that you have less than 100, and probably more like zero on at least one of the power connectors.

If you can determine that this is the case, the task then becomes finding and removing the short, which is much more involved, but not by any means impossible.
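As a rough sketch of that first check, here's how the readings from each power input could be sorted. The "under 100 ohms" figure is from the explanation above. Treating 1000 ohms as the floor for "a few thousand" is my own assumption, and the connector names and readings are hypothetical.

```python
# Scan resistance-to-ground readings taken at each 12V input on a card.
LIKELY_SHORT_BELOW = 100.0  # from the post: under 100 ohms is suspicious
HEALTHY_AT_LEAST = 1000.0   # ASSUMED floor for "a few thousand ohms"


def suspect_inputs(readings: dict[str, float]) -> dict[str, str]:
    """Return only the inputs that do not look healthy, with a verdict."""
    verdicts = {}
    for connector, ohms in readings.items():
        if ohms < LIKELY_SHORT_BELOW:
            verdicts[connector] = "likely short to ground"
        elif ohms < HEALTHY_AT_LEAST:
            verdicts[connector] = "inconclusive, compare with a good card"
    return verdicts


if __name__ == "__main__":
    # Hypothetical readings from a card that keeps the PC from powering on.
    print(suspect_inputs({"6-pin": 0.3, "8-pin": 3200.0}))
```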
 
Baking it would only make it worse. You need to put it in an ultrasonic cleaner.
Cleaning and reballing or replacing the affected memory chips is the real proper way to do it. The risk you run with an ultrasonic is that if there were any visual indication of which components were affected, you'd clean it off.

But, in the absence of a BGA machine, trying an ultrasonic, or even a dishwasher, in a pinch, is probably your best bet.
 
I've had luck cleaning a 7970 (I think) with rubbing alcohol that had a little water drip on it way back in the day. That has to be better than a dishwasher?
 
I've had luck cleaning a 7970 (I think) with rubbing alcohol that had a little water drip on it way back in the day. That has to be better than a dishwasher?

Without also brushing, like I showed in the video, rubbing alcohol is pretty meh at removing corrosion, which is probably what you need if you got water under one of the memory ICs while the system was running. I'd probably do both, though. Dishwasher first, dry the card thoroughly with compressed air, and then clean again with isopropanol.

In a lot of cases, though, cleaning the board isn't going to help, because the real problem is that the solder balls have corroded enough to no longer be making contact. In those cases, replacing the IC, or at least reballing it, is your only option.
 