Graphics Card Autopsy - MSI 980 Ti "Golden Edition"

Hi RazorWind. I've got an EVGA GTX 980Ti with the same issues reported here: as soon as I plug in the 8-pin power connector, it keeps my system from powering on. I've troubleshot it to the 12V pins being shorted... likely a MOSFET issue. I've taken the card apart and there is no visible damage anywhere. If there was, I'd try to replace that component. However, with no visible damage anywhere, my skills are not good enough to troubleshoot further. Would you want me to send this one to you to take a look at as well? Next step is trash. It's been sitting here for a few months now and I'd like to fix it to put in my son's computer. If you can't fix it, oh well. I've read your terms and am fine with them. Given this one has no visible indicators of damage, I figure it may be a good one to show us how you troubleshoot it to figure out where the actual problem is.
Sounds good. No rush on it at all. Like I said, it’s been sitting here for months. I keep having this nagging feeling that I shouldn’t throw it away because it’s something minor, but I just can’t figure out what it is.

I’ll dig up your email from this thread and send it to you. Thanks for all you’ve posted here. It’s very informative!
 
Ok, an update.

I got Solan's card "working" again by removing the "dead" FET package. While I had the card running, I checked the duty cycle on each of the coils, and discovered that we actually have two dead phases. One is obviously the phase that's missing its FETs, and the other is the one it shares control ICs (the actual phase drivers, and the doubler) with. At this point, I'm not sure if the other phase is dead thanks to current balancing on the phase doubler or due to an actual failure.

So, at this point, I think we should ask Solan what he wants to do. I can put the card back together and return it to him with the remaining six phases working, or I can attempt to replace the dead DrMOS one more time. Every time I do this, I risk further damage to the card, but I'd be lying if I claimed I actually fixed it just by clearing the short. It'd probably work OK for a while longer, though.

So, Solan, if you'd prefer to discuss this privately, feel free to reach out, or just respond here.
 
Razorwind, do you have a recommendation or intuitive guess here? Would replacing the known bad FET along with the next FET and the phase doubler IC all in one move be a kind of gambit fix?

Otherwise, my gut inclination is that if the reference 980 Ti is a 6+2 phase design, I could probably get away with losing 2 phases from this 8+2 phase card, since I'd still keep it at stock clocks/power/temp. I could even clock it down closer to the 1000 MHz reference base clock if it came to it.
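
For a rough feel of what losing two phases means, here is a back-of-the-envelope sketch. It assumes the load is shared evenly across the remaining phases, which the shared doubler may not guarantee. The 250 W figure is the reference card's rated board power (not all of it goes to the core), and the 1.2 V load voltage is a guess, so treat the results as illustrative only.

```python
def per_phase_current(core_power_w, core_voltage_v, phases):
    """Average current each phase carries if the load is shared evenly."""
    total_current = core_power_w / core_voltage_v
    return total_current / phases

# Assumed numbers for illustration only: 250 W is rated board power,
# 1.2 V is a guessed core voltage under load.
for phases in (8, 6):
    amps = per_phase_current(250.0, 1.2, phases)
    print(f"{phases} phases: about {amps:.0f} A per phase")
```

Under those assumptions each surviving phase carries roughly a third more current with six phases than with eight, which is why staying at stock clocks or below seems like the sensible plan.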

Thanks again for all your time checking and working on the patient card!
 
I'd probably go for it. I suspect that, if it fails again, it will fail the same way, and I can just remove the dead components, but that's not guaranteed, so I wanted to at least offer you the opportunity to get the card back "working."

I haven't tried intense gaming, but you could totally use this to surf the web and do typical work stuff, if that's all you need it for. I can test it as-is, if you want.

One thing I noticed is that the BIOS on this card doesn't spin the fans up at all until the GPU hits 60 degrees Celsius. I suspect this has something to do with why this particular design seems to fail so often. The VRM seems to be somewhat fragile, and it's obviously not being cooled very well. I think I'd probably recommend a custom fan profile to owners of these cards that keeps the fans spinning even at idle.
 
Fair enough, let's go "big" then!

I was curious about whether intense testing would just burn the next components down the line, but it seems more productive to try replacing any suspect parts first. I'm driving displays with an old GTX 660 I dug out (though the GTX 560 is still kicking too). I'll ride it out for another generation or two if this 980 ti doesn't make it. I'd say they don't make them like they used to, but I think I've read somewhere that manufacturers are either underspec'ing components or pushing them too close to the limit on the highest end cards each generation, leading to failures that seem to be trending upward (at least anecdotally on enthusiast/owner forums), whereas lots of lower to mid range cards seem to keep chugging along.

I think that was supposed to be a selling point on at least a few Maxwell cards. I did run it with 30% instead of 0% minimum in a custom fan curve in the past 2 years, but perhaps it was left on the default profile for the first couple of years after my sister originally purchased it (I bought it from her in 2018 when she upgraded to a Pascal card). My wife has an EVGA SSC ACX 970 in her system that also defaults to "fanless" operation, so we also overrode it with a profile. They're so quiet at low speeds I'm surprised any companies wanted to pitch "fanless".
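
For anyone wanting to try the same thing, here is a rough sketch of what a fan curve with a nonzero floor looks like. Only the 30% floor comes from the post above. The other curve points and the function name are made up for illustration, and this just computes a duty cycle; it doesn't talk to any real fan-control tool.

```python
from bisect import bisect_right

# Hypothetical curve points: (GPU temperature in deg C, fan duty in %).
# Only the 30% floor comes from the thread; the rest are illustrative.
CURVE = [(30, 30), (60, 50), (75, 80), (85, 100)]

def fan_duty(temp_c, curve=CURVE):
    """Linearly interpolate fan duty, never dropping below the first point."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])   # floor: fans keep spinning at idle
    if temp_c >= temps[-1]:
        return float(curve[-1][1])  # ceiling: full speed past the last point
    i = bisect_right(temps, temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
```

The point of the floor is simply that the VRM gets some airflow even when the core is cool enough that the stock BIOS would leave the fans stopped.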
 
Oh, how did I miss this thread until now!

I've had a 980 ti burn in my hands before (looks exactly like your first post and also prevented the PC from booting) but it was successfully RMA'd*. The replacement failed too (but without the smoke and smell) but it was in someone else's hands by then and I have no idea what he's done with it (warranty expired). I actually know of many failed 980 tis over the last few years (and not talking of heavily overclocked cards), maybe nvidia did not engineer that board very well? But I don't know the actual failure rate mind you, just a gut feeling.

Still, the same does not appear to be happening with Pascal generation. Even just the reference 1080 ti seems to be built like a rock (buildzoid said so after all and he knows his stuff).

*the one very strange thing about that failure though is that it also killed the (disabled) onboard audio of my motherboard AND killed my USB powered soundcard (both under warranty thankfully). Nothing else broke and all of the other parts from that build are still in use to this day (although in different machines).
 
With great respect for Buildzoid, he's actually wrong fairly often. He seems really credible, but a lot of what he says is just him speculating, and I suspect if you called him out on this, he'd freely admit that. I've definitely seen a few of his videos where I subsequently got one of those cards in my hands and discovered that he guessed wrong about what certain components do. The 290X and 295X2 videos are what come to mind first.

The card in this thread is of a different design from the nvidia reference design (which is shared with the 780 Ti and contemporary Titans). The reference design appears to be a bit less prone to failure than these MSI cards are, in part, I suspect, because nvidia used better components than MSI did, and also didn't insist on totally stopping the fans. There are likely some other contributing factors when they actually fail, such as weak power supplies causing the cards to run excessively hot.

The 10 series cards are definitely better designed than earlier generations, with one of the drawbacks being that nvidia appears to have gone ham with the proprietary parts, similar to the way Apple has done this with some of their PCB components. I have a 20 series card that uses these oddball inductors where I had to figure out which factory over in China makes them and source a whole spool of them direct from the factory. I should make a thread about that card some time. It's an interesting case.
 
Ok, so I swapped another mosfet onto the card, along with the doubler and dual phase controller chip, and then tested it. The same thing happened, where the affected phase killed its FET package almost immediately, which triggered my power supply's short detection. So, I removed the dead FET package again, and tested the card, and it "works."

Also, I figured out that I can install the heatsink sideways, and still have access to all the VRM components while the card is running. Handy for testing.
hsf_sideways.jpg

So, at this point, I think this is about the best I can do with this card. I'll clean the flux off and reassemble it tomorrow, but it really does seem to work as it is.
 
Well “works” sure beats how it was 😀
Ah, good use of the square shaped mount holes. I think I’ve seen a repair video use a Morpheus II vertically the same way. But do the chokes get hot this way? If I recall, MSI used a thermal pad strip to make contact from the heatsink to the chokes on these cards.
 
The chokes don't usually require cooling at all. When a thermal pad is used, it's usually there mainly to support the heatsink, or keep it from rattling.

I got the card reassembled, plugged it back in, and gave it a test this afternoon. It seems to work, and even boosts up to 1530 or so, which I gather is this card's extra spicy boost clock.

I think the power measurement is a bit off, since it claims it's only pulling about 80% of its power envelope, but gives the perfcap reason as "power." Still, it survived a few minutes' worth of Heaven before I shut it down, and maxed out around 70C on the core with a very aggressive fan curve. I'll do a little more testing this evening, but it's basically ready to go back to you whenever you want it. I wish I could claim I fixed it 100%, but I have a feeling this may ultimately be a PCB issue, and not something I can fix by replacing components.
 
Interesting. I wonder if it scales reporting, lowering cap to 80% maybe reports 64%? I can test it out if you’re tired of it 😀

In any case, I really appreciate that you got it running again! Let me know when/where is convenient for you to meet.
 
Alright, did a little more testing. I wouldn't go overclocking it, but it seems fine otherwise. It survived a run of Firestrike, and ten minutes of Tomb Raider 2013 without issue. It was at that point that my ludicrously expensive Xbox Elite controller stopped working, so I gave up and shut it off.

The Firestrike score it gets is dragged down by my test bench's slow CPU, but it beat the R9 390 I tested last time by about 1000 points.
https://www.3dmark.com/3dm/48664766

I'll reach out to you later this week about where we can meet to hand it back to you.
 
Hey RazorWind, I was wondering if you could help me out. So here are some images for an EVGA 980Ti SC model. On the back you can see that Q502 is blown up (others shifted due to my heat gun). I attempted to remove the blown up MOSFET and ended up removing a bit of PCB with it. So now the two S and S leads will not make contact. How do I go about fixing the connection if at all?
20210211_211914.jpg
20210212_173910.jpg
I was also wondering if you could help me understand what happened. Basically, here is the story of how I got here. I was given the GPU and told that it would not start when the 8-pin PCIe connector was plugged into the GPU. I determined that there was a short in the V13 power phase, as shown in the image below. I first removed the phase at V13, and with it removed the GPU would at least boot and stay happy in Windows until I put a load on it. Once a load was applied, it would shut off the PC and reboot. V13 was replaced and the resistance was verified against the others and seemed all good; however, the shutdown at load still happened.
20210212_174026.jpg
The GPU still shutting down made me think I had gotten as much info as I could from resistance checks and needed to test the voltages. I slapped on an AIO and started probing points on the GPU with power going to it. I found the 12V rail properly supplying voltage right before the power phases. Testing the voltages at the inductors, the top one, near V98, had a voltage of .953v and the one below it 1.012v, while all the other 4 had the MSI Afterburner set voltage of .897v. At this point, I decided to remove V98, as it seemed the most likely to be faulty, since in my head the circuit most likely split from each phase to the two inductors. Basically V98 and V8 both distribute power to the topmost inductor. I removed V98 as in the image below and turned the PC on. At this point, the two inductors closest to V98 read 12v, which I thought wasn't great, but the 330 resistor on the other side still read .897v so I thought I was safe. I went from the BIOS into Windows and for about 30 seconds it seemed fine, then the screen went dark. I rebooted the PC and poof, that first image of Q502 being blown happened.

So why did removing the phase at V98 blow up Q502? How are the two related? I am assuming it has something to do with the 12v rail power delivery, but why did removing V98 finally kill it, and not when I had V13 completely off? It just doesn't make much sense since, like you, I thought I could at least run without a phase before things went completely catastrophic. Here was me with the V13 soldered on, but no luck at load.

20210212_174030.jpg
 
Ok, first things first, if you're working on this card with an actual heat gun, stop. You need a hot air rework station. The reference 980 Ti is notoriously fragile - you do not want to be applying heat to it indiscriminately. Using a heat gun for this is like trying to carve delicate bird sculptures with a chainsaw.

Next, when you're taking voltage measurements, it matters which side of the inductor you take them on. The function of the inductors is to turn the chopped up alternating 12V/0V output from the power stages into a smooth ~1.0V, so if you're taking your measurements on the side near the power stages, you need to be using AC mode. If you're taking your measurements on the output side, you should be using DC mode.
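
To put rough numbers on that: in an ideal buck phase, the output is about the input voltage times the duty cycle, so getting ~1.0 V from 12 V means the high-side FET is only on about 8% of the time. The sketch below assumes a perfect 0 V / 12 V square wave with no losses, and the function names are just for illustration. A real handheld meter may also not have the bandwidth to read the switching waveform accurately in AC mode, so take the AC figure as a ballpark.

```python
import math

def buck_duty(v_in, v_out):
    """Ideal buck converter duty cycle (no losses)."""
    return v_out / v_in

def square_wave_readings(v_high, duty):
    """DC average and AC (RMS about the mean) of an ideal 0 V / v_high square wave."""
    dc = v_high * duty
    ac_rms = v_high * math.sqrt(duty * (1 - duty))
    return dc, ac_rms

duty = buck_duty(12.0, 1.0)
dc, ac = square_wave_readings(12.0, duty)
print(f"duty {duty:.1%}, DC reading {dc:.2f} V, AC component {ac:.2f} V RMS")
```

The takeaway is that a DC reading on the power stage side averages out to roughly the same ~1 V you would see on the output side, so it can't tell you much about whether that phase is actually switching.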

It should be noted that this board is sort of an oddball design, with two banks of four power stages and three inductors. What that means is that you can't just remove a phase and expect it to work correctly, because the phases share components.

To be honest, I don't know for sure what the purpose of Q502 and Q501 is. I suspect it has to do with balancing current draw between the three 12V inputs, and they tend to fail when one of the phases is missing because it throws that balance out of whack. It may also be exacerbated by weak power supplies, since one of the main reasons that MOSFETs fail is that the gate voltage drops too low, and causes the on resistance to be too high.
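
As a quick illustration of why a higher on resistance is bad news, the heat in a FET that is switched on goes as the current squared times the on resistance. The current and resistance values below are made up for the example, not measured on this card or taken from a datasheet.

```python
def conduction_loss_w(current_a, rds_on_ohm):
    """Heat dissipated in a MOSFET that is fully on: P = I^2 * R_DS(on)."""
    return current_a ** 2 * rds_on_ohm

# Made-up example values, not measured on this card.
healthy = conduction_loss_w(20.0, 0.005)  # well-driven gate, 5 milliohm
starved = conduction_loss_w(20.0, 0.010)  # under-driven gate, 10 milliohm
print(f"{healthy:.1f} W vs {starved:.1f} W")
```

Doubling the on resistance doubles the heat at the same current, and a small package with marginal cooling doesn't have much headroom for that.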

Your best bet to get the card working again is to replace all eight FDMF6823As and the memory MOSFETs. The FETs that were used on these cards are very fragile, and if you got it hot enough to remove U13 and U98, you most likely damaged all the others in the process, if they weren't already degraded anyway. The proper way to replace them is to use a preheater to heat the board from the back and then use a hot air rework machine to get them just hot enough to melt the solder from above before removing and replacing them with tweezers. Use a good quality flux when you install the new ones, or you'll have all sorts of trouble with getting the QFN pads to solder right.

With respect to Q502, it wasn't pretty, but I was able to make a replacement trace out of copper tape on Solan's card. I would try that first, but I bet the card would work without it.

Note that you're missing the two 0402 resistors that go next to Q502. Also, the large black component marked 330 next to the inductor in your last picture is a capacitor, not a resistor. Its function is to help stabilize the voltage being supplied to the core so that it doesn't drop too low when the GPU suddenly starts doing work.
 
Thank you for the response. First, you are absolutely right, I was in DC voltage mode on the inductors, measuring closest to the power stages. The PSU is an RM750, and was able to shut off when the GPU went boom. Let me see if I fully understand the rest. 1) Get a hot air rework station, since the heat gun probably destroyed everything around 2) Replace all power stages 3) Add back the little 0402 resistors 4) Try to run the GPU without the Q502 MOSFET. Profit? Again, I got the card for free; at this point, it's a fun experiment.
 
It's not so much that the heat gun probably destroyed everything as that the heat gun isn't precise enough. The power ICs you need to replace are very fragile, and you need to be deliberate about how you heat them up and cool them down. If you read the datasheet, it gives a temperature over time profile that the factory would have used to program their SMD soldering machines. You need to stay pretty close to that profile to avoid destroying your new ones when you try to solder them on. Note that I also mentioned a PCB preheater. Experience trying to replace these particular ICs on this particular board design tells me that you really need a preheater, in addition to the hot air rework station. There's too much copper in that part of the board to use a hot air station alone.

Otherwise, yeah, pretty much. Replace the FETs in all ten phases (the memory ones should be easy), put the missing components back, and see if the card works without Q502. If you can replace the FETs successfully, I bet it does. If it doesn't, you may be able to fashion a replacement trace out of some copper tape, like I did with Solan's card. Be real careful about the QFN pins on the FDMF6823As. If you have too much solder on the center pad, it can be tricky to get the perimeter pins to solder securely.
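
On the temperature profile point, if you log temperature with a thermocouple while you work, something like the sketch below can flag where you heated or cooled too fast. The limits here are placeholders in the style of a generic lead-free reflow profile, not figures from the FDMF6823A datasheet, so swap in the real ones. The function name and sample format are also just assumptions for the example.

```python
# Placeholder limits in the style of a lead-free reflow profile.
# NOT taken from the FDMF6823A datasheet; substitute the real figures.
MAX_RAMP_UP_C_PER_S = 3.0
MAX_RAMP_DOWN_C_PER_S = 6.0
MAX_PEAK_C = 260.0

def check_profile(samples):
    """samples: list of (seconds, deg C) readings in time order.
    Returns a list of human-readable problems (empty if none found)."""
    problems = []
    peak = max(temp for _, temp in samples)
    if peak > MAX_PEAK_C:
        problems.append(f"peak {peak:.0f} C exceeds {MAX_PEAK_C:.0f} C")
    # Compare each reading with the next one to get the ramp rate.
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate > MAX_RAMP_UP_C_PER_S:
            problems.append(f"heating {rate:.1f} C/s at t={t1}s")
        elif rate < -MAX_RAMP_DOWN_C_PER_S:
            problems.append(f"cooling {-rate:.1f} C/s at t={t1}s")
    return problems
```

For example, `check_profile([(0, 25), (10, 125)])` would flag the 10 C/s jump, which is the kind of thing that happens when you go straight in with hot air and no preheater.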
 
I'd say they don't make them like they used to, but I think I've read somewhere that manufacturers are either underspec'ing components or pushing them too close to the limit on the highest end cards each generation, leading to failures that seem to be trending upward (at least anecdotally on enthusiast/owner forums), whereas lots of lower to mid range cards seem to keep chugging along.
#1 reason why I don't buy anything other than reference cards right here. If the reference card has a problem (like the RX 480 series drawing too much power from the PCIe connector) I just avoid it altogether. Too much variability on partner cards.

This is an amazing thread, glad it got necroed else I'd have missed it.
 
Actually, the 980 Ti reference board is probably a contender for "worst design" after these fancy MSI ones. Every single one will eventually have the power stage closest to the slot connector fail - particularly with the EVGA cooler, which kind of sucks, apparently. When that happens, it often destroys the PCB unless the power supply has a super fast OCP, so you can't even repair it. The good and bad designs are obviously different every generation, but I'd be willing to bet that the nvidia 30 series reference design with the tiny boards will be among the more failure-prone ones in four or five years, with everything so tightly packed on there.

AMD cards are designed in a totally different way (better, IMHO), but the reference design is usually the best because AMD tends to over-build the reference board, and leaves the cost optimization up to the board manufacturers. This was one of the reasons we saw those super cool International Rectifier DirectFETs on the 290/X and 295X2. They're quite expensive, but they're tough as nails, and the way they use the source terminal as a heatsink is pretty awesome. They aren't very efficient, and they'd be hard to design as a power stage, but I wish that design had evolved further. In contrast, there is no reference 390X as far as I know, but the most common version, the "Nitro," uses a significantly weaker and cheaper power stage IC, the name of which I can't remember now.
 
Well damn, if I would have known that I'd have opted for a 1070 FE instead. Good to know I'm sitting on a ticking time bomb. Upgrading will be a priority, but in this market, unless a kind soul is willing to part with their 1080 or 1080 Ti at a sane price here on the forums I'm afraid I'm stuck with what I got.
 
If it's still working for you, don't worry too much about it. You can probably help prevent the failure by making sure the fans always spin, even when it's idling, and obviously venting the case well.

I have a 2080 that defaults to zero fan RPM when it's idle, and the heatsink on that thing gets wicked hot, even in an open case. I'm not sure if your card does that too, but it wouldn't really shock me. I seem to recall that was about when everyone started trying to offer a "silent" BIOS as a selling point.
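To make the "fans always spin" idea concrete, here is a minimal Python sketch of a fan curve with a non-zero idle floor. Everything in it is an assumption for illustration: the function name, the 30% floor, and the 40 C and 80 C breakpoints are made up, and it does not talk to any real fan-control tool. You would enter equivalent points in whatever fan-curve utility you already use.

```python
def fan_duty_percent(temp_c, floor=30.0, ramp_from_c=40.0, full_at_c=80.0):
    """Map a GPU temperature to a fan duty cycle that never drops to zero.

    All default values are hypothetical; tune them for your own card.
    """
    if temp_c <= ramp_from_c:
        return floor                      # idle: keep the fans turning at the floor
    if temp_c >= full_at_c:
        return 100.0                      # hot: run the fans flat out
    # In between, ramp linearly from the floor up to 100%.
    span = (temp_c - ramp_from_c) / (full_at_c - ramp_from_c)
    return floor + span * (100.0 - floor)


for t in (25, 60, 90):
    print(t, "C ->", fan_duty_percent(t), "%")
```

The only point of the sketch is the first branch: a "silent" zero-RPM curve would return 0 there, and this one returns the floor instead.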
 
Will the MSI 970 heatsink work on the 980ti? I have a hybrid MSI 980ti and want to put the red and black heatsink on it, I found one for sale but can't find any info on if it will fit. Is it possible?

The model number is: MS-V323 V 1.0

Also if anyone has the 980ti heatsink from a broken card, would you be willing to sell it?
 
Hi RazorWind. What ohm measurement tool do you have?
How do you know which points on the GPU are relevant for measurement?
 
The multimeter you see me using in this thread and others is a Voltcraft VC 850.

I know which points to measure on the graphics card based on my knowledge of what each component does and how the card works, as a system. There is, unfortunately, not really a shortcut that's going to allow you to go from zero knowledge of repairing electronics to fixing a dead graphics card and getting it back to mining or gaming or whatever. You need to learn how circuits work, and then what circuits a graphics card has, and how they're related to each other.
 
The MSI 970 and 980 Ti (Gaming and Lightning) seem to use the same layout Twin Frozr cooler, and they are based on the same non-reference design. The hybrid MSI 980 Ti, if you mean the "Sea Hawk", uses a reference design, so I don't think the Twin Frozr will fit the rest of the board correctly.

Also, a small update: my 980 Ti actually went kaput while idle earlier this month (despite the fans running at 50% minimum all the time). I got some good use out of it last year after RazorWind's repair: I got through all of SOTTR and even ran some cycles of Deep Image Prior on the CUDA cores.
 
Damn, sucks it died again. Time to sell on ebay for parts!
 
Gah, that's a bummer! Is there any chance I might be able to get it back from you, so we could follow up in the thread, and see what failed?
 
Sure, I recently moved far from Austin but would be happy to mail it to you for knowledge and thread's sake if there's interest. I would describe the current state as similar to this past post: https://hardforum.com/threads/graph...golden-edition.1993063/page-2#post-1044651534 where you said a bad phase kept destroying a new MOSFET and the PSU short detection kicks in as soon as power-on is attempted. I tried the card on two different known-good motherboards and PSUs. Perhaps yet another power phase or two went bad.
 
Hi, I have worked in electronics professionally all my life. I just watched your video, and your conclusion about why this damage happened motivated me to add my comment / reply.

Modern NVIDIA cards (GTX 1060 included) come with power limiter control, or better said a power limiter circuit, which is additional electrical protection.
A PSU with an insufficient 12V rail will cause the CPU and the entire PC to blue-screen, and that is indication enough that such a PSU should be repaired or replaced.

I recently swapped, just for testing, a reputable 780W PSU from 2008 (which I refurbished myself last year) for a Corsair CX750, which has a dedicated DC/DC converter on the 12V rail (better stability).
The PC works identically well, except that the Corsair CX750 from 2013 uses less energy from the mains (wall plug), with lower standby and load power in watts.

As better advice, I think consumers should never buy a PSU from a brand with zero reputation.

MSI is well known as an air-cooling innovator on what it considers top tier, no matter the GPU model.
They also did excellent work on the cooling for a rare version of the GTX 1060 OC 6GB (iGAMER), and they received an award for it.
https://www.ittsb.eu/forum/index.php?topic=1640.0

Thanks for the footage.
 
Hi RazorWind, thank you for such detailed instructions. I am following this thread and your other EVGA GTX 980 Ti repair video, as I have both types of broken card. I just recently got my first hot air rework gun and am having the hardest time desoldering/removing the MOSFETs on the cards. If the solder is not flowing, should I continue to bump the temperature up? I tried from 350 to 400C and the chip just won't budge.
 
Did you also comment on the YT video?
 
Decided to work on my friend's MSI card again tonight. Fixed the back side after receiving the parts. However, I still have a short. Also, measuring the resistance from the output side of the cap to ground, I got 0.1 ohm. Not sure if that means the GPU chip itself is fried.
 
After reading the thread so far I feel glad that I went for an AIO-cooled MSI 980 Ti, quality PSU (Seasonic 660p), in a case with plenty of airflow.

I do have a hot air rework station, but swapping MOSFETs in those annoying SMD packages is not my idea of 'fun' :)

Also RIP for the 980 Ti that got hit with 12V on the core. I'm surprised it didn't acquire any sudden 'vent holes' from that event.
 
It did... I was attempting to fix it last night... lol
 

Attachment: S__4710402.jpg
Also RIP for the 980 Ti that got hit with 12V on the core. I'm surprised it didn't acquire any sudden 'vent holes' from that event.
They usually survive that, actually. Assuming there's no damage to the PCB, you can usually remove the failed high side FET and the card will work again.
Decided to work on my friend's MSI card again tonight. Fixed the back side after receiving the parts. However, I still have a short. Also, measuring the resistance from the output side of the cap to ground, I got 0.1 ohm. Not sure if that means the GPU chip itself is fried.
Definitely a short - which cap are you referring to?
 
The cap to the left side of the GPU MOSFETs. Basically what you check in your first video.
 
Man, that's a bummer. It could be the core that's shorted, or it could be something else. Voltage injection will be your friend there. 1.0 volts.
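As a rough sanity check on those numbers, here is a minimal Python sketch of the Ohm's-law arithmetic behind injecting 1.0 V into a rail that measures 0.1 ohm to ground. The function name and the 5 A current limit are illustrative assumptions, not values from this thread, and a real 0.1 ohm reading includes probe and lead resistance, so treat the result as an order-of-magnitude estimate only.

```python
def injection_estimate(volts, short_ohms, current_limit_amps=None):
    """Estimate current and heat when injecting a voltage into a shorted rail.

    current_limit_amps is a hypothetical bench-supply limit; None means unlimited.
    """
    if short_ohms <= 0:
        raise ValueError("short_ohms must be positive")
    amps = volts / short_ohms                  # Ohm's law: I = V / R
    if current_limit_amps is not None:
        amps = min(amps, current_limit_amps)   # supply falls back to constant current
    watts = amps * amps * short_ohms           # heat in the short: P = I^2 * R
    return amps, watts


# Values from the posts above: 1.0 V injected, 0.1 ohm measured to ground.
amps, watts = injection_estimate(1.0, 0.1)
print(f"unlimited: {amps:.1f} A, {watts:.1f} W")

# Assumed 5 A limit on the bench supply (illustrative only).
amps, watts = injection_estimate(1.0, 0.1, current_limit_amps=5.0)
print(f"limited:   {amps:.1f} A, {watts:.1f} W")
```

Either way, a few watts concentrated in one small part is usually enough to find it by touch, with alcohol, or with a thermal camera, which is the whole point of the technique.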
 
Hi guys!

Wow, I didn't expect so many people having the same trouble with 980 cards, especially the "Ti" models and related ones - I'm glad to have stumbled upon this thread!

A colleague of mine had a 980 Ti (Founders Edition = reference design) lying around, which broke the other day when the circuit breaker in his apartment tripped. After he switched it back on, he could smell electronic smoke coming from his PC, and he identified the source as the GPU. Since he then bought himself an RTX 3090 for cheap (before the ongoing crypto hype), he doesn't need the old 980 Ti anymore and gave it to me as a present. Disclaimer: trying to get this card back to life is just a nice way for me to relax and learn something new. I don't need the card itself, and I don't plan to sell it or make a profit.

After a quick inspection I found the exact same issue that happened to AtypicalComputers' card (a couple of answers above) and Solan's card. The AON7403 FET (Q502) blew up, while the other identical one (Q501) seems to be fine (visually):
c2a43e07-3074-4398-af80-794444a857a5.jpg


Since I own a hot-air station for small hobby electronics projects and rework, I went ahead and carefully removed the faulty FET. Unfortunately, it seems the short that caused the FET to blow up got so hot that the first layer of the PCB warped, and it peeled off while I was removing the FET. To be on the safe side, I removed the good AON7403 as well to eliminate any possible leftover short-causing parts:
ed4ca7cb-74f8-457e-adce-7703ef7a89fe.jpg


I haven't tested the card yet after removing the FETs, but I'm considering it, since I noticed RazorWind mentioned the card may run without those two MOSFETs as well.

The second AON7403 (the one that looks good visually) measures 2.21 MOhm on three pins, while the fourth reads 0.L.

The measurements of the 12V rails somehow don't make sense to me. On the 6-pin PCI-E rail there still seems to be a short to ground:
9a211285-2ef9-41f1-b286-cb16c40a608e.jpg


Any idea on how to proceed? I'm tempted to hook up a lab power supply to the faulty 6-pin rail and apply some isopropyl alcohol on the FETs to find out if the short is caused by any of those on the front.

Thank you very much in advance!!!

Kind regards,
Chris

EDIT: correct wording/spelling
 
Don’t power it on! The 3 ohms you’re seeing is actually the GPU core. I’m on my phone now, but I can elaborate later.
 