Graphics Card too Cheap to be true

SebP3R272

n00b
Joined
Sep 12, 2021
Messages
10
I saw the Graphics Card Necromancy - EVGA 980 Ti Classified thread and figured this is the right place to ask for inspiration (but as a forum newbie, please (please?) correct me if not). I was surfing the internet and found an EVGA GTX 980 Ti for $50. The seller was up front that it was definitely not working, and since I was looking for something to do during lockdown, I figured there would be less time-consuming ways to spend my money, so I bought it.

LED and Fan on, runs for 3 seconds or if driver not installed.jpg


I received the card and inspected it for obvious damage. Finding no telltale scorch marks or bruises (other than a burn on the PCIe 12 V rail), I took the chance of trying the card in my Razer Core eGPU enclosure to see what success I might have (and later in my regular desktop as a differential diagnostic, with identical results). To my delight, the card's LEDs lit up, the fan spun at a regular idle, and it was detected by my operating systems (two Windows 10 installations and Ubuntu Linux). Once the drivers are downloaded and installed, Windows throws an error 43 (cannot start device) and the LEDs switch off. Because this is in my Razer Core, I can connect and disconnect the Thunderbolt cable, and I get the same result every time across multiple versions of the Nvidia driver.

BUT, interestingly, when I plug the Thunderbolt connection in, the card powers on, and there is a 3 second window where there is a brief "blip" of output on the HDMI port (enough to wake the display) before the card errors out again.

after 3 seconds and the driver is installed error 43.jpg


GPUz tells an interesting story. The card is detected, but the core frequency and memory are unknown.

after installation stop code 43.gif


If I uninstall the driver (so that Windows doesn't stop the device and falls back to its generic basic display adapter), the LED stays on and GPUz can see a core frequency, but still no memory.

without driver.gif


So the logical conclusion is - no power to the memory - right?

Now YouTube is full of videos explaining the dozens of cards that have had inductors and MOSFETs fail on the memory power delivery side, so I opened up the card and inspected these components more closely.
Still no visible damage. A multimeter probe shows 12 volts before the MOSFETs and 1.5 volts after the inductors, and 1.5 volts across the capacitors interspersed between the memory chips.

20210912_201757.jpg


I figured that there must be a bad connection or dust somewhere I cannot see. Having had success before resurrecting Xbox 360s with a red ring of death by reflowing the solder, I cleaned the board with isopropyl alcohol and (once dry) retried the board. No change.

20210912_201813.jpg


I baked the board at 150 degrees Celsius for 30 minutes, but still no change. I'm looking for suggestions on what to try next. It seems like there is a damaged connection somewhere that is telling the board that there is a problem, and that's why Windows is throwing the error 43, but other than systematically replacing every component on the board (working from the 12 V PCIe input) I can't see the answer....

Only damage 12V PCIE Pins.jpg


Thoughts?
 

First, 150 degrees Celsius isn't hot enough to reflow the solder on this card. All you did by "baking" it was (probably) make the problem worse, whatever it is. In order to reflow the solder, you would need to get the solder up to at least 225 degrees celsius. Do not attempt that - it won't help you.

What resistance do you have to ground on those scorched PCI-E connector pins? Do not use short detection ("beep") mode. Use actual ohms mode, and report actual ohms, please.
 
Seconded on the baking, never do it. Picking random times and random temperatures from the quack electronics destroyers on the internet will just make your situation worse. Not to mention, you're contaminating your oven and house with lovely chemicals you should not be ingesting or inhaling, as well as offgassing plastics. You should probably clean your oven a few times before you use it to make food again.

And JFC, 30 minutes? Yeah those electrolytic/poly caps are done like dinner. The highest rated generally available electrolytic/poly capacitors are 125C, but those could be as low as 85C depending on what EVGA specced. Industrial reflow ovens are on the order of 4-5 minutes.

Coat in butter and microwave it.

Pfft, everyone knows Margarine works better. Thems partially hydrogenated soybean oils are key to making the bbqification process work faster so the electrons get to where they need to go and make your card back up to 100%

/stroke
 
First, 150 degrees Celsius isn't hot enough to reflow the solder on this card. All you did by "baking" it was (probably) make the problem worse, whatever it is. In order to reflow the solder, you would need to get the solder up to at least 225 degrees celsius. Do not attempt that - it won't help you.
Fair point, the consensus seems that this was a bad move, I have obviously been very lucky in the past with other repairs where it seems to have made a difference.
What resistance do you have to ground on those scorched PCI-E connector pins? Do not use short detection ("beep") mode. Use actual ohms mode, and report actual ohms, please.

PCIE 12 volt to ground is 3.4M ohms after the card has been disconnected all night. It falls gradually from there to 19.14k ohms after 1 minute.
 
Fair point, the consensus seems that this was a bad move, I have obviously been very lucky in the past with other repairs where it seems to have made a difference.

People don't understand why it makes a device work again, it is not a repair, it is a temporary kludge that will fail again.

When a solder joint fails, especially BGA, the crack is microns thick. It doesn't take much to cause the two sides of the joint to expand enough to make contact again, which is what throwing it in the oven does. The heat can make the two sides stick together again weakly without reforming the joint, as would happen with a proper flux reflow with a hot air station or reballing and remounting the chip. The weak bond can work well enough to make the chip function again, but it won't last. More heat like from a GPU ASIC or mechanical movement can dislodge the parts that were conducting and cause joint failure again.

If the solder joint experiences high current, the weak connections made inside the cracked joint can burn from the high current being pushed through a path far too small.

Here are some SEM pictures of failed BGA joints.
https://www.researchgate.net/profil...lder-joint-progressive-crack-17-mm-BGA-V1.ppm
https://ai2-s2-public.s3.amazonaws....1e54d5dc1a3288705b86e323ed6fd/3-Figure6-1.png

150C is not going to fix that.
 
Check for components that are shorted/open that shouldn't be.

Reflash the BIOS of the card.
Reflashed the BIOS of the card with several compatible versions on TechPowerUp, but the symptoms remain the same. I've reverted to the same version that the card had when I received it.

If the power delivery circuit feeding the memory is failing under load when the card starts, should I try modifying the bios and under-volting the memory?
 
No, you should isolate the failed components and replace them. RazorWind does this for his own enjoyment and is kind enough to regularly share the shenanigans with us.
 
A reflash is something that I want to try. GPUz managed to pull the bios file from the card, so I should be able to push a new one. I'll give it a shot tomorrow and see if it makes any difference.
Honestly it might be easier/safer to just boot into linux and see if it works. I've bought a decent number of error 43 cards that are caused by a custom/unsigned BIOS file and work fine.
 
No, you should isolate the failed components and replace them. RazorWind does this for his own enjoyment and is kind enough to regularly share the shenanigans with us.
The first thing he should do is make sure the card has the proper BIOS image, if there's any suspicion it might not. This requires knowing what "brand" of 980 Ti this is. Download it from TechPowerUp and use nvflash to compare it to the one on the card.

My guess would be that they'll match. I'm not as up on my GPU mining knowledge as I probably should be, but I don't recall customizing the BIOS being that much of a thing with this generation of cards. Never hurts to check, though.
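
For anyone who would rather check this from a saved ROM file (GPU-Z can dump one, as mentioned earlier in the thread), here is a minimal sketch of a byte-for-byte comparison in Python. The file names in the usage comment are hypothetical, and this is not how nvflash itself does its comparison; it only tells you whether two files are identical and where they first diverge. Dumps made by different tools may differ in padding, so a mismatch near the end of the file is not necessarily a modified BIOS.

```python
import hashlib
from pathlib import Path


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One image is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_images(dumped: bytes, reference: bytes) -> dict:
    """Summarise how two ROM images compare."""
    return {
        "same": dumped == reference,
        "sha256_dumped": hashlib.sha256(dumped).hexdigest(),
        "sha256_reference": hashlib.sha256(reference).hexdigest(),
        "first_diff_offset": first_difference(dumped, reference),
    }


def compare_files(dumped_path, reference_path) -> dict:
    """Read two ROM files from disk and compare them."""
    return compare_images(Path(dumped_path).read_bytes(),
                          Path(reference_path).read_bytes())


# Usage (hypothetical file names):
#   print(compare_files("card_dump.rom", "techpowerup_reference.rom"))
```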

Once you're past that, if the suspicion is that you have a problem with the memory power rail (doubtful, if the card works at all), you would troubleshoot that. Take a static resistance measurement at the coils and compare to a known good value. With Samsung memory chips, a good value is ~15 ohms. Next step is to test for memory voltage when the card is misbehaving. You'll need to figure out which caps on the back are attached to the memory rail (do this with it powered off), but it's pretty easy to do this from the back, with the heatsink on.

Fair point, the consensus seems that this was a bad move, I have obviously been very lucky in the past with other repairs where it seems to have made a difference.


PCIE 12 volt to ground is 3.4M ohms after the card has been disconnected all night. It falls gradually from there to 19.14k ohms after 1 minute.
That's a good value. My guess would be that there was some corrosion or something on the slot pins that this card was connected to, and that's why you have the scorch marks. If it was used for mining, it would have been run at basically full power for days or weeks on end, and that could eventually allow the contacts to get hot enough to do that.

Also, what Gigabite said about baking the card. That "technique" is successful enough at temporarily getting hardware working again that it persists as something people think they should try, in the absence of actual repair services.
 
The first thing he should do is make sure the card has the proper BIOS image, if there's any suspicion it might not. This requires knowing what "brand" of 980 Ti this is. Download it from TechPowerUp and use nvflash to compare it to the one on the card.

My guess would be that they'll match. I'm not as up on my GPU mining knowledge as I probably should be, but I don't recall customizing the BIOS being that much of a thing with this generation of cards. Never hurts to check, though.
The BIOS I have flashed is the same as a known good version from TechPowerUp that matches the version printed on the card. It doesn't look like the version it came with has been modified either when I look at it in the MaxwellBiosTweaker tool.

Once you're past that, if the suspicion is that you have a problem with the memory power rail (doubtful, if the card works at all), you would troubleshoot that. Take a static resistance measurement at the coils and compare to a known good value. With Samsung memory chips, a good value is ~15 ohms. Next step is to test for memory voltage when the card is misbehaving. You'll need to figure out which caps on the back are attached to the memory rail (do this with it powered off), but it's pretty easy to do this from the back, with the heatsink on.
Samsung memory chips on this board, from the coil (R33 inductor?) to ground it reads as 7.3 ohms. I don't have a board to compare to myself, but if anyone knows where to look, my google-fu didn't turn any values up.
 
Honestly it might be easier/safer to just boot into linux and see if it works. I've bought a decent number of error 43 cards that are caused by a custom/unsigned BIOS file and work fine.
I've had the same experience with hard drives and USB devices: absolutely zero feedback or function on Windows, but they work perfectly in Linux. I have booted a live version of Ubuntu 18.04 LTS and the card has the same symptoms as when it is booted in Windows. Are there any tools under Linux that would give some sort of feedback on why the device won't start? Windows only seems to repeat error 43 forever across Device Manager and Event Viewer.

For devices that have so much technology built into them (their own BIOS), it's really surprising that there aren't more comprehensive diagnostic tools - cars from 15 years ago give more feedback on problems!
 
I've had the same experience with hard drives and usb devices. Absolutely zero feedback or function on windows, but work perfectly in Linux.

This is because Windows is stupid. Windows has historically relied on int13h BIOS calls for the disk subsystem, and the IDE/SATA subsystem is not fault tolerant at all. No sanity checks were ever built into IDE/SATA, so if a device misbehaves, it can bring down the whole system with interrupt storms, or just telling the system the drive is endlessly busy or not ready. This is one of the reasons SMR drives cause systems to fail, because during housekeeping where data is being shuffled around between SMR/CMR regions or SMR rewrites, it tells the system the drive is busy/not ready and everything grinds to a halt waiting on the hard drive. The terrible design of IDE/SATA goes back to the beginning of the IDE standard when it was quite literally just a buffered ISA bus. All of the advanced features like DMA, UDMA and LBA addressing were stacked on top of the original dumb standard without addressing the elephant in the room that IDE was literally just the ISA bus.

Linux on the other hand stopped using BIOS calls for hardware access decades ago, which is why things like int13h drive geometry limits never affected Linux, and Linux can work around bugged BIOSes with things like broken ACPI tables.

USB is a lot like IDE, with no real fault tolerance. A misbehaving USB device can bring down the whole machine in similar ways. I can't tell you how many service calls I've been out on where someone forgot they had plugged a flash drive into the back of the machine months ago and the flash controller crashed, causing the flash drive to send garbage data to the host. The host doesn't know what is going on, so the whole USB subsystem locks up and causes the system to have erratic behavior. Random freezing, BSODs, keyboard/mouse not working, etc. Said person is freaking out their machine is failing, and it turns out to be the USB stick they forgot about months ago.

Linux generally doesn't suffer from misbehaving USB devices, it just won't mount them, or will unmount them if they crash and reboot, as what happens with USB flash drives and even external hard drives with SATA to USB adapter boards.


For devices that have so much technology built into them (their own BIOS), it's really surprising that there aren't more comprehensive diagnostic tools - cars from 15 years ago give more feedback on problems!

Manufacturers don't want you fixing the products they sell you, hence why there are cryptic errors, no schematics, proprietary parts, etc. When it breaks, they want you to buy a completely new widget at full price from them. Welcome to the modern disposable society.

It's not like back in the 70s and 80s where almost everything had a schematic and parts availability. Didn't like what the market had to offer? Build your own and sell it. It's how the PC clone market got started.
 
The BIOS I have flashed is the same as a known good version from TechPowerUp that matches the version printed on the card. It doesn't look like the version it came with has been modified either when I look at it in the MaxwellBiosTweaker tool.


Samsung memory chips on this board, from the coil (R33 inductor?) to ground it reads as 7.3 ohms. I don't have a board to compare to myself, but if anyone knows where to look, my google-fu didn't turn any values up.
Your Google-Fu is weak. ;)

As it happens, I have a few of these 980 Tis lying around. The only one that works and has the Samsung memory chips shows 13.8 ohms to ground on the memory rail. The card in my thread about the 980 Ti Classified that you alluded to in your original post shows 9.xxx ohms, but it doesn't work, and it's always been my suspicion that that's too low. This is backed up by the docs for the EVGA E-Power, which say that you should be looking for 10-15 ohms on EVGA's 980 Tis equipped with Samsung memory.

What voltage do you have on the memory rail when the card is running? You should have 1.5 volts. If you have less than that, you need to troubleshoot why. Usually, when I see this, I'll go hunting for the source of the short but never find it, and eventually conclude that the short must be inside one of the big BGA ICs. In your case, I could imagine it being a cracked MLCC cap or something, since the card almost works normally.
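
To keep the numbers straight while probing, here is a small sketch that checks readings against the reference values quoted in this thread: 10-15 ohms to ground on the memory rail for Samsung-equipped EVGA 980 Tis (per the E-Power docs as described above) and a nominal 1.5 volts on the rail. The function names are made up for illustration, and no pass/fail tolerance is assumed for the voltage; it only reports the deviation so you can judge it yourself.

```python
# Reference values taken from this thread, not from an official datasheet.
MEM_RAIL_OHMS_RANGE = (10.0, 15.0)  # resistance to ground, Samsung memory
MEM_RAIL_NOMINAL_V = 1.5            # expected memory rail voltage


def in_range(value, low, high):
    """True if value lies within [low, high]."""
    return low <= value <= high


def percent_deviation(measured, nominal):
    """Signed deviation of a reading from its nominal value, in percent."""
    return (measured - nominal) / nominal * 100.0


if __name__ == "__main__":
    # Resistance readings mentioned so far in the thread.
    for label, ohms in (("working card", 13.8), ("this card", 7.3)):
        ok = in_range(ohms, *MEM_RAIL_OHMS_RANGE)
        print(f"{label}: {ohms} ohms, within 10-15 ohm range: {ok}")

    # Replace 1.5 with your own powered-on reading at the coil.
    reading_v = 1.5
    dev = percent_deviation(reading_v, MEM_RAIL_NOMINAL_V)
    print(f"memory rail: {reading_v} V ({dev:+.1f}% from nominal)")
```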
 
There's plenty of us fighting against planned obsolescence. The world doesn't need more garbage, it needs less waste and less low quality items.
 
Memory voltage at the coil is 1.574V. This is consistent across the capacitors dotted around the memory chips as well.

I've had a closer look at the capacitors around the memory. I've cleaned off some stubborn residue that I suspect is just left over flux from manufacturing.
20210918_105605.jpg

On the back of the card I noticed some black residue obscuring part of a pad and capacitor.
20210918_111248.jpg
20210918_111704.jpg


I've cleaned it off, checked over the rest of the board and given it a test

20210918_120023.jpg


Hello new symptoms!

20210918_122046.jpg


Sadly, I suspect that this indicates that the memory itself has failed? Possibly too much voltage reached the wrong pin through whatever it was that I cleaned off. I'm going to take another look at the board components and see if there is anything else that doesn't belong. I'll let you all know what I find.
 
Still no improvement. I have purchased a $50 paperweight.

Onto the next project?

Thank you for your suggestions everyone!
 
Given that the memory voltage appears to be a little high, I think I'd be checking for corrosion or damage on the sense pins for the memory power controller.

Hint: Look up the datasheet, and use it to figure out what the resistance on the sense pins should be. There's usually a formula.
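
As a rough illustration of the kind of formula meant here: many adjustable buck controllers set their output with a resistor divider on the feedback (sense) pin, giving Vout = Vref * (1 + R_top / R_bottom). The sketch below assumes that common topology. The 0.6 V reference and the 10k bottom resistor are hypothetical placeholders, not values for this card's controller, so take the real ones from the datasheet, which may also describe a different scheme.

```python
def divider_vout(vref, r_top, r_bottom):
    """Output voltage set by a feedback divider: Vref * (1 + R_top / R_bottom)."""
    return vref * (1.0 + r_top / r_bottom)


def r_top_for(vout, vref, r_bottom):
    """Top resistor needed for a target output, given Vref and R_bottom."""
    return r_bottom * (vout / vref - 1.0)


if __name__ == "__main__":
    VREF = 0.6          # hypothetical reference voltage, check the datasheet
    R_BOTTOM = 10_000   # hypothetical bottom resistor, in ohms

    # Resistor value the divider would need for a 1.5 V memory rail.
    r_top = r_top_for(1.5, VREF, R_BOTTOM)
    print(f"R_top for 1.5 V: {r_top:.0f} ohms")

    # Extra resistance in the top leg (for example from a corroded joint)
    # pushes the regulated output up; try a few offsets to see the trend.
    for extra in (0, 500, 1000):
        vout = divider_vout(VREF, r_top + extra, R_BOTTOM)
        print(f"+{extra} ohms in top leg -> {vout:.3f} V")
```

Comparing the resistances you measure around the sense pin with what the datasheet formula calls for is one way to tell whether the divider itself has drifted.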
 
Thanks RazorWind. I've checked out the sense pins, and according to the data sheet it looks like there's a dry solder joint on the sense pin and the LG2 pin. Resistance is in the 275 kohm range, which makes me think this could be a potential contributor.
20210924_122846.jpg


There also seems to be corrosion on the regulator chip(?) just before the memory coil.
20210925_121507.jpg


And it has a resistance of 3 Mohms compared to the mirrored chip with 7.6 ohms. One of these chips could be at fault.
I am going to try to source some SMD soldering equipment and see if I can resolder/replace these chips.
I'll report back here if I have any success.
 
Is there any utility in Linux that can provide feedback on why the device won't start? Windows appears to keep repeating error 43 in the device manager and event viewer indefinitely.
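
One low-effort option is to read what the kernel itself logged about the card. The sketch below shells out to two standard tools, `lspci -nnk` (PCI devices with their IDs and the kernel driver in use) and `dmesg` (kernel log, which may need root on some distributions), and keeps only lines that mention the GPU drivers. The keyword list is an assumption and not exhaustive: "NVRM" and "Xid" are tags the proprietary NVIDIA driver writes to the kernel log, and "nouveau" is the open-source driver. It will not diagnose the fault for you, but it usually says more than a bare error 43.

```python
import shutil
import subprocess

# Substrings to look for in the kernel log (an assumed, non-exhaustive list).
KEYWORDS = ("nvidia", "nvrm", "xid", "nouveau")


def filter_gpu_lines(text, keywords=KEYWORDS):
    """Return the lines of text that mention any keyword, case-insensitively."""
    return [line for line in text.splitlines()
            if any(k in line.lower() for k in keywords)]


def run(cmd):
    """Run a command and return its stdout, or '' if it cannot be run."""
    if shutil.which(cmd[0]) is None:
        return ""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    print("== PCI devices that look like the GPU ==")
    for line in filter_gpu_lines(run(["lspci", "-nnk"]), ("vga", "nvidia")):
        print(line)

    print("== Kernel log lines from the GPU drivers ==")
    for line in filter_gpu_lines(run(["dmesg"])):
        print(line)
```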
 