Gigabyte RTX 2080 Super crashing under load

jasonexe
n00b · Joined: Mar 17, 2021 · Messages: 20
hi everyone,



I own a Gigabyte RTX 2080 Super 8G, and since last year it has been crashing every time I open a game or any other 3D application. The screen goes black and the GPU fans spin up to full speed. I can't shut down the PC normally after that, so the only way to recover is to hold the power button. I tried the GPU in the other rigs I have, but no luck. I also cleaned the card and replaced the thermal pads and thermal paste, but that didn't help either.

The only workaround I've found is to undervolt the GPU to 0.900 V (900 mV) at 1820 MHz; the normal voltage for this card is 1.050 V. The card never had overheating problems or anything, so I'm not sure what the issue is. I've heard that defective VRMs are common on RTX 2070/2080/2070S/2080S cards, but I don't really know how to find a defective VRM, because I believe all the GPU vcore VRM phases are connected together, and likewise the memory phases (tell me if I'm wrong). I found 3 VRMs on the left side of the PCB and they all give different voltage readings. I'm not sure if that is normal, but I'll post a picture of the PCB with the details.
Any help would be appreciated.

Kind Regards,

Jason
 

Attachments

  • ACF7C55E-8B38-4E24-BE10-C8AD5DAD57CD.jpeg (951.3 KB)
When you reduce the core voltage, does it work normally, other than being just a little slower?
 
What happens if you increase the voltage a little? Does it still crash?

Also, when it crashes, are we talking immediately when you start a game - no frames rendered at all, or does it run for some period of time (seconds? Minutes?) and then crash?

There are a few possibilities here; one is some sort of BGA or silicon failure, but the fact that it works if you undervolt it suggests it could be something else. If you probe the output from the core VRM after it crashes, what do you get?
 
900 mV was the highest stable voltage I could get; going higher than that gives me a black screen with the fans at full speed. If I load a game at stock voltage, it crashes after 0-10 seconds. I actually didn't try measuring the VRM output voltage when the GPU crashes; that would be a good idea. I'll give it a try tomorrow and let you know, because it's pretty late here.
 
Hi, I checked the voltage from the vcore and memory VRMs when it crashes, and I didn't get any voltage readings.
 
That's probably good news, in the sense that it means the core/memory VRM controller IC is most likely shutting down due to one of its various built-in protection functions.

I believe the controller on the 2080 is a UP9511 or UP9512. Datasheet for the 9511 linked below.
http://www.icware.ru/pdf/0004239.pdf

If you read page 11, you'll see that it has several protection modes that can cause it to shut down. My guess would be that it's shutting down due to overvoltage (which it may be measuring incorrectly), but it's also possible that it's shorted internally and the current leakage inside it is heating it up, causing it to shut down prematurely.

Try this: Unplug the OS drive and power the system up and let it sit there at the boot screen where it complains that there's no OS drive. Measure the core and memory voltage and report back.

As I recall, you should be looking for about 1.35V on the memory rail, and 1.063V on the core rail. If you have something way off from this, then the first thing to check is the tiny resistors around the UP9511. There is a specific value that they need to be in order to properly calibrate the UP9511's current and voltage sensing, and if they're off, then its behavior won't be correct. Maybe check for corrosion in that area?
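As a quick sanity check against those expected values, here's a rough sketch. The nominal voltages are the ones I recalled above; the ±5% window is my own assumption, not a datasheet figure:

```python
# Hypothetical sanity check for rail voltages measured at the POST screen.
# Nominal values are the ones recalled above; the +/-5% window is an assumption.
# Note: a card that has downclocked to idle will legitimately sit well below
# the full-load core voltage, so only flag readings taken in a known state.
NOMINAL_V = {"memory": 1.350, "core": 1.063}  # volts
TOLERANCE = 0.05  # +/-5%

def check_rails(measured):
    """Compare measured rail voltages (in volts) against the nominal values."""
    verdicts = {}
    for rail, nominal in NOMINAL_V.items():
        deviation = abs(measured[rail] - nominal) / nominal
        verdicts[rail] = "OK" if deviation <= TOLERANCE else f"off by {deviation:.0%}"
    return verdicts

print(check_rails({"memory": 1.355, "core": 1.050}))  # both within 5%
```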
 
I unplugged my drive and booted up the PC; I got 760 mV from vcore and 1355 mV from memory. I also took a picture of where the UP9512R is located. I didn't see any corrosion or anything.
 

Attachments

  • B1DCB43B-67B6-4F80-B981-E2660A55E180.jpeg (782.2 KB)
That sounds like it's probably normal. There are a few things you can try next, but one thing I might try is to let it sit there at the post screen for a minute or two, and then feel the back of the board for any obvious hot spots. If you find any, troubleshoot that phase.

The power stages almost certainly have a thermal protection feature of their own, and I remember one card that I fixed once where one of the phases' bootstrap capacitors was bad, causing the FETs on that phase to run craaaaaazy hot, because the gate voltage was lower than expected. Also check behind the UP9512. That has a thermal protection feature too.
 
I will take a look tomorrow and let you know if I find any hot spots. Also, where exactly are the bootstrap capacitors located?
 
The bootstrap capacitors should be located very close to their respective VRM power stage ICs. You'll have to look up the datasheets and figure out which ones they are yourself, but I may be able to help you if you can share a closeup photo of the power stages, or read off the markings and post them.
 
I booted my PC and touched the back side of the board behind the core VRMs and didn't really feel any abnormal temperatures; same for the UP9512R IC. I also took a picture of the power stages being used: they're SiC788A.
 

Attachments

  • 376F6D27-D715-4C16-9EDB-EB5617BCF40C.jpeg (795.3 KB)
According to the datasheet for the SIC788A, the bootstrap capacitors are connected to pin 4.
https://www.vishay.com/docs/62985/sic788a.pdf

If none of them are obviously running hotter than the others, that's probably not it, though. Bad current sense resistor is another (remote) possibility, I suppose. Check the datasheets for the controller and power stages and pay attention to the overcurrent and overtemperature protection features they have.
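To illustrate why a shifted sense element matters, here's a rough sketch of the sensing math. The resistance and current values are made-up examples, not figures from this board:

```python
# Rough illustration of how a drifted current-sense resistance skews the
# controller's current reading. All values are illustrative, not board values.
def reported_current(v_sense, r_sense):
    """The controller infers current from the voltage across the sense element."""
    return v_sense / r_sense

actual_current = 50.0                  # amps actually flowing (example)
r_fitted = 0.0005                      # 0.5 mOhm actually on the board (assumed)
v_sense = actual_current * r_fitted    # voltage the controller actually sees

# If the controller's calibration expects a smaller resistance than is
# really fitted, it over-reports the current and can trip OCP early:
r_calibrated = 0.00035
print(reported_current(v_sense, r_calibrated))  # ~71 A reported for 50 A real
```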
 
I will take a look at it. I'm not sure how to locate the current sense resistor or other components by looking at the controller/power stage datasheets; this is all new for me.
 
I used my microscope to find pin 4 on the core power stage and noticed that the connection goes through to the other side of the PCB (I think), but I can't seem to find a capacitor connected to it. It's difficult to tell. I'll upload an image of the back side where the power stage is located, and also one of the power stage itself.
 

Attachments

  • DF7EB88D-A3F5-4880-89C1-CE04242D9A03.jpeg (436.4 KB)
  • FC8039B1-C7E0-47E4-B17D-3C6A19B32EED.jpeg (345.6 KB)
Pin 4 is on the other edge visible in your photo - the edge perpendicular to the markings. It looks like it also connects directly to a through-hole, but the cap is usually very close. I'd look for any MLCC caps directly behind that IC, and use a multimeter to figure out which ones are connected to that pin.

Something else occurred to me that might help narrow this down: if you look at the board power reading in GPU-Z while the card is working, what do you get? If you're undervolting, you should be well under 100% of the power limit. I don't remember what that is in watts, but it should be pretty low. Check and see if the numbers you get there look sane, in watts and in percent.
If you have higher reported power consumption than is possible at the given voltage, then that suggests a problem with current sensing somewhere on the board.
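A back-of-the-envelope way to judge whether the reported power is sane. The V² scaling is a rough approximation of dynamic power, not an exact GPU power model; the 250 W limit and 1.050 V stock voltage are the figures from this thread:

```python
# Back-of-the-envelope check: dynamic power scales roughly with V^2 * f,
# so an undervolted card should report well under its stock power limit.
# This is a rough approximation, not an exact GPU power model.
def expected_power(stock_power, stock_v, new_v, stock_f=1.0, new_f=1.0):
    """Estimate power draw after a voltage/frequency change, in watts."""
    return stock_power * (new_v / stock_v) ** 2 * (new_f / stock_f)

est = expected_power(250.0, 1.050, 0.900)  # values from this thread
print(f"~{est:.0f} W expected at 0.900 V")  # ~184 W
# If GPU-Z still reports ~250 W (100%) while undervolted like this,
# the board's current sensing is probably reading high.
```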
 
Do you mean in the second picture I sent? I used GPU-Z to monitor GPU power etc. while running Fire Strike in 3DMark, and I noticed that 'PerfCap Reason' said Idle while the GPU was under load, which is weird to me.
 
What happens if you run FurMark?
I ran FurMark for a few seconds and noticed that 'PerfCap Reason' went to Pwr; the TDP also went to ~250 W, 100%, which is the limit for this card. I'll post a screenshot below.
 

Attachments

  • 7FD6D0F0-B63A-493D-BB75-9E12555ED331.png (54 KB)
Hello,

I have the same problem with the same Gigabyte GPU, did you find any other explanation?

Best regards,

Sheldon
 
Nope. I suspect the power stages for the GPU core might be the issue. I'll order new ones and replace them; I know this is a common issue with RTX 2070/2070S/2080/2080S GPUs, so I'll give it a try.
 
Check if you still have warranty coverage; Gigabyte GPUs should have a 3-year warranty.
 
Probably not your case, but I had a situation of my own that I'll share.
I have an old HD 7870 GHz Edition in my arcade cabinet that started to flake out one day. It ran fine but would crash instantly in any 3D app and give an error code saying as much. One fan had failed, so I had mounted a separate non-GPU-powered 80 mm fan (powered by the PSU), re-pasted it, etc. After that failed, I did some driver wipes and even a few OS reinstalls. Same problems. It wasn't overheating at all, so the fan shouldn't have mattered.
I put it in a drawer and installed an even older card. It sat for a long time, then I gave it a retry. This time I fully unplugged the onboard GPU fan header (a splitter to two fans) and went with only my own rigged-up 2x 80 mm fans powered by the motherboard or PSU. It worked fine after that.
Something was happening, even though one fan still worked perfectly, that caused the problems.
One of your three fans might be having issues even if it isn't noticeable.
As a test, unplug your fans and tape/zip-tie some 80 mm fans over the shroud holes to maintain decent cooling, then test with something moderate that ain't gonna stress the card too hard. See if it works.
I can't believe it worked for me, but that card is 1+ years out of the waste basket now and hasn't had a single crash or issue since.
Like I said, unlikely, but worth a go if it's not under warranty. No risk and nothing to lose.
 
What happened after those few seconds of FurMark? Did it crash, or did you just stop it?
 