Gigabyte RTX 2080 Super crashing under load

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
Hi everyone,

I own a Gigabyte RTX 2080 Super 8G, and since last year it has been crashing every time I open a game or any 3D application. My screen goes black and the GPU fans go to full speed. I can’t shut down the PC normally after that, so the only way to turn it off and on again is by pressing the power button. I tried the GPU in the different rigs I have, but no luck. I also cleaned the GPU and replaced the thermal pads and thermal paste, but that didn’t help either. The only solution I found is to undervolt the GPU to 0.900 V (900 mV) at 1820 MHz. The normal voltage for this card is 1.050 V. This GPU never had overheating problems or anything, so I’m not sure what the problem is. I heard that defective VRMs are common in RTX 2070/2080/2070S/2080S GPUs, but I don’t really know how to find a defective VRM, because I believe all the GPU Vcore VRMs are connected together, and so are the memory VRMs (tell me if I’m wrong). I found 3 VRMs on the left side of the PCB and they all have a different voltage reading. I’m not sure if that is normal, but I will post a picture of the PCB with the details.
Any help would be appreciated.

Kind Regards,

Jason
 

Attachments

  • ACF7C55E-8B38-4E24-BE10-C8AD5DAD57CD.jpeg (951.3 KB)

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
When you reduce the core voltage, does it work normally, other than being just a little slower?
 

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
What happens if you increase the voltage a little? Does it still crash?

Also, when it crashes, are we talking immediately when you start a game - no frames rendered at all, or does it run for some period of time (seconds? Minutes?) and then crash?

There are a few possibilities here; one is some sort of BGA or silicon failure, but the fact that it works if you undervolt it suggests it could be something else. If you probe the output from the core VRM after it crashes, what do you get?
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
900 mV was the highest stable voltage I could get; going higher than that gives me a black screen with the fans running at full speed. If I load a game at stock voltage it crashes after 0-10 seconds. I haven't actually tried measuring the VRM output voltage when the GPU crashes, but that would be a good idea. I will give it a try tomorrow and let you know, because it’s pretty late here.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
Hi, I checked the voltage from the Vcore and memory VRMs when it crashes and I didn’t get any voltage readings.
 

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
That's probably good news, in the sense that it means the core and memory controller IC is most likely shutting down due to one of its various built-in protection functions.

I believe the controller on the 2080 is a UP9511 or UP9512. Datasheet for the 9511 linked below.
http://www.icware.ru/pdf/0004239.pdf

If you read page 11, you'll see that it has several protection modes that can cause it to shut down. My guess would be that it's shutting down due to over voltage (which it may be measuring incorrectly), but it's also possible that it's shorted internally, and the current leakage inside it is heating it up, and causing it to shut down prematurely.

Try this: Unplug the OS drive and power the system up and let it sit there at the boot screen where it complains that there's no OS drive. Measure the core and memory voltage and report back.

As I recall, you should be looking for about 1.35V on the memory rail, and 1.063V on the core rail. If you have something way off from this, then the first thing to check is the tiny resistors around the UP9511. There is a specific value that they need to be in order to properly calibrate the UP9511's current and voltage sensing, and if they're off, then its behavior won't be correct. Maybe check for corrosion in that area?
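For anyone who wants to keep track of multimeter readings against those figures, here is a rough Python sketch. The nominal values are just the ones recalled above (not datasheet numbers), and the 5% tolerance and the function names are my own assumptions for illustration.

```python
# Nominal rail voltages as recalled in this post (assumed, not from a datasheet).
NOMINAL_V = {"memory": 1.35, "core": 1.063}


def rail_deviation(rail: str, measured_v: float) -> float:
    """Signed deviation from nominal as a fraction (0.05 means +5%)."""
    nominal = NOMINAL_V[rail]
    return (measured_v - nominal) / nominal


def check_rail(rail: str, measured_v: float, tolerance: float = 0.05):
    """Return ("ok" or "off", deviation). The 5% tolerance is an arbitrary assumption."""
    deviation = rail_deviation(rail, measured_v)
    status = "ok" if abs(deviation) <= tolerance else "off"
    return status, deviation


if __name__ == "__main__":
    # Example readings in volts; replace with your own measurements.
    for rail, volts in (("memory", 1.355), ("core", 0.760)):
        status, deviation = check_rail(rail, volts)
        print(f"{rail}: {volts:.3f} V ({deviation:+.1%}) -> {status}")
```

Keep in mind that an "off" result on the core rail is not automatically a fault: at the boot screen the GPU may simply be sitting in a low-power state with a reduced core voltage.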
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
I unplugged my drive and booted up the PC. I got 760 mV from Vcore and 1355 mV from memory. I also took a picture of where the UP9512R is located. I didn’t see any corrosion or anything.
 

Attachments

  • B1DCB43B-67B6-4F80-B981-E2660A55E180.jpeg (782.2 KB)

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
That sounds like it's probably normal. There are a few things you can try next, but one thing I might try is to let it sit there at the post screen for a minute or two, and then feel the back of the board for any obvious hot spots. If you find any, troubleshoot that phase.

The power stages almost certainly have a thermal protection feature of their own, and I remember one card that I fixed once where one of the phases' bootstrap capacitors was bad, causing the FETs on that phase to run craaaaaazy hot, because the gate voltage was lower than expected. Also check behind the UP9512. That has a thermal protection feature too.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
I will take a look tomorrow and let you know if I find any hot spots. Also, where exactly are the bootstrap capacitors located?
 

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
The bootstrap capacitors should be located very close to their respective VRM power stage ICs. You'll have to look up the datasheets and figure out which ones they are yourself, but I may be able to help you if you can share a closeup photo of the power stages, or read off the markings and post them.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
I booted my PC and touched the back side of the core VRMs and didn’t really feel any abnormal temperatures; same for the UP9512R IC. I also took a picture of the power stages being used. They are SIC788A.
 

Attachments

  • 376F6D27-D715-4C16-9EDB-EB5617BCF40C.jpeg (795.3 KB)

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
According to the datasheet for the SIC788A, the bootstrap capacitors are connected to pin 4.
https://www.vishay.com/docs/62985/sic788a.pdf

If none of them are obviously running hotter than the others, that's probably not it, though. Bad current sense resistor is another (remote) possibility, I suppose. Check the datasheets for the controller and power stages and pay attention to the overcurrent and overtemperature protection features they have.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
I will take a look at it. I’m not sure how to locate the current sense resistor or other components from the datasheets of the controller and power stages; this is new to me.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
I used my microscope to find pin 4 of the core VRM and noticed that the connection goes to the other side of the PCB (I think), but I can’t seem to find a capacitor connected to it. It’s difficult to tell. I will upload an image of the back side of the board behind the VRM, and also one of the side with the VRM itself.
 

Attachments

  • DF7EB88D-A3F5-4880-89C1-CE04242D9A03.jpeg (436.4 KB)
  • FC8039B1-C7E0-47E4-B17D-3C6A19B32EED.jpeg (345.6 KB)

RazorWind

Supreme [H]ardness
Joined
Feb 11, 2001
Messages
4,498
Pin 4 is on the other edge visible in your photo - the edge perpendicular to the markings. It looks like it also connects directly to a through hole, but the cap is usually very close. I'd look for any MLC caps directly behind that IC, and use a multimeter to figure out which ones are connected to that pin.

Something else that occurred to me that might help narrow this down - if you look at the board power reading in GPU-Z while the card is working, what do you get? If you're undervolting, you should be well under 100% of the power limit. I don't remember what that is in watts, but it should be pretty low. Check and see if the numbers you get there look sane, both in watts and as a percentage.
If you have higher reported power consumption than is possible at the given voltage, then that suggests a problem with current sensing somewhere on the board.
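To get a rough idea of what "possible at the given voltage" means, here is a small Python sketch. It assumes the common rule of thumb that dynamic power scales with voltage squared times clock frequency, and it ignores leakage and memory/fan power, so treat the result as a ballpark only. The 250 W stock figure in the example is a placeholder, and the function name is made up for illustration.

```python
def scaled_power(stock_power_w, stock_v, new_v, stock_mhz=None, new_mhz=None):
    """Estimate power after a voltage (and optionally clock) change.

    Uses the rule of thumb P ~ V^2 * f for dynamic power only.
    Leakage and non-core power are ignored (a simplifying assumption).
    """
    ratio = (new_v / stock_v) ** 2
    if stock_mhz and new_mhz:
        ratio *= new_mhz / stock_mhz
    return stock_power_w * ratio


if __name__ == "__main__":
    # 0.900 V undervolt versus 1.050 V stock, as described in this thread.
    # 250 W is only an illustrative stock power figure.
    estimate = scaled_power(250.0, stock_v=1.050, new_v=0.900)
    print(f"Estimated power at 0.900 V: {estimate:.1f} W")
```

With those numbers the estimate comes out at about 73% of the stock figure, so an undervolted card that still reports close to 100% under a normal game load would look suspicious. A power-virus load such as FurMark can still push the card into its power limit, so compare like with like.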
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
Do you mean in the second picture I sent? I used GPU-Z to monitor GPU power etc. while running Fire Strike in 3DMark, and I noticed that ‘PerfCap Reason’ says Idle while the GPU is under load, which is weird to me.
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
What happens if you run FurMark?
I ran FurMark for a few seconds and noticed that ‘PerfCap Reason’ went to Pwr. The board power also went to roughly 250 W (100% TDP), which is the limit for this card. I will post a screenshot below.
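If a log is easier to share than screenshots, something like the Python sketch below could record the same kind of numbers. It assumes the NVIDIA driver's nvidia-smi tool is installed and on the PATH, and that it reports numeric values for the power.draw, power.limit and clocks.gr query fields on this card; the helper names are my own.

```python
import subprocess

# Fields requested from nvidia-smi: board power draw, power limit, graphics clock.
QUERY = "power.draw,power.limit,clocks.gr"


def parse_sample(line: str) -> dict:
    """Parse one CSV line such as '183.50, 250.00, 1820' into numbers."""
    draw_w, limit_w, clock_mhz = (float(field) for field in line.split(","))
    return {
        "power_w": draw_w,
        "limit_w": limit_w,
        "percent": 100.0 * draw_w / limit_w,
        "clock_mhz": clock_mhz,
    }


def read_sample() -> dict:
    """Query the first GPU once. Fails if a field is reported as [N/A]."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


if __name__ == "__main__":
    sample = read_sample()
    print(f"{sample['power_w']:.1f} W of {sample['limit_w']:.1f} W "
          f"({sample['percent']:.0f}%) at {sample['clock_mhz']:.0f} MHz")
```

Running it in a loop while a game or benchmark is active would show whether the reported power sits near the limit even at the reduced voltage.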
 

Attachments

  • 7FD6D0F0-B63A-493D-BB75-9E12555ED331.png (54 KB)

Shelf24

n00b
Joined
Mar 8, 2023
Messages
1
Hello,

I have the same problem with the same Gigabyte GPU. Did you find any other explanation?

Best regards,

Sheldon
 

Jason2000

n00b
Joined
Mar 17, 2021
Messages
19
Nope. I suspect that the GPU core VRMs might be the issue. I will order new ones and replace them. I know this is a common issue with RTX 2070/2070S/2080/2080S GPUs, so I will give it a try.
 

rcarlos

n00b
Joined
Oct 7, 2016
Messages
15
Check if you still have warranty coverage; Gigabyte should have a 3-year warranty on GPUs.
 