GeForce RTX 2080 Ti FAILS After Gaming for 2 Hours @ [H]

Then you don't understand what electromigration is. You're just confused because most retail parts don't have these issues: they have been tested and revised to avoid them. Electromigration can happen at any point in a product's lifespan. If the die is flawed from the start, you can see failure on first use or hours after.
This. Most hardware failures occur in the first two weeks; look at bell curves.
 
No, and there is a theory that custom PCBs may fare better with regard to these issues. It may be the reference PCB design that has issues, the PCB itself, or the memory. With a custom PCB like the one on the ASUS STRIX, issues don't seem to occur (to my knowledge), but this is all theory at this point.

Well, after a little over a week, no problems with my Strix yet.

knock on wood and everything
 
Ya, it's very possible if the traces are not of the proper size in some parts. Failures like this are very common on early die revisions. I wonder what stepping they are on: A, B, C? Most A dies have issues like this, and they are fixed with a major B stepping.

The internal metallization layers are susceptible to electromigration, physical traces are not, except at Extreme current densities; and 100A at 1.4V won't really electromigrate except as a cloud of copper vapor, if you're talking about interposer or PCB traces.

The power traces should be on a whole LAYER, called a Power Plane; they're connected by at least 1 via per pin to the die, vertically thru the PCB.

The internal metallization is, but the current densities on a 100 micron deposited metal interconnect are pretty extreme. :)

But this would be a hard fail; like poof, it's gone, and shorting a power supply rail due to metal vapor all over the chip, and volcanoing the chip. (the hole in the chip looks like a volcano; btdt) (I can't find a reference to the term, must be an Older American term.)
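
As a rough illustration of why current density matters so much here: Black's equation is the usual first-order model for electromigration lifetime, MTTF = A * J^-n * exp(Ea / kT). The sketch below compares modeled lifetimes under two operating conditions. The exponent n = 2 and the 0.8 eV activation energy are assumed, illustrative values, not figures for any actual GPU or memory die, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def mttf_ratio(j_ref, temp_ref_c, j, temp_c, n=2.0, ea_ev=0.8):
    """Return MTTF(j, temp_c) / MTTF(j_ref, temp_ref_c) per Black's equation.

    j_ref, j      -- current densities (any unit, as long as both match)
    temp_*_c      -- conductor temperatures in degrees Celsius
    n, ea_ev      -- ASSUMED model parameters (illustrative only)
    """
    t_ref_k = temp_ref_c + 273.15
    t_k = temp_c + 273.15
    # The unknown prefactor A cancels out when taking a ratio.
    density_term = (j_ref / j) ** n
    thermal_term = math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_k - 1.0 / t_ref_k))
    return density_term * thermal_term


# Same temperature, double the current density.
print(mttf_ratio(1.0, 85.0, 2.0, 85.0))
# Same current density, 20 degrees hotter.
print(mttf_ratio(1.0, 85.0, 1.0, 105.0))
```

With n = 2, doubling the current density at a fixed temperature cuts the modeled lifetime to a quarter, and a hotter conductor shortens it further. It is a ratio only; absolute lifetimes need the process-specific constant A.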



WTF? Hardware failures are not a bell curve; it's called a bathtub curve. The lower end is "Infant Mortality" and the upper end is "Wearout Phenomenon", which includes electromigration as a cause.

https://en.wikipedia.org/wiki/Bathtub_curve
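
For anyone who wants to see the shape, here is a minimal sketch of a bathtub-style failure rate, built as the sum of a decreasing Weibull hazard (infant mortality), a constant rate (random failures), and an increasing Weibull hazard (wearout). Every parameter is made up for illustration and describes no real product.

```python
def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


def bathtub_hazard(t_hours):
    """Sum of three failure mechanisms; all parameters are ILLUSTRATIVE."""
    infant = weibull_hazard(t_hours, shape=0.5, scale_hours=1_000.0)    # shape < 1: rate falls over time
    random_failures = 1e-5                                              # constant background rate
    wearout = weibull_hazard(t_hours, shape=5.0, scale_hours=50_000.0)  # shape > 1: rate rises over time
    return infant + random_failures + wearout


for t in (10, 20_000, 60_000):
    print(f"{t:>6} h: hazard = {bathtub_hazard(t):.2e} per hour")
```

The rate is high early on, drops through mid-life, and climbs again late, which is the bathtub shape.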

If that's a Ham license as a name, you need to retake some tests... :)


Electromigration in a new product is referred to by the Good Engineers in the group as the "Lightbulb Effect", and you'll be needing a new job after one of those designs hits the field, lol.

Disclaimer: I design Electronics...

One of the thermal images shows it's the actual memory parts that are overheating. 90°C is not a temperature a memory chip can operate at for very long; memory errors are almost certain at those temps.

I looked up the datasheet; there's some serious bogosity there. Micron is saying the Case temperature is ok at 95°C, but the maximum Storage temperature is 125°C.

If the case is at 95 degrees, and the max is 125, that's BS; there's no way.

The datasheet says there's an ON DIE temperature sensor; anyone got data from one, on a card that's failing?
 
More likely the demand is so high that they are keeping little stock on hand for RMAs, in an attempt to fill orders as fast as they can.

It seems anytime I think to check RTX card stock, all I see is sold out or [Auto-Notify]

Don't know about the USA, but in Europe guys who are returning their 2080 Tis are being told that the delays are because of unusually high return rates.
 
My understanding is that the EVGA XC (and XC Ultra) cards as well as the MSI Ventus cards use the reference boards. Has anyone found where these are failing?
 
Doesn't electromigration happen over time, as in eventually, as a result of actually running the IC?
 
Whoops, similar curve, just inverted, aka wrong name. I don't do mathsy graph distribution stuff like that much in my area of work. Pretty clear what I meant though: most failures within the first two weeks would not be a bell curve now, would it?
Anyway, that's what I was told by a large manufacturer of laptop systems when I worked for them and also handled warranty and insurance claims. I have walked the walk over multiple launches, not just engineered it ;)


Could storage temp be higher because the memory doesn't have to reliably store data at that temp? Is it possible that it can run hotter, but the hotter it runs the more errors it may have?
To me the behaviour looks like what happens when you OC memory too much.
 
Storage temperature is a function of the silicon; over the storage temp (125°C) it starts rediffusing the N- and P-type material back into random silicon.

I've run 2N3055 transistors at 185°C, and after a while, they aren't transistors anymore; at least, they won't turn OFF anymore. :)

That's why the really high power devices for Radio transmitters are still Toobz. :) Water Cooled Tubes, but still vacuum tubes.

The package temp is always lower than the core temp, and I'd love to know what the interior temp is when the outside is at 90C. :D
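
A back-of-the-envelope way to guess the interior temp is the steady-state relation Tj = Tc + P * theta_JC (junction temperature = case temperature + dissipated power times junction-to-case thermal resistance). The sketch below just evaluates that. The power and thermal resistance numbers are placeholders picked for illustration; they are NOT taken from the Micron datasheet.

```python
def junction_temp_c(case_temp_c, power_w, theta_jc_c_per_w):
    """Estimate die (junction) temperature from the measured case temperature.

    case_temp_c        -- package surface temperature in degrees C
    power_w            -- power dissipated in the chip, in watts
    theta_jc_c_per_w   -- junction-to-case thermal resistance, degrees C per watt
    """
    return case_temp_c + power_w * theta_jc_c_per_w


# ASSUMED example values: 90 C case, 2 W per memory chip, 5 C/W.
print(junction_temp_c(90.0, 2.0, 5.0))
```

The real answer depends on the actual per-chip power and the package's thermal resistance, and a reading taken through the PCB on the back side of the card is not a true case temperature either.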

This video shows WTF 90C I'm talking about:


Also realize: this is the back side of the card, and there's a heat sink on the other side of the die; this is thru the PCB. :O

The max temp is a function of the complexity of the chip; finer architecture is more sensitive.

I'd bet memory is at the bleeding edge of complexity, like microprocessors.


I did find a weird effect, tho: a Flash-based Digital Potentiometer we used was suspected of a problem, so we tested it to failure in an environmental chamber.

It would fail after ~1M write cycles, and fail to overwrite 1's. We found by accident that heatsoaking it at 160C for an hour allowed us to write to it another 1M times, and it could be fixed multiple times.

We never expected that at all.

The guys that made the chip were like, "that voids the warranty, and we don't advise that", lol.


There are Silicon Carbide, Gallium Nitride, and Diamond transistors coming onto the market that are good to 1000°C+. When they make chips out of those, we won't need water blocks, we can use molten Tin cooling loops.

:)
 
Have not seen that on mine. My 2nd screen is only a 60 Hz model, which I keep in portrait mode for browsing. NVIDIA has some weird bugs with multi-monitor which may or may not be killing the cards. I know multi-monitor runs ~55W at idle, which is excessive. I have also seen notes that the Edge browser had 2D issues in the past; this could also be driver related and hitting other browsers as well.
 
I've had mine get stuck at 1350 MHz and ~70W idle and G-SYNC wouldn't turn back on a few times. It is usually idling at 300 MHz at ~32W. Temps seem to have dropped with the driver update this week. The power consumption numbers are what HWiNFO64 reports.

The left monitor is 4K 60 Hz. The right one is 1440p running at 144 Hz.
 
I could easily be wrong, but that sounds like the driver issue that NV recently put a hotfix out for. If so, I wouldn't be surprised if there are still some situations where it occurs. What I remember is that it had something to do with two-monitor, G-SYNC and non-G-SYNC combos.
 


Going, Going...Gone.


Ugh, mine started locking up like this while playing last night. I didn't get any artifacting or crashes, but brief lockups that definitely seemed different from frame drops. Plus the game I was playing wasn't taxing for the card either. As with the other issues I've had, this was using GameStream to my TV while playing at a locked 60 Hz. I hope your video isn't a sign of things to come for me. At least I haven't sold my 1080 yet.
 




to everyone:

Please excuse me if this is a dumb, not very technical statement, since I may not be as experienced as many users here, or if this was already discussed; I haven't read the whole thread (I just searched for "1709", "1803" and "1809" with no results). This video, and another I recently watched (I don't remember exactly where) in which an RTX user has similar issues in Shadow of the Tomb Raider, with a lot of in-game pauses, reminded me of a test I recently ran.

On a spare HDD with a clean install of Windows 10 Pro version 1809 (build 17763.55) with the latest updates, I tried NV drivers 416.16, 416.34, and even the latest 416.81 on a system with a GTX 980 Ti, an i7 4770K @ 4.2 GHz, and 8GB RAM. I saw very similar issues in Rise of the Tomb Raider at full settings, no AA, 1920x1080, DX12 API: the game has a lot of in-game pauses like those in the videos. The interesting thing is that the issue does not occur on a clean installation of Windows 10 Pro version 1709 using the same settings, same hardware, same drivers, and same game scene.

I have the feeling that something may be broken with VRAM management in 1809, since those in-game pauses are typical of what happens when a video card runs out of VRAM. It would be interesting to know whether there are RTX users on 1709 having all those issues that seem related to VRAM (visual artifacting, pauses, etc.).

So even though my test was not performed with an RTX card, I think it's worth sharing my experience. Since 1809 is known to be a very rushed update, who knows, it could be a potential culprit in all this RTX mess. In fact, I have experienced other gaming issues with 1809, such as in The Evil Within 2, which has random mouse stutters when I play at 75 fps @ 75 Hz, an issue that also does not occur in 1709.
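
One way to check the "running out of VRAM" theory on either Windows build would be to log VRAM usage while the pauses happen. A minimal sketch, assuming `nvidia-smi` is on the PATH; it uses the tool's `--query-gpu=memory.used,memory.total` CSV output, and the function names here are made up for illustration.

```python
import subprocess
import time

QUERY_FIELDS = "memory.used,memory.total"


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into (used, total, fraction)."""
    used_str, total_str = [field.strip() for field in line.split(",")]
    used, total = int(used_str), int(total_str)
    return used, total, used / total


def read_vram(gpu_index=0):
    """Ask nvidia-smi for the current VRAM usage of one GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_line(result.stdout.strip().splitlines()[0])


def log_vram(samples=60, interval_s=1.0):
    """Print a timestamped VRAM reading every interval_s seconds."""
    for _ in range(samples):
        used, total, fraction = read_vram()
        print(f"{time.strftime('%H:%M:%S')} {used}/{total} MiB ({fraction:.0%})")
        time.sleep(interval_s)
```

Running `log_vram()` in a console while replaying the same game scene on 1709 and on 1809 would show whether usage actually sits near the card's limit when the pauses hit. It would not prove a Windows bug by itself, but it would confirm or rule out the out-of-VRAM explanation.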
 
It's stories like this that make me glad I decided to cheap out and stick with my GTX 1080 while waiting for 7nm parts.
 
Interesting, running 1803 here. So far it still works. Makes me hesitant to go to the Fall update...
 
I'm on 1809. I upgraded both systems on the first day and could not roll back by the time problems showed up.
But my 2080 Ti Strix is running fine, and so is my 1080 Ti Strix in the other system.
 
I'm getting those stutters too in Call of Duty when I have SLI enabled. The game might graphically freeze for 10-20 seconds at a time before returning to normal. After disabling SLI they went away, but I'm hoping this is just a driver problem, as I haven't seen any other issues.
 
I ordered a thermal imaging camera to look into this a little closer.

Also, I talked to a couple of people who actually BUILD video cards and I got the same feedback I expected: memory issues. And yes, the card does have Micron on it.
 
Multi-monitor is reported as causing problems, and I don't believe the latest update fixed it at all. The only issue fixed so far is the G-SYNC-related BSOD.
 
But it was the greatest 2 hours of gaming of my life. Looking forward to Nvidia365 where we can rent our cards.
 
Guess a bad batch of Micron memory is high on the suspect list, or a cooling implementation failure.
 
My Gainward RTX 2080Ti Phoenix Golden Sample finally got returned for RMA, it should arrive at the retailer tomorrow and I am curious if they'll just refund me. They have said that because of the many RTX 2080Ti issues they are currently not restocking those models.

Also Nvidia are changing the prefix on the cards now.
 