GeForce RTX 2080 Ti FAILS After Gaming for 2 Hours @ [H]

N4CR

Supreme [H]ardness
Joined
Oct 17, 2011
Messages
4,687
Then you don't understand what electromigration is. You are just confused because most retail parts don't have these issues, as they have been tested and revised to avoid them. Electromigration can happen at any point in a product's lifespan. If the die is flawed from the start, you can see failure on first use or hours after.
This. Most hardware failures occur in the first two weeks; look at bell curves.
 

bill_d

Limp Gawd
Joined
Jun 8, 2007
Messages
193
No, and there is a theory that custom PCBs may fare better with regard to these issues. It may be the reference PCB design that has issues, or the PCB itself, or the memory. If it's a custom PCB like on the ASUS STRIX, issues don't seem to occur (to my knowledge), but this is all a theory at this point.

Well, after a little over a week, no problem with my Strix yet.

Knock on wood and everything.
 

DrBorg

Gawd
Joined
Jan 22, 2005
Messages
555
Yeah, it's very possible if the traces are not the proper size in some parts. Failures like this are very common on early die revisions. I wonder what stepping they are on: A, B, C? Most A dies have issues like this, and they are fixed with a major B stepping.

The internal metallization layers are susceptible to electromigration, physical traces are not, except at Extreme current densities; and 100A at 1.4V won't really electromigrate except as a cloud of copper vapor, if you're talking about interposer or PCB traces.

The power traces should be on a whole LAYER, called a Power Plane; they're connected by at least 1 via per pin to the die, vertically thru the PCB.

The internal metallization is, but the current densities on a 100-micron deposited metal interconnect are pretty extreme. :)

But this would be a hard fail; like poof, it's gone, and shorting a power supply rail due to metal vapor all over the chip, and volcanoing the chip. (the hole in the chip looks like a volcano; btdt) (I can't find a reference to the term, must be an Older American term.)


This. Most hardware failures occur in the first two weeks; look at bell curves.

WTF? Hardware failures are Not a Bell curve, it's called a Bathtub Curve, and the lower end is "Infant Mortality", the upper end is "Wearout Phenomenon", which includes electromigration as a cause.

https://en.wikipedia.org/wiki/Bathtub_curve
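To make the shape concrete, here is a rough sketch of a bathtub-shaped failure rate. It models the hazard as the sum of a falling Weibull term (infant mortality), a constant term (random failures), and a rising Weibull term (wearout). This is one common way to illustrate the curve, not something taken from the linked article, and every parameter value below is a made-up illustration number, not measured data for any GPU or memory chip.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, infant=(0.5, 2000.0), random_rate=1e-5, wearout=(5.0, 50000.0)):
    """Illustrative bathtub hazard at t hours (all parameters are hypothetical).

    infant:      (shape, scale) with shape < 1, so the rate falls over time
    random_rate: constant background failure rate per hour
    wearout:     (shape, scale) with shape > 1, so the rate rises over time
    """
    return weibull_hazard(t, *infant) + random_rate + weibull_hazard(t, *wearout)
```

Evaluating `bathtub_hazard` at 1 hour, 10,000 hours, and 60,000 hours gives a high rate, then a low one, then a rising one again: high at both ends, flat in the middle, which is the opposite of a bell.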

If that's a Ham license as a name, you need to retake some tests... :)


Electromigration in a new product is referred to by the Good Engineers in the group as "Lightbulb Effect", and you'll be needing a new job after one of those designs hits the field, lol.

Disclaimer: I design Electronics...

The heat shown in one of the thermal images shows it's the actual Memory parts that are overheating; 90°C is not a temperature a memory chip should operate at for very long, and memory errors are almost certain at those temps.

I looked up the datasheet; there's some serious bogosity there. Micron is saying the Case temperature is ok at 95°C, but the maximum Storage temperature is 125°C.

If the case is at 95 degrees, and the max is 125, that's BS; there's no way.

The datasheet says there's an ON DIE temperature sensor; anyone got data from one, on a card that's failing?
 

reaper12

2[H]4U
Joined
Oct 21, 2006
Messages
2,519
More likely that the demand is so high, that they are keeping little stock on hand for RMA's, in an attempt to fill orders as fast as they can.

It seems anytime I think to check RTX card stock, all I see is sold out or [Auto-Notify]

Don't know about the USA, but in Europe, guys who are returning their 2080 Tis are being told that the delays are because of unusually high return rates.
 

Redmud

Weaksauce
Joined
Jan 30, 2007
Messages
111
My understanding is that the EVGA XC (and XC Ultra) cards as well as the MSI Ventus cards use the reference boards. Has anyone found where these are failing?
 

Digital Viper-X-

[H]F Junkie
Joined
Dec 9, 2000
Messages
14,115
Then you don't understand what electromigration is. You are just confused because most retail parts don't have these issues, as they have been tested and revised to avoid them. Electromigration can happen at any point in a product's lifespan. If the die is flawed from the start, you can see failure on first use or hours after.

Doesn't electromigration happen over time, as in eventually, as a result of actually running the IC?
 

N4CR

Supreme [H]ardness
Joined
Oct 17, 2011
Messages
4,687
WTF? Hardware failures are Not a Bell curve, it's called a Bathtub Curve, and the lower end is "Infant Mortality", the upper end is "Wearout Phenomenon", which includes electromigration as a cause.

https://en.wikipedia.org/wiki/Bathtub_curve

If that's a Ham license as a name, you need to retake some tests... :)



The datasheet says there's an ON DIE temperature sensor; anyone got data from one, on a card that's failing?

Whoops, similar curve, just inverted, i.e. wrong name. I don't do much mathsy graph-distribution stuff in my area of work. Pretty clear what I meant, though: most failures within the first two weeks would not be a bell curve now, would it?
Anyway, that's what I was told by a large manufacturer of laptop systems when I worked for them and also handled warranty and insurance claims. I have walked the walk over multiple launches, not just engineered it ;)


Could storage temp be higher because the memory doesn't have to reliably store data at that temp? Is it possible that it can run hotter, but the hotter it runs the more errors it may have?
To me the behaviour looks like what happens when you OC memory too much.
 

DrBorg

Gawd
Joined
Jan 22, 2005
Messages
555
Storage temperature is a function of the silicon; above the storage temp (125C) it starts rediffusing the N- and P-type material back into random silicon.

I've run 2N3055 transistors at 185 degrees C, and after a while, they aren't transistors anymore; at least, they won't turn OFF anymore. :)

That's why the really high power devices for Radio transmitters are still Toobz. :) Water Cooled Tubes, but still vacuum tubes.

The package temp is always lower than the core temp, and I'd love to know what the interior temp is when the outside is at 90C. :D

This video shows WTF 90C I'm talking about:

Also realize: this is the back side of the card, and there's a heat sink on the other side of the die; this is thru the PCB. :O

The max temp is a function of the complexity of the chip; finer architecture is more sensitive.

I'd bet memory is at the bleeding edge of complexity, like microprocessors.


I did find a weird effect, tho: a Flash-based Digital Potentiometer we used was suspected of a problem, so we tested it to failure in an environmental chamber.

It would fail after ~1M write cycles, and fail to overwrite 1's. We found by accident that heatsoaking it at 160C for an hour allowed us to write to it another 1M times, and it could be fixed multiple times.

We never expected that at all.

The guys that made the chip were like, "that voids the warranty, and we don't advise that", lol.


There are Silicon Carbide, Gallium Nitride, and Diamond transistors coming onto the market that are good to 1000°C+. When they make chips out of those, we won't need water blocks, we can use molten Tin cooling loops.

:)
 

Slade

2[H]4U
Joined
Jun 9, 2004
Messages
2,674
Have not seen that on mine. My 2nd screen is only a 60 Hz model, which I keep in portrait mode for browsing. Nvidia has some weird bugs with multi-monitor which may or may not be killing the cards. I know multi-monitor runs ~55W idle, which is excessive. I have also seen notes that the Edge browser had 2D issues in the past; this could also be driver-related and hit other browsers as well.
 

evilpaul

Limp Gawd
Joined
Dec 31, 2016
Messages
183
I've had mine get stuck at 1350 MHz and ~70W idle and G-SYNC wouldn't turn back on a few times. It is usually idling at 300 MHz at ~32W. Temps seem to have dropped with the driver update this week. The power consumption numbers are what HWiNFO64 reports.

The left monitor is 4K 60 Hz. The right one is 1440p running at 144 Hz.
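For anyone who wants to log the same numbers without HWiNFO64, here is a minimal sketch that polls `nvidia-smi`, which ships with the driver. It assumes `nvidia-smi` is on the PATH and that the card reports all three fields; some cards return `[N/A]` for a field, which this sketch does not handle. The function names and the "stuck" threshold are made up for illustration, using the 300 MHz idle figure above as the default.

```python
import subprocess

# Query fields understood by nvidia-smi's --query-gpu option.
QUERY_FIELDS = "clocks.gr,power.draw,temperature.gpu"


def parse_sample(line):
    """Parse one CSV line such as '300, 32.15, 45' into a dict of floats."""
    clock_mhz, power_w, temp_c = (float(x) for x in line.split(","))
    return {"clock_mhz": clock_mhz, "power_w": power_w, "temp_c": temp_c}


def read_gpu_sample(gpu_index=0):
    """Ask nvidia-smi for the current core clock, power draw and GPU temp."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def looks_stuck(sample, normal_idle_mhz=300.0, factor=2.0):
    """Hypothetical check: core clock far above the usual idle clock."""
    return sample["clock_mhz"] >= normal_idle_mhz * factor
```

Call `read_gpu_sample()` every few seconds while the desktop is idle and print a warning when `looks_stuck` returns true. The check only means something when nothing is actually loading the GPU.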
 

lostin3d

[H]ard|Gawd
Joined
Oct 13, 2016
Messages
2,043
I've had mine get stuck at 1350 MHz and ~70W idle and G-SYNC wouldn't turn back on a few times. It is usually idling at 300 MHz at ~32W. Temps seem to have dropped with the driver update this week. The power consumption numbers are what HWiNFO64 reports.

The left monitor is 4K 60 Hz. The right one is 1440p running at 144 Hz.

I could easily be wrong, but that sounds like the driver issue that NV recently put a hotfix out for. If so, I wouldn't be surprised if some situations are still occurring. What I remember is that it had something to do with two-monitor, G-SYNC and non-G-SYNC combos.
 

schlitzbull

Limp Gawd
Joined
Feb 19, 2014
Messages
433

Going, Going...Gone.

Ugh, mine started locking up like this while playing last night. I didn't get any artifacting or crashes, but brief lockups that definitely seemed different from frame drops. Plus, the game I was playing wasn't taxing for the card either. As with the other issues I've had, this was using GameStream to my TV while playing at a locked 60 Hz. I hope your video isn't a sign of things to come for me. At least I haven't sold my 1080 yet.
 

3dfan

Limp Gawd
Joined
Jun 2, 2016
Messages
169

Going, Going...Gone.

to everyone:

Please excuse me if this is a dumb or not very technical statement, since I may not be as experienced as many users here, or if this was already discussed; I haven't read the whole thread (I just searched for "1709", "1803" and "1809" with no results). This video, and another I recently watched (I don't remember exactly where) in which an RTX user has similar issues in Shadow of the Tomb Raider, with a lot of in-game pauses just like in this video, remind me of a test I did recently. On a spare HDD with a clean install of Windows 10 Pro version 1809 (build 17763.55) with the latest updates, and with NV drivers 416.16, 416.34, and even the latest 416.81, on a system with a GTX 980 Ti, an i7 4770K @ 4.2 GHz and 8 GB of RAM, I experienced very similar issues in Rise of the Tomb Raider at full settings, no AA, 1920x1080, DX12 API: the game has a lot of in-game pauses similar to those videos. The interesting thing, however, is that the issue does not occur on a clean installation of Windows 10 Pro version 1709 using the same settings, same hardware, same drivers, and same game scene.

I have the feeling that in 1809 there may be something broken with VRAM management, since those in-game pauses are typical of a video card running out of VRAM. It would be interesting to know whether there are RTX users on version 1709 having all those issues that seem related to VRAM (visual artifacting, pauses, etc.).

So even though my test was not performed with an RTX card, I think it is worth sharing my experience. Since 1809 is known to be a very rushed update, who knows? It could be a potential culprit in all this RTX mess. In fact, I have experienced other gaming issues with 1809, like in The Evil Within 2, which has some random mouse stutters when I play it at 75 fps @ 75 Hz, an issue that also does not occur in 1709.
 

zehoo

Limp Gawd
Joined
Aug 22, 2004
Messages
379
It's stories like this that make me glad I decided to cheap out and stick with my gtx 1080 while waiting for 7nm parts.
 

Slade

2[H]4U
Joined
Jun 9, 2004
Messages
2,674
Interesting, running 1803 here. So far it still works. Makes me hesitant to go to the Fall update...
 

bill_d

Limp Gawd
Joined
Jun 8, 2007
Messages
193
to everyone:

Please excuse me if this is a dumb or not very technical statement, since I may not be as experienced as many users here, or if this was already discussed; I haven't read the whole thread (I just searched for "1709", "1803" and "1809" with no results). This video, and another I recently watched (I don't remember exactly where) in which an RTX user has similar issues in Shadow of the Tomb Raider, with a lot of in-game pauses just like in this video, remind me of a test I did recently. On a spare HDD with a clean install of Windows 10 Pro version 1809 (build 17763.55) with the latest updates, and with NV drivers 416.16, 416.34, and even the latest 416.81, on a system with a GTX 980 Ti, an i7 4770K @ 4.2 GHz and 8 GB of RAM, I experienced very similar issues in Rise of the Tomb Raider at full settings, no AA, 1920x1080, DX12 API: the game has a lot of in-game pauses similar to those videos. The interesting thing, however, is that the issue does not occur on a clean installation of Windows 10 Pro version 1709 using the same settings, same hardware, same drivers, and same game scene.

I have the feeling that in 1809 there may be something broken with VRAM management, since those in-game pauses are typical of a video card running out of VRAM. It would be interesting to know whether there are RTX users on version 1709 having all those issues that seem related to VRAM (visual artifacting, pauses, etc.).

So even though my test was not performed with an RTX card, I think it is worth sharing my experience. Since 1809 is known to be a very rushed update, who knows? It could be a potential culprit in all this RTX mess. In fact, I have experienced other gaming issues with 1809, like in The Evil Within 2, which has some random mouse stutters when I play it at 75 fps @ 75 Hz, an issue that also does not occur in 1709.

I'm on 1809. I upgraded both systems on the first day and could not roll back by the time problems showed up,
but my 2080 Ti Strix is running fine, and so is my 1080 Ti Strix on the other system.
 

Nytegard

2[H]4U
Joined
Jan 8, 2004
Messages
3,326
I'm getting those stutters too in Call of Duty when I have SLI enabled. The game might graphically freeze for 10-20 seconds at a time before returning to normal. With SLI disabled they went away, but I'm hoping this is just a driver problem, as I haven't seen any other issues.
 

FrgMstr

Just Plain Mean
Staff member
Joined
May 18, 1997
Messages
51,122
I ordered a thermal imaging camera to look into this a little closer.

Also, I talked to a couple of people that actually BUILD video cards and I got the same feedback as to what I expected.....memory issues. And yes the card does have Micron on it.
 

Mchart

2[H]4U
Joined
Aug 7, 2004
Messages
4,082
I've had mine get stuck at 1350 MHz and ~70W idle and G-SYNC wouldn't turn back on a few times. It is usually idling at 300 MHz at ~32W. Temps seem to have dropped with the driver update this week. The power consumption numbers are what HWiNFO64 reports.

The left monitor is 4K 60 Hz. The right one is 1440p running at 144 Hz.

Multi monitor is reported as causing problems and I don't believe the latest update fixed it at all. The only issue fixed is the g-sync related BSOD currently.
 

cjcox

[H]ard|Gawd
Joined
Jun 7, 2004
Messages
1,795
But it was the greatest 2 hours of gaming of my life. Looking forward to Nvidia365 where we can rent our cards.
 

Slade

2[H]4U
Joined
Jun 9, 2004
Messages
2,674
Guess a bad batch of Micron memory is high on the suspect list, or a cooling implementation failure.
 

Gripen90

n00b
Joined
Aug 21, 2012
Messages
30
My Gainward RTX 2080Ti Phoenix Golden Sample finally got returned for RMA, it should arrive at the retailer tomorrow and I am curious if they'll just refund me. They have said that because of the many RTX 2080Ti issues they are currently not restocking those models.

Also Nvidia are changing the prefix on the cards now.
 