NVIDIA on the Cause of RTX 2080 Series Card Failures

So the big questions now are:
* Have the QC procedures been fixed?
* When will cards that have been made post-fix enter the market?
* Will the post-fix cards be identifiable to consumers, via new model number, box labeling, etc?
* Will the cards that were made during the problematic QC process be recalled, or will they be left in stock for customers to roll the dice with?

I know one thing for sure, I'm definitely taking a pass on this first RTX generation.
 
I'm glad I got the Titan Xp when it came out; it's been flawless watercooled. I'd be pissed if I'd imported a DOA card from NVIDIA, for sure.
 
Considering the size and complexity of the chips, I think the odds of bad ones falling through the cracks are high. I think the EVGA fire was a manufacturing failure, given the specific spot where the board fried.
 
I know you know what passive cooling is so I'm not going to insult your intelligence. Your card is being air cooled. I was responding to a card being passively cooled.

I think there might be some confusion depending on the card, too. As far as I am concerned, if the VRM has no heatsink contact and is only cooled by the cooler's exhaust air as a byproduct, then it is still being passively cooled, since it doesn't have dedicated cooling.
 
The test escape explanation, it seems to me, would only apply to one vendor: nvidia's own branded cards in this case. I wouldn't expect it to affect ASUS, EVGA, or any other brand unless they are made in the same factory.

But they are owning up to it, gotta hand it to them for that.
 
Nvidia: "Unless you were using the G-OS and browsing the web with G-scape Navigator using a G-SYNC monitor while downloading a GameWorks game with an Nvidia Shield in the room, your warranty on the RTX series is null and void."

In all seriousness, though, I'm not looking forward to hearing about some family dying in a fire because someone left their gaming rig idle over night with an RTX series hooked up. Something is definitely up with these cards, too many problems. It doesn't pass the sniff test.
 
Hey Kyle, quick question. If I am understanding the NVIDIA rep correctly, it comes across like a fancy pants way of saying "We had shit QC" is the cause, but you said you suspected that they were lying. Are you under the impression that the 2080ti is just straight up poorly designed, or am I misunderstanding what the rep was saying?
 
Man, I hate nVidia with a passion, but even I was willing to give them some applause for this... Yet to come here and see everyone so suspicious... leaves me wondering what I missed. I'd have been all over jumping on this bandwagon, but I felt it was a fairly legitimate reason. That's on TOP of them seeming to, as said, own up to this.

These cards seem to be working ....... Then deteriorating and failing.

I don't buy that answer. What they said would align more with DOA cards
See, I don't see it that way... A part not in spec doesn't mean it's DOA or that it would instantly fail... Not in spec means it still is a functioning part. Look at it this way, you buy a lower end CPU and overclock it to the specs of a higher performance model. If it was capable of running at those speeds, it likely would've been. NOW, yes, there are instances like with Ryzen, where pretty much every die is a good one and wouldn't need to be down-binned to produce lower cost models, but it still happens more often than not. In this instance, your CPU is going to do fine, and may do fine for years this way. That wasn't always the case though, and overclocking did have ramifications that would kill a chip quicker.

Here's another great example. Once upon a time, a capacitor company thought it had bought, on the black market, the secret recipe for the electrolytic solution in a high-end competitor's capacitors... This company went on to produce god knows how many capacitors using this recipe, thinking they were going to be rolling in the cash as well. Not too long down the line, products that featured these caps started to fail prematurely. They ran fine up until then, but eventually the electrolyte boiled off and (presumably) led to thermal runaway, causing them to burst or leak.

Same situation here, as I see it. Resistors are going to need a specific thickness of metal film, no doubt with a specific metallurgical %, otherwise they won't have the correct resistance or may degrade over time and drift from the intended value. Increased resistance will mean increased heat under load, and when it gets hotter, it degrades faster, exacerbating things. Pair that with a part that isn't supposed to get hot, and as such doesn't have any kind of cooling... and as Human Torch would say: "Flame on!"

Even if it isn't that extreme, it could be a situation with it providing too much/little resistance in the circuit for power delivery, in the configuration sense only, resulting in the provided voltage being too low/high. Perhaps that power circuit is feeding the memory and in turn you now have like what Kyle experienced since the memory was not getting the right amount of voltage, and caused artifacts+[H]ard locking the system.
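To put rough numbers on the "configuration" resistor idea above, here is a minimal sketch in Python. It assumes a generic adjustable regulator whose output is set by a feedback divider, Vout = Vref * (1 + R_top / R_bottom). Nothing here is taken from the actual 2080 Ti power design; the reference voltage and resistor values are hypothetical, picked only to show how a drifting resistor moves the output.

```python
# Sketch only: generic feedback-divider relation for an adjustable regulator.
# All component values below are hypothetical, not from any RTX board.

def regulator_vout(vref: float, r_top: float, r_bottom: float) -> float:
    """Output voltage set by a feedback divider: Vref * (1 + R_top / R_bottom)."""
    return vref * (1.0 + r_top / r_bottom)


def vout_after_drift(vref: float, r_top: float, r_bottom: float,
                     top_drift_pct: float) -> float:
    """Output voltage after the top resistor drifts by the given percentage."""
    drifted_top = r_top * (1.0 + top_drift_pct / 100.0)
    return regulator_vout(vref, drifted_top, r_bottom)


if __name__ == "__main__":
    VREF = 0.6           # hypothetical reference voltage, volts
    R_TOP = 10_000.0     # hypothetical top resistor, ohms
    R_BOTTOM = 8_000.0   # hypothetical bottom resistor, ohms

    nominal = regulator_vout(VREF, R_TOP, R_BOTTOM)
    for drift in (-10.0, 0.0, 10.0):
        v = vout_after_drift(VREF, R_TOP, R_BOTTOM, drift)
        print(f"R_top drift {drift:+.0f}% -> Vout {v:.3f} V "
              f"({v - nominal:+.3f} V vs nominal)")
```

The point is only that an out-of-spec resistor in the feedback path shifts the rail without anything looking "dead", which fits the idea of a card that works and then misbehaves.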


Maybe it's the R005 resistors like what was right in the spot that caught on fire. Maybe Kyle's card has one of those somewhere in the memory circuit.
We may never know the real reason, unfortunately. I still give some kudos to nVidia for their admission, though.
 
I work in manufacturing. I'm going to leave it at the tinfoil hattery isn't unfounded.

Don't their parts vendors take care of all of that? They're supposedly equipped to do a very good job of it, and would want to. Could NVidia be buying cheapo parts? They'd have no motivation to do so on high-end boards, certainly.

I wish I could share your idealism. And quite the contrary, cheap parts + don't pass the savings onto the consumer = more profit.
 
All you 'tech jesus said' types might want to re-evaluate parroting his views without digging every time he says something contrary to what people who have been around a lot longer know.
'All RMAs are fine, nothing to see here'. You almost never, ever hear of that being admitted before a public statement is made (e.g. Nvidia). Even then it'll be well hidden if at all, especially if not vendor specific.

Calling it test escapes could be any component on the board that went bad from the cheapest part to the core. That's not letting the public know what was wrong with the cards and what steps they are taking to improve quality. People want to know so that they can feel confident in purchasing NVIDIA products and know that it will not start a literal fire.
I would surmise that the test escape was the Micron RAM, or something directly related to controlling that RAM chip, and that they have had to nuke the entire stock of those parts while they test it, hence the switch to Samsung. They don't want to burn Micron publicly for contract or supply relationship reasons, hence keeping it quiet while they figure things out.
The particular artifacting really looks ram related, anyone who has pushed things too much has seen similar behaviour before.
Also keep in mind the high temperatures people have measured, we could be seeing a part out of tolerance and running too hot, impacting ram temp or stability and causing errors.

Who's cooling a reference RTX 2080 series card passively under load?
My next build will be a 100% passive HEDT with flagship GPU.
 
With silicon this large they undoubtedly pushed it out the door with defects just to fill pre-orders, figuring they would take care of it in the RMA process. The yields on something this massive can't be stellar.
 
Yeah, the teardown showed that the reference build could support fairly insane power levels if you kept the card cooled, and gamersnexus has shown that even the somewhat limited FE blower keeps all the parts (even on the cards that are failing) well within acceptable temp ranges.

Whoever nvidia contracted out to build these things fucked up big time with the QC.

I thought NVidia did their own board design/builds now? that's how we got the FE version of the cards.

But I have a feeling this comes down to NVidia relying strictly on a board that needs high-end components in order to work properly. An overly complicated design like this leads to mistakes, and that's where we are seeing boards fail. I'm not 100% sold on the "bad" memory modules that were being rumored earlier, but it does make sense for at least some of the bad cards.
 
I'm leaning to bad memory batch that can't run at the spec requested. It seems the most likely given the sporadic nature of the problems. The [H] crew is sensitive to these issues because we play in the HEDT market and push our gear harder than most consumers do. nvidia can "afford" to take a little pr hit, while micron releasing bad pieces may cause larger ripples across the industry.
 
This is 100% a case of them rushing the product out the door without doing their due diligence in a testing environment. It has now come back to bite them in the bottom end and the bottom line. This is not the first time NVidia has screwed people over (Mobile 8X00 parts ring a bell?)
 
And for the fellow whose 2080ti board caught fire and is worried about an RMA: I'd think EVGA would be sending a courier to his house ASAP to get it for analysis, and dropping off a replacement while there. And a refund.
They should replace it, no doubt, and cover any other damage done, but it doesn't warrant a refund. While rare, high-powered electronics do catch fire. I had an MSI 780 catch fire years ago. I opened an RMA ticket and got a new card in a week, over Thanksgiving.
 
Looking at the PCB photos linked in the threads, this looks like the vias overloaded when the power supply section here failed.

There are 10 vias, looking like ~10 mil vias, so those are likely good for about 1.2A each, so 12A total.

This would raise the temperature to 42C, if it was alone, but power adds, so Wow, no wonder it burned.

I'm not printing that number until I check it.
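For anyone who wants to sanity-check the via arithmetic above, here is a rough sketch in Python. The via count and the ~1.2 A per via figure are the poster's eyeballed estimates from the PCB photos, not measured or datasheet values, and the example loads are hypothetical. It also assumes the current splits evenly across identical vias, which a real board will not do perfectly.

```python
# Sketch only: even current sharing across identical parallel vias.
# VIA_COUNT and AMPS_PER_VIA are the estimates quoted in the post above.

VIA_COUNT = 10
AMPS_PER_VIA = 1.2  # assumed rating for a ~10 mil via


def total_via_current(count: int, amps_each: float) -> float:
    """Total current budget if every via carries its rated share."""
    return count * amps_each


def amps_per_via(load_amps: float, count: int) -> float:
    """Current each via carries for a given load, assuming an even split."""
    return load_amps / count


if __name__ == "__main__":
    budget = total_via_current(VIA_COUNT, AMPS_PER_VIA)
    print(f"Estimated budget: {budget:.1f} A across {VIA_COUNT} vias")

    # Hypothetical loads, only to show how the per-via margin shrinks.
    for load in (8.0, 12.0, 20.0):
        each = amps_per_via(load, VIA_COUNT)
        status = "within" if each <= AMPS_PER_VIA else "over"
        print(f"{load:5.1f} A load -> {each:.2f} A per via ({status} estimate)")
```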

This power supply is going to be the issue; if it fails, the chips overheat, and it goes downhill fast.

This power supply is the Vboost for the Memory chips, and if this voltage is too low, the chips will run hot, and draw more power.


Nvidia says capacitor failures, but I don't see anywhere Near enough caps for this power level.

This is based on a similar design on the Titan, and it has 4 large caps right beside this part, and only has one PS section; this has two. And No Big Caps.

There seems to be two empty Capacitor pads right beside the L64 inductor label, those are pretty important. :)

There don't seem to be ANY large caps at all; all are 0201, or 0603 at most.


I can't believe there's no listing on how much power this memory chip draws; I've never seen a datasheet without it.

There are several references to low power, and lower voltage, but 1.25V at 10A is 12.5W, as is 1.5V at 8.33A, so lower voltage doesn't mean lower power.
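The "lower voltage doesn't mean lower power" point is just P = V * I, and a few lines of Python make it concrete. The two operating points are the ones quoted above; they are illustrative figures from the post, not datasheet numbers for the memory chips.

```python
# Sketch only: P = V * I for the two operating points quoted in the post.

def power_watts(volts: float, amps: float) -> float:
    """Electrical power in watts."""
    return volts * amps


def amps_for_power(watts: float, volts: float) -> float:
    """Current needed to deliver a given power at a given voltage."""
    return watts / volts


if __name__ == "__main__":
    for volts, amps in ((1.25, 10.0), (1.5, 8.33)):
        print(f"{volts:.2f} V at {amps:.2f} A = {power_watts(volts, amps):.2f} W")

    # Same power at the lower voltage means more current, not less.
    print(f"12.5 W at 1.25 V needs {amps_for_power(12.5, 1.25):.2f} A")
    print(f"12.5 W at 1.50 V needs {amps_for_power(12.5, 1.5):.2f} A")
```

So dropping the rail voltage only saves power if the current doesn't rise to compensate.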

This board needed twice the power level of the Titan, after all; there's two power sections.


Look at your boards; if all those cap pads are empty, there will be a recall.

Here are good pix:

https://xdevs.com/guide/evga_2080tixc/

Anyone want to measure across the R005 resistor for me, while it's running? :D That's a 0.005 ohm resistor, and is for current sensing for the controller chip...
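If someone does take that measurement, this is how to turn the reading into a current. It is a minimal sketch using Ohm's law with the 0.005 ohm value from the R005 marking. The 12 A example is only the via estimate from earlier in the thread, and the 40 mV "reading" is hypothetical.

```python
# Sketch only: Ohm's law for a current-sense resistor.
# R_SENSE comes from the "R005" marking (0.005 ohm); other numbers are examples.

R_SENSE = 0.005  # ohms


def sense_voltage(amps: float, r_ohms: float = R_SENSE) -> float:
    """Voltage drop across the sense resistor for a given current."""
    return amps * r_ohms


def current_from_reading(volts: float, r_ohms: float = R_SENSE) -> float:
    """Current implied by a measured voltage drop across the sense resistor."""
    return volts / r_ohms


def dissipation_watts(amps: float, r_ohms: float = R_SENSE) -> float:
    """Heat dissipated in the resistor itself, I^2 * R."""
    return amps * amps * r_ohms


if __name__ == "__main__":
    amps = 12.0  # the via-based estimate from earlier in the thread
    print(f"{amps:.1f} A -> {sense_voltage(amps) * 1000:.1f} mV across R005")
    print(f"{amps:.1f} A -> {dissipation_watts(amps):.2f} W in the resistor")

    reading_mv = 40.0  # hypothetical multimeter reading
    print(f"{reading_mv:.0f} mV reading -> "
          f"{current_from_reading(reading_mv / 1000):.1f} A")
```

Expect readings in the tens of millivolts, so the meter needs a decent millivolt range.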
 
They should offer a lifetime warranty, with no proof of purchase required. Any issue? Send it in (free), and they'll send a new one to you. ("Lifetime" should mean exactly that: your lifetime.)

No issue? Send it in, and they'll send a new one to you.

And some swag.

Let's see what they do...
 
Call me cynical, and look into my tinfoil hat for a second.

I wouldn't put it past nVidia to try to use this as a stunt to get back some of the credibility they lost over the GPP thing. Send out a knowingly bad batch of parts, admit there was an oversight, offer quick and painless replacements. You'll probably get mixed results, but the overall should work in favor of "look they're doing the right thing".

There's my biased theory for the day.
 
Obviously these cards were NOT ready for prime time. The great NVidia F**ked up royally. 1200 dollar cards failing, that's just pure BS..... Wow, AMD can't do drivers and Nvidia can't do hardware.. ha ha ha ha ha......
 
Very interesting. Hopefully mine holds up fine. I've run it through plenty of AIDA GPU tests with no issues and it doesn't feel particularly warm.
 
You know, I keep hearing how AMD can't do drivers. Ever since I got my Vega 56 I've been able to play 3 games on steam that wouldn't launch with my 1060 and also I can watch hockey highlights on nhl.com in full screen (using edge, I know, I know) when I couldn't with Nvidia.
 
Call me cynical, and look into my tinfoil hat for a second.

I wouldn't put it past nVidia to try to use this as a stunt to get back some of the credibility they lost over the GPP thing. Send out a knowingly bad batch of parts, admit there was an oversight, offer quick and painless replacements. You'll probably get mixed results, but the overall should work in favor of "look they're doing the right thing".

There's my biased theory for the day.
Put down the crack pipe and back away from the keyboard.
 
It's funny, companies always blame it on "Software/Computer Glitches".. Why don't they just say "we screwed up" and move on...... You can't change the past, only the future.
 
Test escapes can also refer to the GPU chip testing. There's an extensive set of test vectors that get run on an ASIC during the production process. There might be defects on the chip that don't get caught with current testing; then you'd have to add more test vectors to cover them. I've been retired for 13 years now (used to design ASICs and tests for them), so I'm probably behind the times on current tools, but I seem to remember that there are software tools that can detect parts of the chip that don't get covered by a set of tests. I'm sure nVidia is doing that and ensuring coverage. Maybe a particular test wasn't being run for some reason, or it was a strange timing issue. It would be nice to eventually know the true bottom line, but I doubt we'll get that unless someone has some inside connections at nvidia or the various board sellers/manufacturers.
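The coverage idea above can be shown with a toy model in Python. This is a sketch only: real ASIC fault-coverage tools work on netlists and fault models, not on hand-written sets, and the fault and test names here are made up for illustration. Anything left in the "uncovered" set is where a test escape could come from.

```python
# Sketch only: toy fault-coverage check. Fault and test names are hypothetical.
from typing import Dict, Set


def uncovered_faults(all_faults: Set[str],
                     tests: Dict[str, Set[str]]) -> Set[str]:
    """Faults that no test vector in the suite can detect."""
    covered: Set[str] = set().union(*tests.values())
    return all_faults - covered


def coverage(all_faults: Set[str], tests: Dict[str, Set[str]]) -> float:
    """Fraction of modeled faults detected by at least one test vector."""
    if not all_faults:
        return 1.0
    missed = uncovered_faults(all_faults, tests)
    return 1.0 - len(missed) / len(all_faults)


if __name__ == "__main__":
    faults = {"alu_stuck_at_0", "mem_ctrl_timing", "cache_bit_flip", "pll_lock"}
    suite = {
        "vector_a": {"alu_stuck_at_0", "cache_bit_flip"},
        "vector_b": {"cache_bit_flip", "pll_lock"},
    }
    print(f"coverage: {coverage(faults, suite):.0%}")
    print(f"possible escapes: {sorted(uncovered_faults(faults, suite))}")

    # Dropping a vector (e.g. a test not being run) widens the gap.
    del suite["vector_b"]
    print(f"coverage without vector_b: {coverage(faults, suite):.0%}")
```

Closing the gap means adding a vector that hits the uncovered fault, which is the "add more test vectors" step described above.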
 
Really Nvidia … They would never admit that the PCB itself is simply cheap and not well designed for the size power and heat output of the GPU. "Escapes" hm, easier to blame the would be cheap components...:)
 
We found that people that are having the most problems are putting these cards under too much stress. People using the cards for general productivity apps, web browsing, etc., will likely not see any problems.
 
We found that people that are having the most problems are putting these cards under too much stress. People using the cards for general productivity apps, web browsing, etc., will likely not see any problems.

Um... Yeah, just like the one that burst into flame while simply surfing web pages...

Kyle, longtime reader here. Love this site, best source of unbiased reviews on the planet.
 