Old RTX 2080 Ti FE Meets Replacement RTX 2080 Ti FE

FrgMstr

After our first RTX 2080 Ti Founders Edition went into Space Invaders mode and we did a bit of testing, we filed an RMA with NVIDIA, since we did purchase the card. We got our replacement card in on Monday. Our "new" RTX 2080 Ti FE card is equipped with Samsung GDDR6 instead of Micron. Luckily, we bought two 2080 Ti FE cards and still have one here with Micron VRAM to test side by side against the new Samsung VRAM card.

GPUz Side by Side.


This VRAM change has of course been reported elsewhere, but we can now confirm it firsthand. The $64K question is, of course: does this point to the issue with all the failing 2080 Ti FE cards being VRAM related? Space Invaders suggests quite possibly yes, as it looks very much like a VRAM failure, but according to NVIDIA this is just a "Test Escape." We are still digging.

We are, however, sure that the Micron VRAM is running at a very toasty 86C under a very normal workload on an open test bench this evening.

FLIR Micron Memory Temp.
 
It may be posted elsewhere; if it is, I apologize, I haven't seen it.

Any chance we could get a thermal shot of the Samsung vs Micron under similar conditions? I know you would have to pull the card and re-establish the test case, it's purely a curiosity thing.
 
Man, if the first batch of cards shipped with bad RAM, what a disaster that has to be for Nvidia.
Also, how long does it take to change the RAM out on cards? Kinda makes you wonder if they knew before the release date that they had a potential problem. I imagine it takes at least a few months to find and make board-level changes and get them shipping.
 
Man, if the first batch of cards shipped with bad RAM, what a disaster that has to be for Nvidia.
Also, how long does it take to change the RAM out on cards? Kinda makes you wonder if they knew before the release date that they had a potential problem. I imagine it takes at least a few months to find and make board-level changes and get them shipping.

In theory, the VRAM module should just be a drop-in replacement; NVIDIA has almost certainly standardized the on-board packaging for exactly this reason.
 
This isn't the first time Micron VRAM has been suspect on an Nvidia product.

Anyone remember the 1070 overclocking disappointments among people who had Micron-equipped cards? Yes, it's overclocking, and there weren't any failures at stock speeds, but still, it seems to indicate a weak link; the Samsung-equipped cards consistently did much better.

https://hardforum.com/threads/nvidia-gtx-1070-vram-lottery-micron-or-samsung.1908758/
 
I have a Gigabyte Windforce 2080 Ti. Unfortunately it has Micron VRAM. Hopefully my card doesn't take a shit. So far, no problems to report after extensive 4K gameplay and benchmarking.
 
Both cards Micron. Been at 2 and 4 weeks for my cards. So far so good. Could be a bad batch of RAM. I know that I can't push the RAM on either card past 575 MHz or it starts getting small artifacts in testing.

I wonder what the Samsung mem OC limit is.
 
Thank you for the honest, independent investigation and analysis, Kyle. I saw a user on the NVIDIA forums singling [H] out and accusing it of fearmongering regarding this issue. That must mean you're doing something right (y). I am very curious as to whether there are any clearly distinguishable differences between the two VRAM types.
 
While under 90°C might be in spec (I'm guessing), I still don't think any engineer would consider it healthy long term.

I wonder, does GDDR6 just run that hot, or is it a question of better/proper cooling?

Also, there might be hardware faults we are not privy to with the Micron RAM; stuff that goes deeper than just temps.
 
I have a Gigabyte Windforce 2080 Ti. Unfortunately it has Micron VRAM. Hopefully my card doesn't take a shit. So far, no problems to report after extensive 4K gameplay and benchmarking.
All first run cards have Micron memory because NVIDIA had an exclusive deal with them for this launch, as far as I'm aware.
 
Kind of off topic, but it made me think of this thread - I just bought a GTX 1070 on eBay; I'm assuming it was used in cryptomining, and I'm assuming it has a modded BIOS. So I went looking up the BIOS versions that are out on the interwebs. There's a whole lot of chatter about cards with Micron memory being shit overclockers, and at least Asus and Zotac released BIOS updates to fix instability with Micron memory on the GTX 1070.

Sorry if this is old news to you guys; I'm coming off of a Radeon HD 6900 series, been on the sidelines a long time.
*edit*
Point being, this isn't nVidia/Micron/Samsung's first rodeo. Which IMO reflects even worse on them.
 
While under 90°C might be in spec (I'm guessing), I still don't think any engineer would consider it healthy long term.

I wonder, does GDDR6 just run that hot, or is it a question of better/proper cooling?

Also, there might be hardware faults we are not privy to with the Micron RAM; stuff that goes deeper than just temps.

Engineer here.....thinking a number is high because it sounds high to you doesn't make it high. Even shit silicon is performance rated for a 105 deg C junction temperature, and that temperature will result in a typical FIT rate. Most FIT rates, even for RAM, are pretty benign in a singular sense, but in aggregate they can indicate to a manufacturer the warranty and overall failure rates. Regardless, these "burnouts" have little to do with the junction temperature...something else is going on. The only time you need a junction temperature lower than 105C is when your die is so shit it cannot meet its performance requirements and gets binned into "super shit class" (0-40C ambient).

Something else is going on here besides the silicon itself.....conflating silicon temp and these board failures may end up leading you down the incorrect rabbit hole. I do not believe what is happening is intentional, but it could be a culmination of multiple factors resulting in "uh oh".
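For anyone unfamiliar with FIT rates: FIT is failures per billion device-hours, so the aggregate-failure math the engineer mentions is one multiplication. A minimal sketch in Python; the 10 FIT per chip, 11 chips per card, and 100,000-card fleet are hypothetical numbers for illustration, not figures from this thread.

```python
# FIT (Failures In Time) = expected failures per 1e9 device-hours.
# A benign per-device FIT rate still adds up across a large fleet.

def expected_failures(fit, devices, hours):
    """Expected failure count for a fleet of identical parts."""
    return fit * devices * hours / 1e9

# Hypothetical: 11 GDDR6 chips per card, 100,000 cards,
# one year of 24/7 operation, assumed 10 FIT per chip.
chips = 11 * 100_000
print(expected_failures(10, chips, 24 * 365))  # roughly 96 failures fleet-wide
```

Which is why a manufacturer watches FIT in aggregate: the per-card risk is tiny, but the warranty exposure across a launch is not.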
 
I have a Gigabyte Windforce 2080 Ti. Unfortunately it has Micron VRAM. Hopefully my card doesn't take a shit. So far, no problems to report after extensive 4K gameplay and benchmarking.

I have a Zotac 2080 Ti AMP that also has Micron RAM. As these problems are primarily on the FE cards, especially the "Space Invaders" effect, I too am crossing my fingers/toes/eyes.....but so far it's a shit-ripper....and playing BF5 at 3440x1440 ultra works as smooth as butter on my Predator G-Sync monitor....but I'm still following this pretty closely.....yikes!!
 
Nvidia sure likes to run the memory toasty warm on their cards.

01-PCB.jpg
 
It may be posted elsewhere; if it is, I apologize, I haven't seen it.

Any chance we could get a thermal shot of the Samsung vs Micron under similar conditions? I know you would have to pull the card and re-establish the test case, it's purely a curiosity thing.

Regrettably, a FLIR shot of the Micron may not reveal the weakness of the memory. It could be how the memory is patterned internally. It's a matter of different silicon, where a single trace or two could be causing the failure.

So far, all the major manufacturers are still reporting 1% return rates, which is normal. This might all be a negative attention bias effect: people who are affected are more likely to report negative results than positive ones, so the bias leans toward the negative. I'm not a fan of NVIDIA, but I think the jury is still out; it's too early to determine. If there is a large defect return rate, they will have a legal obligation to present those losses to investors next quarter in the form of write-downs. (GAAP rules)

That said, the story should not be ignored and more data is always useful and appreciated at this early stage.
 
Engineer here.....thinking a number is high because it sounds high to you doesn't make it high. Even shit silicon is performance rated for a 105 deg C junction temperature, and that temperature will result in a typical FIT rate. Most FIT rates, even for RAM, are pretty benign in a singular sense, but in aggregate they can indicate to a manufacturer the warranty and overall failure rates. Regardless, these "burnouts" have little to do with the junction temperature...something else is going on. The only time you need a junction temperature lower than 105C is when your die is so shit it cannot meet its performance requirements and gets binned into "super shit class" (0-40C ambient).

Something else is going on here besides the silicon itself.....conflating silicon temp and these board failures may end up leading you down the incorrect rabbit hole. I do not believe what is happening is intentional, but it could be a culmination of multiple factors resulting in "uh oh".

Yes and no. Resistance is a function of voltage with silicon. And modern-day memory uses LESS VOLTAGE, so you shouldn't be seeing high temps. You might fall out of the operating window with anything above 85-90C. The 105C limit is just where the silicon actually starts to degrade, not just fail. That's a huge differentiation. For example: my chips are spec'd to run at 16V. I can run them up to 18V at 100C. If I go over that, I will damage them.

Cooler is always better for reliability (to a certain point). That's why hardcore overclockers use LN2.

The only way to tell for sure is to have engineering test samples with test probes hooked up to the various lines to see where they are failing.
 
95 degrees is the high side of spec.
All the EVGA cards are running in the 80s and low 90s.
 
A lot of armchair experts here... "Well, it might be this!?!" "It might be that!?!"

They used Micron for the first production run; it's coincidence that the bad cards all got Micron, because they ALL got Micron, and more recent manufacture has something else because that is what was available. Making the leap that Micron RAM is a causal factor isn't backed by any facts that I am aware of...

Engineer here.....thinking a number is high because it sounds high to you doesn't make it high. Even shit silicon is performance rated for a 105 deg C junction temperature, and that temperature will result in a typical FIT rate. Most FIT rates, even for RAM, are pretty benign in a singular sense, but in aggregate they can indicate to a manufacturer the warranty and overall failure rates. Regardless, these "burnouts" have little to do with the junction temperature...something else is going on. The only time you need a junction temperature lower than 105C is when your die is so shit it cannot meet its performance requirements and gets binned into "super shit class" (0-40C ambient).

Something else is going on here besides the silicon itself.....conflating silicon temp and these board failures may end up leading you down the incorrect rabbit hole. I do not believe what is happening is intentional, but it could be a culmination of multiple factors resulting in "uh oh".

This is much more plausible; however, I think it is other components failing, not the RAM. The card that caught fire had the issue at the very back end of the card, and there is no RAM there. We do not have enough evidence to know what exact parts are the cause, but the one pinpointed failure, the fire, was not a RAM module. All this speculation about the RAM is lemmings falling into a rabbit hole...

"Hey yeah, it could be.."

'Oh yeah, this might be.."

blah blah bullshit
 
Yes and no. Resistance is a function of voltage with silicon. And modern-day memory uses LESS VOLTAGE, so you shouldn't be seeing high temps. You might fall out of the operating window with anything above 85-90C. The 105C limit is just where the silicon actually starts to degrade, not just fail. That's a huge differentiation. For example: my chips are spec'd to run at 16V. I can run them up to 18V at 100C. If I go over that, I will damage them.

Cooler is always better for reliability (to a certain point). That's why hardcore overclockers use LN2.

The only way to tell for sure is to have engineering test samples with test probes hooked up to the various lines to see where they are failing.

I'm sorry..but pretty much everything you just stated sounds smart enough to be right but is wrong on so many levels. I don't even know where to start with the "resistance is a function of voltage with silicon" statement.
 
I'm sorry..but pretty much everything you just stated sounds smart enough to be right but is wrong on so many levels. I don't even know where to start with the "resistance is a function of voltage with silicon" statement.

https://www.allaboutcircuits.com/textbook/direct-current/chpt-12/temperature-coefficient-resistance/

As you were saying? While the silicon's resistance can decrease, the interconnect points, where it isn't silicon, can increase dramatically, and that creates issues. It leads to things like signal reflection. But I should have said "the die" rather than "the silicon," which would have been technically more correct.
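For anyone who doesn't want to read the linked textbook chapter: the relation it covers is the first-order temperature coefficient of resistance, R(T) = R_ref * (1 + alpha * (T - T_ref)). A quick sketch in Python; the 1-ohm trace and the 86C figure (the temp from the FLIR shot) are illustrative assumptions, and the coefficient used is the standard one for copper.

```python
# First-order temperature coefficient of resistance:
#   R(T) = R_ref * (1 + alpha * (T - T_ref))
# Metals like the copper in traces and interconnects have a positive
# coefficient (resistance rises with temperature); doped silicon can
# behave differently, which is the distinction argued about above.

ALPHA_COPPER = 0.00393  # per deg C, standard value for copper

def resistance_at(r_ref, alpha, t, t_ref=20.0):
    """Resistance at temperature t, given r_ref measured at t_ref."""
    return r_ref * (1 + alpha * (t - t_ref))

# A hypothetical 1.0-ohm copper path at 86 C vs a 20 C reference:
print(resistance_at(1.0, ALPHA_COPPER, 86))  # about 1.26 ohms, a ~26% rise
```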
 
https://www.allaboutcircuits.com/textbook/direct-current/chpt-12/temperature-coefficient-resistance/

As you were saying? While the silicon's resistance can decrease, the interconnect points, where it isn't silicon, can increase dramatically, and that creates issues. It leads to things like signal reflection. But I should have said "the die" rather than "the silicon," which would have been technically more correct.

https://www.quora.com/What-is-the-relationship-of-temperature-with-voltage-and-current

edit: I always hated that crap in physics; I personally have no idea, just casually throwing gas on the fire for my viewing pleasure.
 
Not sure either. Just saw it elsewhere. Maybe that differentiates the revised cards?
Samsung replacement card serial number - 0324518088XXX
Micron RMA'd card serial number - 0323918014XXX

Looking at the BIOS versions on the cards, you have to wonder if the Micron card has a beta on it, 90.02.0B.00.0E. The Samsung card's is all numbers...
 
What's the best way to test the memory? Got a 2080 Ti XC from EVGA, and it has Micron RAM. Played World of Warcraft yesterday, no issues.
 
What's the best way to test the memory? Got a 2080 Ti XC from EVGA, and it has Micron RAM. Played World of Warcraft yesterday, no issues.
Not sure there is a way to "test" the memory. I did an 8-hour Heaven stress test with my card when I installed it, no issues. Just used it for gaming from then on. I have not seen any reports of being able to diagnose or predict an impending failure.
 
What's the best way to test the memory? Got a 2080 Ti XC from EVGA, and it has Micron RAM. Played World of Warcraft yesterday, no issues.

EVGA has an artifact scanner (there's a check box to check for artifacts) - EVGA OC Scanner X. Also, there's a video card memory test called Check Flash (ChkFlsh).
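For what it's worth, most artifact/memory scanners come down to the same idea: write known bit patterns, read them back, and count mismatches. Here is a minimal sketch of that technique in Python, run against an ordinary system-RAM buffer purely for illustration; real VRAM testers do this on the GPU itself, and the patterns and buffer size here are arbitrary choices, not from any specific tool.

```python
# Minimal pattern-based memory test: write a pattern across a buffer,
# read it back, and count any bytes that don't match. Stuck or flaky
# bits show up as nonzero error counts.

def pattern_test(buf_size=1 << 16):
    """Return the number of read-back mismatches across all patterns."""
    patterns = [0x00, 0xFF, 0xAA, 0x55]  # zeros, ones, alternating bits
    buf = bytearray(buf_size)
    errors = 0
    for p in patterns:
        for i in range(buf_size):   # write pass
            buf[i] = p
        for i in range(buf_size):   # read-back pass
            if buf[i] != p:
                errors += 1
    return errors

print(pattern_test())  # healthy memory reports 0
```

The 0xAA/0x55 pair flips every bit between passes, which is why it is a classic choice for catching bits stuck at 0 or 1.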
 
I got my replacement Zotac AMP in; it still has the Micron memory, and the S/N is identical besides the last two digits (34 numbers higher).

Now I'm afraid to touch the memory or the power limit. I am thankful it at least got through an hour of FC4 with no issues....
Honestly, I think I would push it inside of warranty limits. I would like to know if I got another "bad" card sooner rather than later. I would want to see it fail inside of warranty, assuredly. My 2 cents, you may need change.
 
https://www.allaboutcircuits.com/textbook/direct-current/chpt-12/temperature-coefficient-resistance/

As you were saying? While the silicon's resistance can decrease, the interconnect points, where it isn't silicon, can increase dramatically, and that creates issues. It leads to things like signal reflection. But I should have said "the die" rather than "the silicon," which would have been technically more correct.

You just admitted to conflating things that didn't match your original assertion. You said voltage affected resistance...and then you give me a link about how temperature affects resistance. *boggle* Again...now with the signal reflection comment on top of it....seriously...just stop. And the "die" is the die...there is no bonding or packaging at that point...the die IS the result of the process on the silicon. I know you are trying really hard, but writing outside of your experience and knowledge in this field is not productive, as when you responded to my original post. As an FYI, the reason they went away from lead-attach bonding wires wasn't heat..heat was a factor, but a very small one.
 
You just admitted to conflating things that didn't match your original assertion. You said voltage affected resistance...and then you give me a link about how temperature affects resistance. *boggle* Again...now with the signal reflection comment on top of it....seriously...just stop. And the "die" is the die...there is no bonding or packaging at that point...the die IS the result of the process on the silicon. I know you are trying really hard, but writing outside of your experience and knowledge in this field is not productive, as when you responded to my original post. As an FYI, the reason they went away from lead-attach bonding wires wasn't heat..heat was a factor, but a very small one.

I went off on a tangent with signal reflection. That came from the fact that interconnects are of a different resistance than the silicon, and that causes heat and signal reflection issues.

That said, I am correct. The package as a whole changes as the resistance changes with heat, and my link backs that up. You said otherwise.
 
I went off on a tangent with signal reflection. That came from the fact that interconnects are of a different resistance than the silicon, and that causes heat and signal reflection issues.

That said, I am correct. The package as a whole changes as the resistance changes with heat, and my link backs that up. You said otherwise.


Do you mean that the collisions from the differing resistances (interconnects vs. silicon) are what's causing the heating, and that the heat and vibrations in turn increase resistivity, with artifacts as the end result, including from the reflections of all this happening?

If so, I could buy into that as one possible theory...

EDIT:
This covers temperature coefficient of resistance somewhat well:
https://www.allaboutcircuits.com/textbook/direct-current/chpt-12/temperature-coefficient-resistance/
 
Do you mean that the collisions from the differing resistances (interconnects vs. silicon) are what's causing the heating, and that the heat and vibrations in turn increase resistivity, with artifacts as the end result, including from the reflections of all this happening?

If so, I could buy into that as one possible theory...

EDIT:
This covers temperature coefficient of resistance somewhat well:
https://www.allaboutcircuits.com/textbook/direct-current/chpt-12/temperature-coefficient-resistance/

Lol. That's the link I posted a couple of posts back.
 