Discussion in 'nVidia Flavor' started by Kyle_Bennett, Nov 9, 2018.
It took me a second to get it.
I'm on week two of my FE card still running fine. Hoping it stays that way, but who knows.
Damn. Which Asus model is that? A vanilla or one of the ROG STRIX?
It's the Dual-RTX2080TI-011G. It was a little over a month ago; I had a bunch of nowinstock alerts set at the time and was able to snag one the minute it was in stock at Newegg.
Thanks for your reply. I think that card uses a reference PCB, unlike the Asus ROG STRIX series. I know it has a different cooler setup, but it's probably still the nVidia reference PCB.
Agreed, both Kyle and Steve deserve RESPEK for their combative attitude towards corporate BS.
Well, my Gainward RTX 2080 Ti Phoenix Golden Sample, which I got Monday this week, also decided it wanted to be RMA'ed. Today, wanting to game, it suddenly displayed flashing star-like artifacts in all different colours, looking like something on a Christmas tree; afterward the image would stutter to a halt and then black screen, and only a hard reset would get me back into Windows. The card has now been packed and I'm awaiting an RMA number....
My wife's MSI RTX 2080 Duke OC, knock on wood, still lives.
Both my friend and I are on our 2nd 2080 Ti ... I sold my first system I built a few weeks ago. I gamed the hell out of that card and have gamed the hell out of this one as well.
I have many, many, many hours "overclocked" on both. The 2nd card, over the past 2+ weeks... 6 to 8 hours a day of Black Ops 4 Blackout.
The good news is, you guys will get new cards. The bad news, it's gonna be a hassle or two.
This is so sad... And despite the absurd pricing it is hard to get a good custom like EVGA FTW or Asus Strix here in Germany...
I don't really see much of an issue, as long as they honor their warranties. Probably some sort of electromigration issue; hopefully a slight revision in the stepping will solve it. Either way, it sucks for owners who have to deal with the RMA process on their new $1,000+ cards, and it really sucks for the guys who are slapping water blocks on day one and are just SOL (although some places still honor those warranties).
Electromigration after a few hours?
If the numbers pan out in line with [H]'s poll (yes, I know all the issues with the small poll) and failure is around the 10% mark after a few hours of use, then it speaks more to a manufacturing issue of some kind (GDDR, PCB, thermals).
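For what it's worth, here's a back-of-the-envelope look at how wide the error bars on a small poll like that really are. This is a minimal Python sketch with made-up numbers (the actual [H] poll counts aren't quoted in this thread), using the standard normal-approximation confidence interval for a proportion:

```python
import math

# Hypothetical numbers for illustration only -- the actual [H] poll
# counts aren't quoted in this thread.
n = 100          # assumed number of poll respondents
failures = 10    # assumed number reporting a dead card

p = failures / n                  # observed failure rate (0.10)
se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
margin = 1.96 * se                # ~95% normal-approximation margin

print(f"observed rate: {p:.1%}")
print(f"95% CI: {p - margin:.1%} .. {p + margin:.1%}")
# With n = 100 this prints roughly 4.1% .. 15.9%, i.e. a small,
# self-selected poll can't pin the true failure rate down very tightly.
```

Even before accounting for self-selection bias, a poll that size only narrows the failure rate to somewhere between a few percent and the mid-teens.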
I had two lockups with Hunt: Showdown. A Crytek rep from Korea on Steam told me to disable startup programs running at the same time as the game, which is an old PC trick from back in the Ultima Online or Quake days.
Yeah, it's very possible if the traces are not the proper size in some parts. Failures like this are very common on early die revisions. I wonder what stepping they are on. A, B, C? Most A-stepping dies have issues like this, and they are fixed with a major B stepping.
I'm fairly certain you won't get electromigration after a few hours.
I logged in after a long time just to say that this statement made me ... for the first time in my life ... spill out coffee over my keyboard.
edit: On another note... who is "HotHardware" anyway? I scrolled through 2 years of their videos on YT to be greeted by video thumbnails that cover basically everything about NVIDIA... and when it comes to AMD they have 1-2 vids about Ryzen CPUs and some review of a workstation PRO card.
Nothing about Vega, Polaris, etc., and no gaming stuff. Are they strongly affiliated with NV or is that just a very strange coincidence??
Ouch! Failures on new kit are never fun. I had an MSI 970 that was notorious for failures; after the 2nd RMA I just chose to write it off.
I nabbed an Asus 1080 Ti factory overclock for MSRP on launch last year, and I was a little wary of being in the first wave of a new generation. There are enough of these stories with the 2080s, and the lack of significant performance difference, that I feel no need to replace my 1080 Ti.
I sure wish I would have sold it to the cryptominers when they were going for over $1K - oh well...
This dude had his MSI die lol, and the comment right below it had a Gigabyte die. Maybe nvidia just rushed the big-ass die? Hopefully everyone who spent a shitload of money on these cards is taken care of.
No, and there is a theory that custom PCBs may fare better in regards to these issues. It may be the reference PCB design that has the issue, or the PCB itself, or the memory. On a custom PCB like the ASUS STRIX, issues don't seem to occur (to my knowledge), but this is all a theory at this point.
Then you don't understand what electromigration is. You are just confused because most retail parts don't have these issues, as they have been tested and revised to avoid them. Electromigration can happen at any point of a product's lifespan. If the die is flawed from the start, you can see failure on first use or hours after.
2080 Ti FE running since 10/5 with tons of multi-hour sessions of ARK: Survival Evolved under my belt. No issues.
Been playing a ton of BFV this weekend, too - no issues.
This. Most hardware failures occur in the first two weeks, look at bell curves.
Well, after a little over a week, no problems with my Strix yet.
Knock on wood and everything.
The internal metallization layers are susceptible to electromigration, physical traces are not, except at Extreme current densities; and 100A at 1.4V won't really electromigrate except as a cloud of copper vapor, if you're talking about interposer or PCB traces.
The power traces should be on a whole LAYER, called a Power Plane; they're connected by at least 1 via per pin to the die, vertically thru the PCB.
The internal metallization is, but the current densities on a 100-micron deposited metal interconnect are pretty extreme.
But this would be a hard fail; like poof, it's gone, and shorting a power supply rail due to metal vapor all over the chip, and volcanoing the chip. (the hole in the chip looks like a volcano; btdt) (I can't find a reference to the term, must be an Older American term.)
WTF? Hardware failures are Not a Bell curve, it's called a Bathtub Curve, and the lower end is "Infant Mortality", the upper end is "Wearout Phenomenon", which includes electromigration as a cause.
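For anyone curious about the shape being described: here's a quick, purely illustrative Python sketch of a Weibull hazard rate showing both ends of the bathtub. A shape parameter below 1 gives the falling "infant mortality" end, 1 gives the flat middle, and above 1 gives the rising "wearout" end. The shape and scale values are made up for the example, not fitted to any GPU data.

```python
# Illustrative Weibull hazard rates for the bathtub-curve discussion.
# Shape/scale values are invented for the example, not fitted to any GPU data.
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for label, shape in [("infant mortality", 0.5),
                     ("random failures", 1.0),
                     ("wearout", 3.0)]:
    rates = [weibull_hazard(t, shape, scale=100.0) for t in (1, 10, 100, 1000)]
    print(f"{label:>16} (k={shape}): " + ", ".join(f"{r:.5f}" for r in rates))

# k < 1: hazard falls over time (early/infant failures).
# k > 1: hazard rises over time (wearout, e.g. electromigration).
# Stack the two together and you get the bathtub shape -- not a bell curve.
```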
If that's a Ham license as a name, you need to retake some tests...
Electromigration in a new product is referred to by the Good Engineers in the group as "Lightbulb Effect", and you'll be needing a new job after one of those designs hit the field, lol.
Disclaimer: I design Electronics...
The heat shown in one of the thermal images shows it's the actual memory parts that are overheating. 90°C is not an operating temp for a memory chip, not for very long; memory errors are almost certain at those temps.
I looked up the datasheet; there's some serious bogosity there. Micron is saying the Case temperature is ok at 95°C, but the maximum Storage temperature is 125°C.
If the case is at 95 degrees, and the max is 125, that's BS; there's no way.
The datasheet says there's an ON DIE temperature sensor; anyone got data from one, on a card that's failing?
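As far as I know, the GDDR6 junction sensor on these consumer cards isn't exposed through NVIDIA's public NVML interface (tools like HWiNFO64 get at vendor-specific sensors), but the GPU core sensor is easy to poll if anyone wants a baseline while reproducing a failure. A minimal sketch using the pynvml bindings, assuming the 2080 Ti is GPU index 0:

```python
# Minimal GPU-core-temperature poll via NVML (pip install pynvml).
# Note: this reads the GPU die sensor only; the GDDR6 junction temperature
# asked about above is not exposed through this public API on these cards.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the 2080 Ti is GPU 0

try:
    for _ in range(10):
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU core temp: {temp_c} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```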
Don't know about the USA, but in Europe guys who are returning their 2080 Ti's are being told that the delays are because of unusually high return rates.
Maybe because their consumer protection laws don't permit lying to their customers?
If you cut out the lies, there wouldn't be any commercials here.
My understanding is that the EVGA XC (and XC Ultra) cards as well as the MSI Ventus cards use the reference boards. Has anyone found where these are failing?
Doesn't electromigration happen over time, as in eventually, as a result of actually running the IC?
Whoops, similar curve just inverted, aka wrong name; I don't do mathsy graph distribution stuff like that much in my area of work. Pretty clear what I meant though: most failures within the first two weeks would not be a bell curve now, would it..
Anyway, that's what I was told by a large manufacturer of laptop systems when I worked for them and also handled warranty + insurance claims. I have walked the walk over multiple launches, not just engineered it.
Could storage temp be higher because the memory doesn't have to reliably store data at that temp? Is it possible that it can run hotter, but the hotter it runs the more errors it may have?
To me the behaviour looks like what happens when you OC memory too much.
Storage temperature is a function of the silicon; over the storage temp (125°C) it starts rediffusing the N- and P-type material back into random silicon.
I've run 2N3055 transistors at 185 degrees C, and after a while, they aren't transistors anymore; at least, they won't turn OFF anymore.
That's why the really high power devices for Radio transmitters are still Toobz. Water Cooled Tubes, but still vacuum tubes.
The package temp is always lower than the core temp, and I'd love to know what the interior temp is when the outside is at 90C.
This video shows the WTF 90°C I'm talking about:
Also realize: this is the back side of the card, and there's a heat sink on the other side of the die; this is thru the PCB. :O
The max temp is a function of the complexity of the chip; finer architecture is more sensitive.
I'd bet memory is at the bleeding edge of complexity, like microprocessors.
I did find a weird effect, tho: a Flash-based Digital Potentiometer we used was suspected of a problem, so we tested it to failure in an environmental chamber.
It would fail after ~1M write cycles, and fail to overwrite 1's. We found by accident that heatsoaking it at 160C for an hour allowed us to write to it another 1M times, and it could be fixed multiple times.
We never expected that at all.
The guys that made the chip were like, "that voids the warranty, and we don't advise that", lol.
There are Silicon Carbide, Gallium Nitride, and Diamond transistors coming onto the market that are good to 1000°C+. When they make chips out of those, we won't need water blocks, we can use molten Tin cooling loops.
Good start. How do we make it work?
Is this symptomatic of what other people are seeing?
No, I haven't seen that on mine.
Have not seen that on mine. My 2nd screen is only a 60Hz model which I keep in portrait mode for browsing. nvidia has some weird bugs with multi-monitor which may or may not be killing the cards. I know multi-monitor runs ~55W idle, which is excessive. I have also seen notes about the Edge browser having 2D issues in the past; it could also be driver-related and hitting other browsers as well.
I've had mine get stuck at 1350 MHz and ~70W idle and G-SYNC wouldn't turn back on a few times. It is usually idling at 300 MHz at ~32W. Temps seem to have dropped with the driver update this week. The power consumption numbers are what HWiNFO64 reports.
The left monitor is 4K 60 Hz. The right one is 1440p running at 144 Hz.
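If anyone wants to catch that stuck-clock / high-idle-power state in a log instead of eyeballing HWiNFO64, here's a rough sketch along the same lines as the temperature poll above (pynvml again; GPU index 0, the filename, and the one-second interval are assumptions) that records graphics clock, memory clock, and board power:

```python
# Rough idle-state logger for the stuck-clocks / high-idle-power symptom above.
# Uses pynvml; GPU index 0, the filename, and the 1 s interval are assumptions.
import csv
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open("gpu_idle_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "gfx_clock_mhz", "mem_clock_mhz", "power_w"])
    try:
        while True:
            gfx = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
            mem = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            writer.writerow([time.strftime("%H:%M:%S"), gfx, mem, power_w])
            f.flush()
            time.sleep(1)
    except KeyboardInterrupt:
        pass

pynvml.nvmlShutdown()
```

If it really is the multi-monitor driver bug rather than the card, the log should show the graphics clock sitting well above the ~300 MHz idle state the whole time.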
I could easily be wrong, but that sounds like the driver issue that NV recently put a hotfix out for. If so, I wouldn't be surprised if some situations are still occurring. What I remember is that it had something to do with 2-monitor, G-SYNC and non-G-SYNC combos.
"Torturing users, reviewers, vendors, and wallets alike, Turing promises you pay now and play later"
Ugh, mine started locking up like this while playing last night. I didn't get any artifacting or crashes, just brief lock-ups that definitely seemed different from frame drops. Plus the game I was playing wasn't taxing for the card either. As with the other issues I've had, this was using GameStream to my TV while playing at a locked 60 Hz. I hope your video isn't a sign of things to come for me. At least I haven't sold my 1080 yet.