RTX Space Invaders Wanted

Ok so 2 for 2 (sorta) in bad cards being unreliably bad for them... Could mounting have something to do with it?
For example, were yours (the one you originally sent to them) and Akbar's cards installed in a case with the card mounted traditionally (horizontal, fans facing down), while at the firm they were mounted on a test bench (vertical)?

I ask because there's a chance that weight-induced flexing of the card may be causing a component or the GPU chip to twist ever so slightly. If it is a solder issue, then that twisting may be exacerbating the flaky connection just enough to have it manifest... or in this case, to stop manifesting.


It's either that, or nVidia has somehow gotten to your firm and paid them off, and they just so happen to be unable to reproduce the problem! :eek: However, given we don't know who you've used, and I'm not sure anyone even knew you were doing this with your first card... this comment is meant to be taken purely in jest. lol
(Either that, or it IS nanobots programmed to sabotage that are doing their job too soon....)


The firm Kyle sent the card to did have the card artifact for them before they started the tests. So, that would rule out case issues.
 
Don't try to use logic here. That ruins their logic if you try to use better logic.

I don't think people are exactly grasping that two cards now had issues going into testing, had issues during the first few tests, and got to the thermal testing (let's say that is test 5 in a line of tests), where they made the cards run hot for a few hours; then, after running hot, the issues fixed themselves and now they can't get them to mess up again. Instead of accepting that the cards "fixed" themselves during testing, they keep wanting to have it in their heads that Kyle has no idea what he is doing, gave his cards to some random guy on the street, and that nobody involved knows how to test the cards or can figure out how to recreate the issues, period.
 
His name was Chuck, and he had a cool PT Cruiser.
 
the amount of people accusing Kyle of sending the cards to an incompetent bunch of hacks is funny.
Hopefully you're not implying I am one of those people, as I assure you that wasn't what I was trying to convey.

The firm Kyle sent the card to did have the card artifact for them before they started the tests. So, that would rule out case issues.
In all my years, having experienced all the quirky things I have when it comes to computers, or electronics in general, I can confidently say: No, it doesn't rule it out at all.

Look at it this way... Take some rubber tubing and run one end under super hot water for a minute. Hold it between your fingers super tight whilst running it under really cold water for a couple seconds; it's now taken on that squished shape. Try to run water through it and it'll spray all over due to that pinched end. This is akin to what I was stating about the card. So when the firm first received it, the components would have cooled in that same state and the first test or two may yield those results even if the card is oriented differently.

Thing is, if you run hot water through it again it'll expand, and once it cools it's back to normal. Well same with the card. It's no longer in a state of tension, so components relax and make better contact; the artifacting no longer presents itself.


I dunno, there have been numerous times in my life where I've been correct by suggesting something as idiotically simple as this. Occam's Razor, in other words.

The irony here is that I'm implying that there could very well be a solder issue with microfractures and the card's orientation is making matters worse or better, despite the problem still being there. Send it back to Ackbar and let him run it for a few days how he had been. But I reiterate my point to Exavior: I'm not implying they're dumb, but often times the smartest of people fail to try the most simple things, and that's all I'm trying to say.
 

You act like this is the only card to have this issue. You can absolutely rule out case issues. Unless you want to test every computer case out there.
 

My comment is not an attack on anyone directly but is meant generically to poke fun at all the people trying to question whether this company ever thought about testing with <insert method>, or whether they are sure they know how to test a card for issues, period, regardless of the degree of ill intent of the poster.

Kyle sent the card to a company that makes its money and livelihood testing hardware and finding issues with it for companies. This is what they do all day, every day, so one should be able to assume that if they are actually good at their job they will put hardware through proper testing; otherwise I don't know why anyone would use them if they couldn't find a single issue with any problem hardware. That would be like a mechanic shop that has not been able to fix a single car brought to them staying open. No, that doesn't mean they will find every possible issue with every device, and no, that doesn't mean they are perfect. However, when an issue exists that affects a decent percentage of a product, they should be able to find a common thread across that product line, given a few of the items to test and the ability to follow the trends of what works or doesn't work. They also should have a decent set of testing protocols to account for different things, as in this type of setting they are going to be looking for very specific faults, which requires very specific testing to find different types of issues.
 
You act like this is the only card to have this issue. You can absolutely rule out case issues. Unless you want to test every computer case out there.
You misunderstood me. Wasn't at all referring to cases in general. Simply the way a video card is mounted when in a case, versus how it's mounted on a test bench (tech bench). In a case the weight of the card will cause a slight sag/bending of the PCB. In turn, the components will be under an ever so slight tension.

It's all good. :p

Fair enough :)
Thank you for clarifying.
 
I've been traveling around the Rockies with spotty network access, just checking back in. So they were able to reproduce the problem, but then it went away after the thermal stress testing? Bummer. Hopefully they can find some sort of anomaly, or maybe nvidia just isn't baking the card long enough on the production line.
 
We are consistently reproducing the issue again and are finalizing the thermal chamber testing tomorrow (hopefully). That was supposed to be done today; however, we’ve had some serious weather and the lab techs didn’t make it to the office.


Next up will be the assessment of the DRAMs.
 
While it will make for a great editorial, when you have a sample size of one how well does it translate to the other cards with the same issue? Or can this failure symptom only be caused by one type of fault with the board?
If they manage to isolate what is doing it, then it will be easily replicated across all cards that possess the flaw. Possibly even on cards that don't obviously possess it. I'm reaching a bit here because I don't know what they've found out. However, I have faith. Now I will shut up and go back to waiting patiently ;)
 
While in theory, yes, there could be different issues, if 3000 cards all show the same symptom it is far less likely that you have 3000 different issues than 3000 cards with the same issue. Like Legendary Gamer said, once they know the cause on one card they can use that to look for this issue on all other cards. You always have to start with one to find a place to start.
 
I would suggest that if other cards are showing exactly the same symptom, there is a high probability those cards are suffering from the same underlying fault. We only have so many resources to work with, but I don't think that negates us wanting to find out after the non-explanation we got directly from NVIDIA.
 
Kyle, can you say whether there will be an article on what you've discovered based on this test sample?
 
Not trying to suggest it negates the effort. I don't know what they found, I just know of the failure symptom. What I am asking is can this type of failure symptom be generated by a different failure than they uncovered, like an engine stalling can be caused by numerous things, or is this something that the failure symptom really only points to a single fault point? It will be an interesting read regardless.
Don't know and don't have the resources to purchase and evaluate thousands of cards in order to satisfy your suggestion. Quite frankly, I think all you are discussing here is rather obvious. But thanks for pointing it out anyway. I am sure this will be brought up by all the folks wanting to proclaim the research is not valid when/if it gets published. I would expect no less. Quite frankly, these are some of the reasons I even ask myself why I attempt to do things like this.
 
This kind of analysis keeps these companies' feet to the fire.

It's thankless work, but some of us here really appreciate those going beyond the benchmark and RMA experience reporting.

Sometimes you just gotta know why and it itches at your brain until you do.
 
I think you misunderstood me. I am not questioning whatever the results are, or how they were found. Nor am I suggesting you need to buy a bunch of cards.

All we get to see is Space Invaders, but outside of it's broke, I really don't know what it means. I don't understand what failure it signifies. From the sound of it, you suggest you may have found the cause for your card, which is great. I would assume this also means they know what seeing Space Invaders signifies beyond the card is broken. Which leads to the question, is this failure something that looks to have a specific cause, or is it something that could have a few potential causes, but you have at least identified one?

I am not in any way trying to minimize your effort on this, or suggest it is invalid in any way. You are right, nVidia gave us a non-answer, and an answer supported by actual engineering failure analysis would be huge, even if limited to a specific card. It's still more of an answer than anyone else has gotten.
You can read up on the failure here, and there are thousands of more examples of this.

GeForce RTX 2080 Ti FAILS After Gaming for 2 Hours
 
Forgive me for I am stupid, but I don't understand what that means. Is that what happens when the GPU sees too high or low of voltage levels? Tries to process bad data and says fuck it I quit? Gets too hot and starts internally shorting?

Or is that what you are finding out?
 
I agree with nVidia's "test escapes" answer being bullshit. After this testing, do you feel you can answer what is happening on the card when you see it lock up with hard artifacting? Not even the root cause, but just what the failure signifies has occurred? Because even that is significant, as I don't think anyone has been able to say, on a single card, what it means is happening.
 
Will know after testing.
 
Update:

Hey Kyle,


We are still working away on the first board and are able to consistently produce the error. No rush on finding a second board at this point, we aren’t quite sure how we will proceed with testing:


  • Our thermal labs finished their testing Friday afternoon. 14 separate controlled cold boot trials were conducted throughout the testing and we were unable to correlate ambient or GPU temperatures with the issue. Artifacts were present in 50% of our tests and would always occur within less than a minute of running a 3D application.

  • Next step: There is some back and forth debate about how to proceed; both options to proceed with testing are destructive. I should have a better idea by mid-week. We are on a long weekend up here, so it’s been difficult to get some of our engineers' time.
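
For anyone wondering how much weight "50% of 14 trials" can carry, here is a minimal sketch of the arithmetic. It assumes the 50% figure means 7 failing boots out of 14 (the update doesn't give the raw count) and uses a Wilson score interval, which is simply one common choice for small samples and not something the lab said it used. The function name is made up for illustration.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Assumption: "50% of 14 cold boot trials" is taken to mean 7 failures.
low, high = wilson_interval(7, 14)
print(f"observed 50%, plausible failure rate {low:.0%} to {high:.0%}")
```

With only 14 boots the plausible per-boot failure rate spans roughly a quarter to three quarters, so the trials show the fault is intermittent without pinning down how often it strikes.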
 
So it's not temp related.

Interesting that the next step is destructive. Could that mean removal of some parts, such as memory chips or the GPU core, for testing?
 
Dang, well it's starting to look like something... significant... is borked on the cards if it's not a 'simple' thing like temperature and may need a more destructive look at the card. Thinking Nvidia sorta rolled the dice on this problem popping up in any way meaningful and rather than recalling cards they just quietly worked on a fix? eh - who knows.
 
I'm just going to wait for the results patiently. Kyle and the people he has working on this can take all the time necessary to figure this stuff out. At this point we've all speculated what might be the cause of the issues. I'm not expecting there to be a smoking gun at this point, however, my hope is that this might open some damn eyes and make a difference for the better.

I have these bizarre issues with my card that don't outright tell me it's failing but I don't believe it's stable either. I've had the occasional system lock, weird static screens and pixelating issues in web browsers as well as oddities in screen blanking, video playback and more. Some can be blamed on drivers. The rest has to be the damn silicon. Because, toss in any previous gen Nvidia part and all my issues disappear instantly.

This entire search for the truth is why I will always place my faith in the [H]. Not many people have the guts or perseverance to follow anything through to conclusion these days.
 
Adding a little more speculation to the fire...

It'd definitely be interesting if it somehow ended up being something simple like not enough power at low clocks, both for your case and the Space Invaders cards: for example, the transition from 2D clocks to 3D clocks and/or the slow activation of the additional power phases inducing the artifacting lockup. So it could very well be legitimate Test Escapes, and it's just substandard MOSFETs or chokes that oscillate and cause the instability at low clocks.

But what I did find rather interesting was that they felt that another Space Invaders sample card was not needed... Edit: Well, ok, technically it didn't say one wasn't needed full-stop, but I'd still think that with a sample size of 1 that it'd be a case of more-the-merrier. So their comment is still curious either way.


AFTER-THOUGHT: Legendary Gamer - If you get bored and haven't tried it already, maybe try forcing 3D clocks all the time, to see if your "2D" hiccups persist. It's a simple enough test, which at least doesn't require dissecting your card to determine! lol
 
That's a good idea, I won't have to run the 1500 watt space heater in my den anymore either ;). Worth a shot, I will see if I can light it on fire tonight
 
haha Well, thankfully, heat is generated as a result of load, which would be volts x amps. However, even then there's efficiency at play, as well as Watts of power not equating to Watts of "heat" (BTUs). :p Either way, at desktop "load" there won't be a need for too much amperage, even at full speed, so I wouldn't imagine the heat output would rise all that much. Your OCed Ryzen as an example, assuming it isn't OCed via PStates and is still utilizing Cool n' Quiet, stays rather cool despite running full clocks.
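
For a rough feel for the numbers in the volts x amps point, here is a small sketch. It assumes, as a simplification, that all the electrical power a device draws ends up dissipated as heat in the room, and the 12 V / 20 A figures are purely hypothetical, not measurements from any card in this thread. The function names are made up for illustration.

```python
BTU_PER_HOUR_PER_WATT = 3.412142  # standard conversion: 1 W is about 3.412 BTU/h


def electrical_power_watts(volts: float, amps: float) -> float:
    """Power drawn by a DC load: P = V * I."""
    return volts * amps


def heat_btu_per_hour(watts: float) -> float:
    """Heat output, assuming every watt drawn is eventually dissipated as heat."""
    return watts * BTU_PER_HOUR_PER_WATT


heater = heat_btu_per_hour(1500)  # the 1500 watt space heater mentioned above
card = heat_btu_per_hour(electrical_power_watts(12.0, 20.0))  # hypothetical 240 W draw
print(f"heater: {heater:.0f} BTU/h, hypothetical card: {card:.0f} BTU/h")
```

Swap in whatever draw your own monitoring tool reports to compare against the heater.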

Alas, I do understand what you meant either way, as I grew up in Minnesota and my basement bedroom had an electric baseboard heater. However, I never used it in all the years I lived in that room because I kept my computer running 24/7 (Athlon Classic 550MHz). When I built my AthlonMP system a couple years later, I actually needed to keep my door open or else my room would get so hot (even in winter) that I'D be overheating! lol
 
Any update on this?
This stuff takes time. You know there's going to be a Discovery article all about it once all the findings are in, regardless of what is actually discovered.

The issues with the 20 series have been so elusive I imagine it's not an easy analysis. Like you, that question is always on my mind but I gotta let the professionals do their thing without bugging the shit out of them ;)
 
Any update on this?
We were getting the card through a bunch more tests, xray etc., before moving to destructive testing.

Nothing but theories at this point, no hard data to go on.


The lab has been investigating if it is a memory training issue at boot, while some board layout folks are looking closer at our board x-rays of the PCB layout (memory tracings).


I was hopeful that we could discover a root cause while keeping the board functional, but I am starting to think that I may need to send the board off to our imaging labs for destructive analysis next week. The test procedures that keep the board functional are not giving us consistent or actionable data…

That all said, I have been getting some back-channel communications from AIBs as to why these cards are failing, which is NOT easy to come by on this. NVIDIA is putting the hammer down on all these folks to be very quiet about this. However, we have gotten a few comms on what they are seeing, but it is up to us to see if this card is sharing those same issues. Those theories will not be proven without destructive testing, and we are just about to that avenue.
 