RTX Space Invaders Wanted

Exavior · Feb 5, 2019

The number of people accusing Kyle of sending the cards to an incompetent bunch of hacks is funny.

wizzi01 · Feb 5, 2019

Formula.350 said:
Ok so 2 for 2 (sorta) in bad cards being unreliably bad for them... Could mounting have something to do with it?
For example, were yours (that you originally sent to them) and Ackbar's cards installed in a case that had the card mounted traditionally (horizontal, fans facing down), but at the firm they had been mounted in a test bench (vertical)?

I ask because there's a chance that the weight-induced flexing of the card may be causing a component or the GPU chip to twist ever so slightly. If it is a solder issue, then that twisting may be exacerbating the flaky connection just enough to have it manifest... or in this case, to stop manifesting.

It's either that, or nVidia has somehow gotten to your firm and paid them off, and they just so happen to be unable to reproduce the problem! However, given we don't know who you've used, and I'm not sure anyone even knew you were doing this with your first card... this comment is meant to be taken purely in jest. lol
(Either that, or it IS nanobots programmed to sabotage that are doing their job too soon....)

The firm Kyle sent the card to did have the card artifact for them before they started the tests. So, that would rule out case issues.

Exavior · Feb 5, 2019

wizzi01 said:
The firm Kyle sent the card to did have the card artifact for them before they started the tests. So, that would rule out case issues.

Don't try to use logic here. That ruins their logic if you try to use better logic.

I don't think people are exactly grasping that two cards now had issues going into testing, had issues during the first few tests, and got to the thermal testing (let's say that is test 5 in a line of tests), where they made the cards run hot for a few hours; after running hot the issues fixed themselves, and now they can't get them to mess up again. Instead of accepting that the cards "fixed" themselves during testing, they keep wanting to have it in their heads that Kyle has no idea what he is doing, gave his cards to some random guy on the street, and that nobody involved knows how to test the cards or can figure out how to recreate the issues, period.

FrgMstr · Feb 5, 2019

Exavior said:
Don't try to use logic here. That ruins their logic if you try to use better logic.

I don't think people are exactly grasping that two cards now had issues going into testing, had issues during the first few tests, and got to the thermal testing (let's say that is test 5 in a line of tests), where they made the cards run hot for a few hours; after running hot the issues fixed themselves, and now they can't get them to mess up again. Instead of accepting that the cards "fixed" themselves during testing, they keep wanting to have it in their heads that Kyle has no idea what he is doing, gave his cards to some random guy on the street, and that nobody involved knows how to test the cards or can figure out how to recreate the issues, period.

His name was Chuck, and he had a cool PT Cruiser.

Formula.350 · Feb 5, 2019

Exavior said:
The number of people accusing Kyle of sending the cards to an incompetent bunch of hacks is funny.

Hopefully you're not implying I am one of those people, as I assure you that wasn't what I was trying to convey.

wizzi01 said:
The firm Kyle sent the card to did have the card artifact for them before they started the tests. So, that would rule out case issues.

In all my years, having experienced all the quirky things I have when it comes to computers, or electronics in general, I can confidently say: No, it doesn't rule it out at all.

Look at it this way... Take some rubber tubing and run one end under super hot water for a minute. Hold it between your fingers super tight whilst running it under really cold water for a couple seconds; it's now taken on that squished shape. Try to run water through it and it'll spray all over due to that pinched end. This is akin to what I was stating about the card. So when the firm first received it, the components would have cooled in that same state and the first test or two may yield those results even if the card is oriented differently.

Thing is, if you run hot water through it again it'll expand, and once it cools it's back to normal. Well same with the card. It's no longer in a state of tension, so components relax and make better contact; the artifacting no longer presents itself.

I dunno, there have been numerous times in my life where I've been correct by suggesting something as idiotically simple as this. Occam's Razor, in other words.

The irony here is that I'm implying that there could very well be a solder issue with microfractures and the card's orientation is making matters worse or better, despite the problem still being there. Send it back to Ackbar and let him run it for a few days how he had been. But I reiterate my point to Exavior: I'm not implying they're dumb, but oftentimes the smartest of people fail to try the simplest things, and that's all I'm trying to say.

wizzi01 · Feb 6, 2019

Formula.350 said:
Hopefully you're not implying I am one of those people, as I assure you that wasn't what I was trying to convey.

In all my years, having experienced all the quirky things I have when it comes to computers, or electronics in general, I can confidently say: No, it doesn't rule it out at all.

Look at it this way... Take some rubber tubing and run one end under super hot water for a minute. Hold it between your fingers super tight whilst running it under really cold water for a couple seconds; it's now taken on that squished shape. Try to run water through it and it'll spray all over due to that pinched end. This is akin to what I was stating about the card. So when the firm first received it, the components would have cooled in that same state and the first test or two may yield those results even if the card is oriented differently.

Thing is, if you run hot water through it again it'll expand, and once it cools it's back to normal. Well same with the card. It's no longer in a state of tension, so components relax and make better contact; the artifacting no longer presents itself.

I dunno, there have been numerous times in my life where I've been correct by suggesting something as idiotically simple as this. Occam's Razor, in other words.

The irony here is that I'm implying that there could very well be a solder issue with microfractures and the card's orientation is making matters worse or better, despite the problem still being there. Send it back to Ackbar and let him run it for a few days how he had been. But I reiterate my point to Exavior: I'm not implying they're dumb, but oftentimes the smartest of people fail to try the simplest things, and that's all I'm trying to say.

You act like this is the only card to have this issue. You can absolutely rule out case issues. Unless you want to test every computer case out there.

Exavior · Feb 6, 2019

Formula.350 said:
Hopefully you're not implying I am one of those people, as I assure you that wasn't what I was trying to convey.

In all my years, having experienced all the quirky things I have when it comes to computers, or electronics in general, I can confidently say: No, it doesn't rule it out at all.

Look at it this way... Take some rubber tubing and run one end under super hot water for a minute. Hold it between your fingers super tight whilst running it under really cold water for a couple seconds; it's now taken on that squished shape. Try to run water through it and it'll spray all over due to that pinched end. This is akin to what I was stating about the card. So when the firm first received it, the components would have cooled in that same state and the first test or two may yield those results even if the card is oriented differently.

Thing is, if you run hot water through it again it'll expand, and once it cools it's back to normal. Well same with the card. It's no longer in a state of tension, so components relax and make better contact; the artifacting no longer presents itself.

I dunno, there have been numerous times in my life where I've been correct by suggesting something as idiotically simple as this. Occam's Razor, in other words.

The irony here is that I'm implying that there could very well be a solder issue with microfractures and the card's orientation is making matters worse or better, despite the problem still being there. Send it back to Ackbar and let him run it for a few days how he had been. But I reiterate my point to Exavior: I'm not implying they're dumb, but oftentimes the smartest of people fail to try the simplest things, and that's all I'm trying to say.

My comment is not an attack on anyone directly but is meant generically to poke fun at all the people trying to question if this company ever thought about testing doing <insert method>, or are they sure they know how to test a card for issues period, regardless of the degree of ill intent of the poster.

Kyle sent the card to a company that makes its money and livelihood testing hardware and finding issues with it for companies. This is what they do all day, every day, so one should be able to assume that if they are actually good at their job they will put hardware through proper testing; otherwise I don't know why anyone would use them if they couldn't find a single issue with any problem hardware. That would be like a mechanic shop that has not been able to fix a single car brought to them staying open. No, that doesn't mean they will find every possible issue with every device, and no, that doesn't mean they are perfect. However, when an issue exists that affects a decent percentage of a product, they should be able to find a common thread in that product line, given a few of the items to test and the ability to follow the trends of what works or doesn't work. They also should have a decent set of testing protocols to account for different things, since in this type of setting they are looking for very specific faults, which requires very specific testing to find different types of issues.

Formula.350 · Feb 6, 2019

wizzi01 said:
You act like this is the only card to have this issue. You can absolutely rule out case issues. Unless you want to test every computer case out there.

You misunderstood me. Wasn't at all referring to cases in general. Simply the way a video card is mounted when in a case, versus how it's mounted on a test bench (tech bench). In a case the weight of the card will cause a slight sag/bending of the PCB. In turn, the components will be under an ever so slight tension.

It's all good.

Exavior said:
My comment is not an attack on anyone directly but is meant generically to poke fun at all the people trying to question if this company ever thought about testing doing <insert method>, or are they sure they know how to test a card for issues period, regardless of the degree of ill intent of the poster.

Kyle sent the card to a company that makes its money and livelihood testing hardware and finding issues with it for companies. This is what they do all day, every day, so one should be able to assume that if they are actually good at their job they will put hardware through proper testing; otherwise I don't know why anyone would use them if they couldn't find a single issue with any problem hardware. That would be like a mechanic shop that has not been able to fix a single car brought to them staying open. No, that doesn't mean they will find every possible issue with every device, and no, that doesn't mean they are perfect. However, when an issue exists that affects a decent percentage of a product, they should be able to find a common thread in that product line, given a few of the items to test and the ability to follow the trends of what works or doesn't work. They also should have a decent set of testing protocols to account for different things, since in this type of setting they are looking for very specific faults, which requires very specific testing to find different types of issues.

Fair enough

Thank you for clarifying.

Admiral_Ackbar6 · Feb 10, 2019

I've been traveling around the Rockies with spotty network access, just checking back in. So they were able to reproduce the problem, but then it went away after the thermal stress testing? Bummer. Hopefully they can find some sort of anomaly, or maybe nvidia just isn't baking the card long enough on the production line.

FrgMstr · Feb 12, 2019

We are consistently reproducing the issue again and are finalizing the thermal chamber testing tomorrow (hopefully). That was supposed to be done today; however, we’ve had some serious weather and the lab techs didn’t make it to the office.

Next up will be the assessment of the DRAMs.

Nimisys · Feb 12, 2019

It thought it could escape testing a second time...

Legendary Gamer · Feb 12, 2019

wizzi01 said:
Is the lab in the midwest? we had a pretty bad ice storm last night/today.

My guess would be Texas, perhaps Austin area. Supposed to be the next Silicon Valley.

Exavior · Feb 13, 2019

wizzi01 said:
Is the lab in the midwest? we had a pretty bad ice storm last night/today.

yeah, that damn ice storm has caused me to be at work for the last 22 hours... only a few more hours to go.

Rod Knock · Feb 15, 2019

I'm curious as well to find out what's behind the failures.

FrgMstr · Feb 15, 2019

PhoenixGenesys said:
I'm curious as well to find out what's behind the failures.

I think we have a handle on it, just trying to prove it 100%.

Legendary Gamer · Feb 16, 2019

Nimisys said:
While it will make for a great editorial, when you have a sample size of one how well does it translate to the other cards with the same issue? Or can this failure symptom only be caused by one type of fault with the board?

If they manage to isolate what is doing it, then it will be easily replicated across all cards that possess the flaw. Possibly even on cards that don't obviously possess it. I'm reaching a bit here because I don't know what they've found out. However, I have faith. Now I will shut up and go back to waiting patiently

Exavior · Feb 16, 2019

Nimisys said:
While it will make for a great editorial, when you have a sample size of one how well does it translate to the other cards with the same issue? Or can this failure symptom only be caused by one type of fault with the board?

While in theory, yes, there could be different issues, it is less likely that 3000 cards showing the same symptom have 3000 different faults than that they all share one. Like Legendary Gamer said, once they know the cause on one card they can use that to look for this issue on all other cards. You always have to start with one to find a place to start.

FrgMstr · Feb 16, 2019

Nimisys said:
While it will make for a great editorial, when you have a sample size of one how well does it translate to the other cards with the same issue? Or can this failure symptom only be caused by one type of fault with the board?

I would suggest if other cards are seeing exactly the same issue, there would be a high probability of those cards suffering from the same issue. We only have so many resources to work with, but I don't think that negates us wanting to find out after the non-explanation we got directly from NVIDIA.

Ehren8879 · Feb 16, 2019

Kyle, can you say whether there will be an article on what you've discovered based on this test sample?

FrgMstr · Feb 16, 2019

Ehren8879 said:
Kyle, can you say whether there will be an article on what you've discovered based on this test sample?

Don't know yet.

FrgMstr · Feb 16, 2019

Nimisys said:
Not trying to suggest it negates the effort. I don't know what they found, I just know of the failure symptom. What I am asking is can this type of failure symptom be generated by a different failure than they uncovered, like an engine stalling can be caused by numerous things, or is this something that the failure symptom really only points to a single fault point? It will be an interesting read regardless.

Don't know and don't have the resources to purchase and evaluate thousands of cards in order to satisfy your suggestion. Quite frankly, I think all you are discussing here is rather obvious. But thanks for pointing it out anyway. I am sure this will be brought up by all the folks wanting to proclaim the research is not valid when/if it gets published. I would expect no less. Quite frankly, these are some of the reasons I even ask myself why I attempt to do things like this.

Ehren8879 · Feb 16, 2019

This kind of analysis keeps these companies' feet to the fire.

It's thankless work, but some of us here really appreciate those going beyond the benchmark and RMA experience reporting.

Sometimes you just gotta know why and it itches at your brain until you do.

FrgMstr · Feb 16, 2019

Nimisys said:
I think you misunderstood me. I am not questioning whatever the results are, or how they were found. Nor am I suggesting you need to buy a bunch of cards.

All we get to see is Space Invaders, but outside of it's broke, I really don't know what it means. I don't understand what failure it signifies. From the sound of it, you suggest you may have found the cause for your card, which is great. I would assume this also means they know what seeing Space Invaders signifies beyond the card is broken. Which leads to the question, is this failure something that looks to have a specific cause, or is it something that could have a few potential causes, but you have at least identified one?

I am not in any way trying to minimize your effort on this, or suggest it is invalid in any way. You are right, nVidia gave us a non-answer, and an answer, supported by actual engineering failure analysis, would be huge, even if limited to a specific card. It is still more of an answer than anyone else has gotten.

You can read up on the failure here, and there are thousands of more examples of this.

GeForce RTX 2080 Ti FAILS After Gaming for 2 Hours

SeymourGore · Feb 16, 2019

I bet Nvidia will be so happy that you figured out this problem for them that they'll forgive you for ruining their GPP feature, Kyle!

Nimisys · Feb 16, 2019

Forgive me for I am stupid, but I don't understand what that means. Is that what happens when the GPU sees too high or low of voltage levels? Tries to process bad data and says fuck it I quit? Gets too hot and starts internally shorting?

Or is that what you are finding out?

Nimisys · Feb 16, 2019

I agree with nVidia's "test escapes" answer being bullshit. After this testing, do you feel you can answer what is happening on the card when you see it lock up with hard artifacting? Not even the root cause, but just what the failure signifies having occurred? Because even that is significant, as I don't think anyone has been able to say on a single card what it means is happening.

FrgMstr · Feb 16, 2019

Nimisys said:
I agree with nVidia's "test escapes" answer being bullshit. After this testing, do you feel you can answer what is happening on the card when you see it lock up with hard artifacting? Not even the root cause, but just what the failure signifies having occurred? Because even that is significant, as I don't think anyone has been able to say on a single card what it means is happening.

Will know after testing.

Nimisys · Feb 16, 2019

Got it. When you do find out, please share.

Aireoth · Feb 16, 2019

Nimisys said:
Got it. When you do find out, please share.

Naw, I think he is going to keep it to himself, there is no way there is going to be a big front page article

FrgMstr · Feb 18, 2019

Update:

Hey Kyle,

We are still working away on the first board and are able to consistently produce the error. No rush on finding a second board at this point, we aren’t quite sure how we will proceed with testing:

Our thermal labs finished their testing Friday afternoon. 14 separate controlled cold boot trials were conducted throughout the testing and we were unable to correlate ambient or GPU temperatures with the issue. Artifacts were present in 50% of our tests and would always occur within less than a minute of running a 3D application.

Next step: There is some back-and-forth debate about how to proceed, as both options for further testing are destructive. I should have a better idea by mid-week. We are on a long weekend up here, so it’s been difficult to get some of our engineers’ time.
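As a rough illustration of what "unable to correlate temperatures with the issue" can look like in practice, here is a minimal sketch that computes a Pearson correlation between a per-trial temperature and a pass/fail flag. The lab's actual method and readings are not given in the update, so everything here is an assumption: the helper `pearson` and the lists `gpu_temp_c` and `artifact_seen` are hypothetical, and the values are made-up placeholders chosen only to match the reported 14 trials with artifacts in half of them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL placeholder data, NOT the lab's measurements:
# one GPU temperature (deg C) and one pass/fail flag per cold boot trial.
gpu_temp_c = [31, 35, 40, 44, 48, 52, 55, 59, 63, 66, 70, 74, 78, 82]
artifact_seen = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

failure_rate = sum(artifact_seen) / len(artifact_seen)
r = pearson(gpu_temp_c, artifact_seen)

print(f"trials: {len(artifact_seen)}, failure rate: {failure_rate:.0%}")
print(f"temperature vs. artifact correlation: r = {r:+.2f}")
```

With a yes/no outcome this is the point-biserial form of the coefficient. A value near zero would be consistent with temperature not driving the failure, though 14 trials is a small sample either way.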

Maddness · Feb 18, 2019

So it's not temp related.

Interesting that the next step is destructive. Could that mean removal of some parts such as memory chips or the GPU core for testing?

fuzzylogik · Feb 18, 2019

Dang, well it's starting to look like something... significant... is borked on the cards if it's not a 'simple' thing like temperature and may need a more destructive look at the card. Thinking Nvidia sorta rolled the dice on this problem popping up in any way meaningful and rather than recalling cards they just quietly worked on a fix? eh - who knows.

Legendary Gamer · Feb 18, 2019

I'm just going to wait for the results patiently. Kyle and the people he has working on this can take all the time necessary to figure this stuff out. At this point we've all speculated what might be the cause of the issues. I'm not expecting there to be a smoking gun at this point, however, my hope is that this might open some damn eyes and make a difference for the better.

I have these bizarre issues with my card that don't outright tell me it's failing but I don't believe it's stable either. I've had the occasional system lock, weird static screens and pixelating issues in web browsers as well as oddities in screen blanking, video playback and more. Some can be blamed on drivers. The rest has to be the damn silicon. Because, toss in any previous gen Nvidia part and all my issues disappear instantly.

This entire search for the truth is why I will always place my faith in the [H]. Not many people have the guts or perseverance to follow anything through to conclusion these days.

Formula.350 · Feb 18, 2019

Legendary Gamer said:
I have these bizarre issues with my card that don't outright tell me it's failing but I don't believe it's stable either. I've had the occasional system lock, weird static screens and pixelating issues in web browsers as well as oddities in screen blanking, video playback and more. Some can be blamed on drivers. The rest has to be the damn silicon. Because, toss in any previous gen Nvidia part and all my issues disappear instantly.

Adding a little more speculation to the fire...

It'd be definitely interesting if it somehow ended up being something simple like not enough power at low clocks, both for your case and the Space Invaders cards. Such as the transition from 2D clocks to 3D clocks, and/or the slow activation of the additional power phases, which induce the artifacting lockup. So it could very well be legitimate Test Escapes and it's just substandard MOSFETs or Chokes that oscillate and cause the instability at low clocks.

But what I did find rather interesting was that they felt that another Space Invaders sample card was not needed... Edit: Well, ok, technically it didn't say one wasn't needed full-stop, but I'd still think that with a sample size of 1 that it'd be a case of more-the-merrier. So their comment is still curious either way.

AFTER-THOUGHT: Legendary Gamer - If you get bored and haven't tried it already, maybe try forcing 3D clocks all the time, to see if your "2D" hiccups persist. It's a simple enough test, which at least doesn't require dissecting your card to determine! lol

Legendary Gamer · Feb 18, 2019

Formula.350 said:
Adding a little more speculation to the fire...

It'd be definitely interesting if it somehow ended up being something simple like not enough power at low clocks, both for your case and the Space Invaders cards. Such as the transition from 2D clocks to 3D clocks, and/or the slow activation of the additional power phases, which induce the artifacting lockup. So it could very well be legitimate Test Escapes and it's just substandard MOSFETs or Chokes that oscillate and cause the instability at low clocks.

But what I did find rather interesting was that they felt that another Space Invaders sample card was not needed... Edit: Well, ok, technically it didn't say one wasn't needed full-stop, but I'd still think that with a sample size of 1 that it'd be a case of more-the-merrier. So their comment is still curious either way.

AFTER-THOUGHT: Legendary Gamer - If you get bored and haven't tried it already, maybe try forcing 3D clocks all the time, to see if your "2D" hiccups persist. It's a simple enough test, which at least doesn't require dissecting your card to determine! lol

That's a good idea, I won't have to run the 1500 watt space heater in my den anymore either. Worth a shot, I will see if I can light it on fire tonight

Formula.350 · Feb 18, 2019

Legendary Gamer said:
That's a good idea, I won't have to run the 1500 watt space heater in my den anymore either. Worth a shot, I will see if I can light it on fire tonight

haha Well, thankfully, heat is generated as a result of load, which would be volts x amps. However, even then there's efficiency at play, as well as Watts of power not equating to Watts of "heat" (BTUs).

Either way, at desktop "load" there won't be a need for too much amperage, even at full speed, so I wouldn't imagine the heat output would rise too considerably. Your OCed Ryzen as an example, assuming it isn't OCed via PStates and still utilizing Cool n' Quiet, stays rather cool despite running full clocks.

Alas, I do understand what you meant either way, as I grew up in Minnesota and my basement bedroom had an electric baseboard heater. However, I never used it in all the years I lived in that room because I kept my computer running 24/7 (Athlon Classic 550MHz). When I built my AthlonMP system a couple years later, I actually needed to keep my door open or else my room would get so hot (even in winter) that I'D be overheating! lol
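
For anyone who wants to put numbers on the volts x amps point, here is a small sketch of the arithmetic. It assumes simple DC power (P = V x I) and uses the standard conversion of roughly 3.412 BTU/h per watt; the function names and the 12 V / 10 A example draw are hypothetical and are not measurements from any card in this thread.

```python
# Standard conversion factor: 1 watt is about 3.412142 BTU per hour.
WATT_TO_BTU_PER_HOUR = 3.412142


def electrical_power_watts(volts, amps):
    """DC electrical power in watts: P = V * I."""
    return volts * amps


def watts_to_btu_per_hour(watts):
    """Convert a steady power draw in watts to heat output in BTU/h."""
    return watts * WATT_TO_BTU_PER_HOUR


# The 1500 watt space heater mentioned above.
heater_btu_h = watts_to_btu_per_hour(1500)

# HYPOTHETICAL example draw, for illustration only: 12 V at 10 A.
example_watts = electrical_power_watts(12, 10)
example_btu_h = watts_to_btu_per_hour(example_watts)

print(f"1500 W heater: {heater_btu_h:.0f} BTU/h")
print(f"{example_watts} W example draw: {example_btu_h:.0f} BTU/h")
```

Watts and BTU/h are both units of power, so this is a straight unit conversion; how much a card actually draws at forced 3D clocks on the desktop would still have to be measured.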

Nimisys · Mar 1, 2019

Any update on this?

Legendary Gamer · Mar 1, 2019

Nimisys said:
Any update on this?

This stuff takes time. You know there's going to be a Discovery article all about it once all the findings are in, regardless of what is actually discovered.

The issues with the 20 series have been so elusive I imagine it's not an easy analysis. Like you, that question is always on my mind but I gotta let the professionals do their thing without bugging the shit out of them

FrgMstr · Mar 1, 2019

Nimisys said:
Any update on this?

We were getting the card through a bunch more tests, xray etc., before moving to destructive testing.

Nothing but theories at this point, no hard data to go on.

The lab has been investigating if it is a memory training issue at boot, while some board layout folks are looking closer at our board x-rays of the PCB layout (memory tracings).

I was hopeful that we could discover a root cause while keeping the board functional, but I am starting to think that I may need to send the board off to our imaging labs for destructive analysis next week. The test procedures that keep the board functional are not giving us consistent or actionable data…

That all said, I have been getting some back-channel communications from AIBs, which are NOT easy to come by on this, as to why these cards are failing. NVIDIA is putting the hammer down on all these folks to be very quiet about this. However, we have gotten a few comms on this as to what they are seeing, but it is up to us to see if this card is sharing those same issues. Those theories will not be proven without destructive testing, and we are just about at that point.

fuzzylogik · Mar 1, 2019

Very cool.

Still sucks that this is even a thing that needs to be done :/
