What's 2 + 2? Don't Ask the Titan V

rgMekanic

[H]ard|News
The Register is reporting that Nvidia's Volta-based flagship, the Titan V, may have some hardware gremlins that cause it to give different answers to repeated scientific calculations. One engineer told The Register that when he ran identical simulations of an interaction between a protein and an enzyme, the results varied. After repeated tests on four Titan Vs, he found that two of the cards gave numerical errors about 10% of the time.

Shockingly, when The Register repeatedly asked Nvidia for an explanation, they declined to comment. Not something you want from a card marketed for such tasks, but hey, at least it plays games really well.

All in all, it is bad news for boffins, as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions with the virtual world, not rare glitches in the underlying hardware.
 
nVidia's response is right in line so far.

 
Wouldn't this also affect mining?

I would take a guess and say it most likely would. I wonder if Uber could use this bug as an excuse for the autonomous car running over the cyclist in Arizona, if they are using the newest Nvidia Volta tech under the hood of the car. They unveiled their partnership at CES 2018.

Now if only one could determine whether this was somehow a 'feature' of nV's recent architecture that plays a role in their ability to churn out faster framerates. I'd love nothing more than to spin this into a tale of how they are once more misleading everyone with nefarious tactics, but I can't spin it that way, because in all honesty that would mean they'd also have to be outputting visuals that didn't match what the devs created, which would clearly show up in the testing phase. Certainly some devs are using these cards in their game creation work.

Or perhaps nV knew this all along, and that is why the only Volta-architecture card that can output graphics (as I understand it, Tesla cards lack any video-out ports) is the Titan V, at a price meant to deter gamers from using it? Now then... where's my tinfoil hat! :bag:

That being said, it WILL be quite interesting to see if this does somehow play a role in that accident, and also exactly how far back this 'glitch' stems in the architecture, as it could spell tons of trouble for any projects that are running on Volta-based hardware...
On the flip side, perhaps that's why there are Volta-based Titan cards in the first place? Faulty chips that didn't make the grade to become a Tesla, AI, or automotive product? If so, that may save them from the above, and simply mean they need to revise their tests for branding a chip as a Titan V. Also, I didn't notice whether these runs were using the CUDA cores or the tensor cores, but since it was said that previous generations weren't exhibiting this, I'll assume CUDA...
 
I was always under the impression this was common knowledge, and thus why we have both CPUs and GPUs.

CPUs do the hard math that needs reliability, while GPUs crunch the numbers that don't have to be so repeatable and precise.

I always thought GPUs were better at rainbow tables and hashing just because those problems were a better fit for how they do their calculations. They were great at crunching large data sets, but not so great at getting repeatable and super precise results.

I know that sounds odd; I have trouble understanding why, since numbers are numbers to me. But apparently in the computer world there are shortcuts, and GPUs shortcut themselves around complex problems while CPUs brute-force their way through them.

I'm sure there is a better explanation.

But I would think these scientists would know this stuff before choosing this hardware for such problems.
 
I do scientific GPU compute as a part of my job, and yes, even in double precision mode GPU and CPU code will give you different answers; anecdotally I'd say it's because CPUs are actually working at slightly higher precision behind the scenes, so the numerical error in GPU compute is going to be a bit higher. That means that if your algorithm is susceptible to numerical error you'll run into problems sooner on a GPU than a CPU - but if that's a problem it's mostly on you for having a bad algorithm. That said, any GPU calculation should be just as repeatable as any CPU calculation. If it's not, you might as well throw the GPU in the trash and just ask a random number generator to give you an answer.

Given the error rate I would speculate that this will probably end up being fixable in software, though. Likely something akin to an operation being done in single precision mode instead of double precision mode under some rare circumstances/edge cases that haven't been noticed yet.
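
To make that concrete, here's a minimal host-side sketch (my own example in plain C++, not anything from the post above) of why single precision accumulates rounding error faster than double precision. The exact numbers will vary by platform, but the gap is the point:

```cpp
// Minimal sketch: accumulate the same series in float and double.
// Each addition is correctly rounded, but float only carries ~7 significant
// digits, so its error grows far faster than double's (~16 digits).
#include <cstdio>

int main() {
    const int n = 10000000;        // ten million terms of 0.1
    float  sum_f = 0.0f;
    double sum_d = 0.0;
    for (int i = 0; i < n; ++i) {
        sum_f += 0.1f;
        sum_d += 0.1;
    }
    // The exact answer is 1,000,000; the float sum drifts off by a large
    // margin, while the double sum stays close.
    std::printf("float sum:  %.6f\n", sum_f);
    std::printf("double sum: %.6f\n", sum_d);
    return 0;
}
```

The same code, run twice on the same hardware, should print the same (slightly wrong) numbers both times; that repeatability, not the rounding itself, is what the Titan V reports call into question.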
 
Yep, GPUs aren't CPUs. Whereas you can generally trust a CPU to give a correct result, a GPU may not under certain circumstances. You should code for the possibility of such an error in either case (unless you really don't care, as is the case in many but not all graphics operations), just in case the cosmic rays don't shine in your favor, or you overflow somewhere, or some other unexpected situation arises.
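
A bare-bones version of that defensive pattern might look like the sketch below (simulate_step is a hypothetical stand-in for the real workload, not anything from this thread): run the same deterministic computation twice and refuse to trust the output if the two runs disagree.

```cpp
// Minimal sketch of a run-twice-and-compare sanity check.
// simulate_step is a placeholder for the real (GPU or CPU) computation.
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> simulate_step(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::sin(in[i]) * 2.0 + 1.0;   // stand-in arithmetic
    return out;
}

int main() {
    std::vector<double> input(1000, 0.5);
    auto a = simulate_step(input);
    auto b = simulate_step(input);              // same input, same code path
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {                     // deterministic code should match bit-for-bit
            std::fprintf(stderr, "mismatch at %zu: %.17g vs %.17g\n", i, a[i], b[i]);
            return 1;                           // don't trust this result
        }
    }
    std::puts("runs agree bit-for-bit");
    return 0;
}
```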
 
Premium price without the expected and paid-for premium quality.
But seriously, a professional-level graphics card that (from the sounds of it) screws up basic calculations? I thought ECC and the like were designed to prevent exactly this, and pro-spec cards are supposed to have much, much higher fault prevention in every way possible (it will cost you, of course).

Bet you Nvidia will find a way to blame it on the users themselves instead of directly admitting fault, or find what they consider a clever way to "spin it" to avoid admitting fault in their product (the bumpgate/solder issues they blamed on everyone but themselves, and the 4 GB that is actually 3.5 GB, to name two of the many notable examples I can think of).

Their "performance" obviously comes at a basic-functionality cost. Simple math should not require any special grunt work; constant, repeatable number crunching is supposed to be a GPU's forte, and if you cannot trust the output to be a consistent, "trusty" result, the cards are effectively useless.

Nvidia "it's in the way you've just been played"
 
I can't help but wonder how enjoyable it was to write this headline after the recent article about partner programs ;)

But on topic, I reserve judgement until NV responds. Perhaps it's just a compiler issue, or a driver issue. That's not unheard of in the CPU world (the compiler issue, at least).
 
But hey, it's NV; the results are probably right anyway, it's just our science that's bad. No way it's the GPU's fault.

(GPUs only give approximations of results, just as mentioned above a couple of times. Still bad news for scientific calculation on NV GPUs.)
 
I hold my reservations since it's a sample size of one engineer, so either something has gone bonkers with Volta or it's a PEBKAC issue. I'm curious why the reporter didn't go interview more engineers who have hands-on experience with Volta to see if this is a real issue.

Anyway, I'd never heard of The Register before; how is their reporting normally?
 
For a gaming card, I'd say this wouldn't be a big deal. For a scientific card, though, precision and reliability matter. Sadly, there's a lot of stuff that's been moved to the GPU to be calculated (not only in science, but in finance and other fields that deal with vectors and matrices). While I can understand the basics of a typical PC CPU, video cards are extremely confusing to me in how they operate.
 
That makes me feel bad for the guy with the four-Titan V build he posted that made the front page. That's a $20k investment in errors. UGH.
 
Impossible, Nvidia are gods. Haven't you seen the benchmarks? The best, da best. Clearly the math is wrong.

/sarcasm
 
I do scientific GPU compute as a part of my job, and yes, even in double precision mode GPU and CPU code will give you different answers; anecdotally I'd say it's because CPUs are actually working at slightly higher precision behind the scenes, so the numerical error in GPU compute is going to be a bit higher. That means that if your algorithm is susceptible to numerical error you'll run into problems sooner on a GPU than a CPU - but if that's a problem it's mostly on you for having a bad algorithm. That said, any GPU calculation should be just as repeatable as any CPU calculation. If it's not, you might as well throw the GPU in the trash and just ask a random number generator to give you an answer.

Given the error rate I would speculate that this will probably end up being fixable in software, though. Likely something akin to an operation being done in single precision mode instead of double precision mode under some rare circumstances/edge cases that haven't been noticed yet.

Yeah, this looks like something that can be handled, although it'll probably kill throughput for the specific instruction(s) involved. Precision (internal and external) should be standardized per IEEE 754 (I think this would show up most in a multiply-add instruction), but I'm wondering more whether what you're seeing is differences in order of operations.
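
For anyone curious what "order of operations" and multiply-add contraction actually do to a result, here's a small self-contained C++ sketch (my own illustration, not tied to the Titan V hardware) showing both effects on ordinary IEEE 754 doubles and floats:

```cpp
// Two ways the "same" math gives different bits:
//  1) a fused multiply-add rounds a*b + c once, the unfused form rounds twice;
//  2) floating-point addition is not associative, so reordering a sum changes it.
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
    volatile double prod = a * b;                 // force the product to be rounded on its own
    double two_roundings = prod + c;              // round the product, then round the sum
    double one_rounding  = std::fma(a, b, c);     // fused: a*b + c rounded once
    std::printf("unfused: %.17g\n", two_roundings);
    std::printf("fused:   %.17g\n", one_rounding);

    float x = 1.0e8f, y = -1.0e8f, z = 1.0f;
    std::printf("(x + y) + z = %g\n", (x + y) + z);   // 1: x and y cancel first
    std::printf("x + (y + z) = %g\n", x + (y + z));   // 0: z is absorbed into y
    return 0;
}
```

A parallel GPU reduction reorders additions like the second pair above, which is why bit-exact agreement with a serial CPU sum isn't guaranteed even when both are IEEE-compliant.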
 
How long after it's released until the fire sale? I'd pay a couple...nay, a few...hundred dollars for a pair of them.
 
Could this be ignorance about Floating Point math?
Also, it doesn't matter whether it's a CPU or GPU; the FP standard is the same.

From Wikipedia:
"Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors due to round-off."

I recommend reading some of this: http://docs.nvidia.com/cuda/floating-point/index.html#floating-point

Of course, the majority of "tech enthusiasts" here don't actually give a shit and just follow the manufactured outrage...
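
The Wikipedia line above is easy to demonstrate. Here's a short sketch (my own example, standard C++, nothing Titan V-specific) where every individual operation is correctly rounded to within half a ULP, yet the naive formula still loses a big chunk of its accuracy to cancellation:

```cpp
// Catastrophic cancellation: solving x^2 - 1e8*x + 1 = 0 for its small root.
// Every operation below is correctly rounded, but subtracting two nearly
// equal numbers (b - disc) wipes out most of the significant digits.
#include <cmath>
#include <cstdio>

int main() {
    double b = 1.0e8, c = 1.0;
    double disc = std::sqrt(b * b - 4.0 * c);
    double naive  = (b - disc) / 2.0;        // cancellation: b and disc are nearly equal
    double stable = (2.0 * c) / (b + disc);  // algebraically the same root, no cancellation
    std::printf("naive root:  %.17g\n", naive);   // noticeably off from ~1e-8
    std::printf("stable root: %.17g\n", stable);  // close to the true value
    return 0;
}
```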
 

Of course, the majority of "flippant commenters", such as yourself, aren't aware that you don't get IEEE 754-compliant calculations unless you compile for them. And even if you do, there's an "Inexact" exception that has to be handled.
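
For what it's worth, the "Inexact" flag is something you can poke at from standard C++ on the host. The sketch below is my own illustration of that point; the compiler-flag caveat in the comments (gcc's -ffast-math, nvcc's --fmad) is a general observation about FP strictness, not something confirmed about this Titan V case.

```cpp
// Checking the IEEE 754 "inexact" flag after an operation that has to round.
// Note: how strictly the compiled code follows IEEE 754 also depends on build
// options (e.g. -ffast-math on gcc/clang, --fmad on nvcc relax the rules), and
// some compilers want #pragma STDC FENV_ACCESS ON for strict flag handling.
#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double denom = 3.0;        // volatile keeps the division at run time
    double third = 1.0 / denom;         // 1/3 is not representable, so the result is rounded
    (void)third;
    if (std::fetestexcept(FE_INEXACT))
        std::puts("FE_INEXACT raised: the result had to be rounded");
    else
        std::puts("result was exact");
    return 0;
}
```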
 

So instead of having a discussion about the technical aspects of the hardware and what may be wrong, "the enthusiasts" devolve into fanboyism, baseless accusations/assumptions, and claiming that math should be done on CPUs -- all fueled by "news" articles intended to create outrage instead of fostering discussion.

Also, whether you're compiling for IEEE 754 or not, FP math has inherent inaccuracies which need to be handled and accounted for.
But I'm not here to prove who knows more about FP math (I'm certainly no expert; I've just personally seen what happens when you ignore the limitations). I'm just highlighting what I think is a problem with how "news" is presented here.
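
On the "handled and accounted for" point, the usual minimal precaution is to stop comparing floating-point results for exact equality. A small sketch of that (my own, with placeholder tolerances you'd tune per problem):

```cpp
// Compare floating-point values with a relative/absolute tolerance instead of ==.
// The tolerance values here are placeholders; pick them for your own problem.
#include <cmath>
#include <cstdio>

bool approx_equal(double a, double b,
                  double rel_tol = 1e-12, double abs_tol = 1e-15) {
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::fmax(abs_tol, rel_tol * scale);
}

int main() {
    double x = 0.1 + 0.2;                                            // 0.30000000000000004
    std::printf("x == 0.3     -> %d\n", (int)(x == 0.3));            // 0: exact comparison fails
    std::printf("approx_equal -> %d\n", (int)approx_equal(x, 0.3));  // 1: tolerant comparison passes
    return 0;
}
```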
 