What's 2 + 2? Don't Ask the Titan V

rgMekanic

[H]ard|News
The Register is reporting that Nvidia's Volta-based flagship, the Titan V, may have some hardware gremlins that cause it to give different answers to repeated scientific calculations. One engineer told The Register that when he ran identical simulations of an interaction between a protein and an enzyme, the results varied. After repeated tests on four Titan Vs, he found that two of the cards gave numerical errors about 10% of the time.

Shockingly, when The Register repeatedly asked Nvidia for an explanation, they declined to comment. Not something you want from a card marketed for such tasks, but hey, at least it plays games really well.

All in all, it is bad news for boffins, as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions with the virtual world, not rare glitches in the underlying hardware.
 
nVidia's response is right in line so far.

 
Wouldn't this also affect mining?

I would take a guess and say it most likely would. I wonder if Uber could use this bug as an excuse for the autonomous car running over the cyclist in Arizona, if they are using the newest Nvidia Volta tech under the hood of the car. They unveiled their partnership at CES 2018.

Now if only one could determine whether this was somehow a 'feature' of nV's recent architecture that plays a role in their ability to churn out faster framerates. I'd love nothing more than to spin this into a tale of how they are once more misleading everyone with nefarious tactics, but I can't spin it that way, because in all honesty that would mean they'd also have to be outputting visuals that didn't match what the devs created, which would clearly show up in the testing phase. Certainly some devs are using these cards in their game creation work.

Or perhaps nV knew this all along, and that is why the only Volta-architecture card that can output graphics (as I understand it, Tesla cards lack any video-out ports) is the Titan V, at a price meant to deter gamers from using it? Now then... where's my tinfoil hat! :bag:

That being said, it WILL be quite interesting to see if this does somehow play a role in that accident, and also exactly how far back this 'glitch' stems in the architecture, as it could spell tons of trouble for any projects that are running on Volta-based hardware...
On the flip side, perhaps that's why there are Volta-based Titan cards in the first place? Faulty chips that didn't make the grade to become a Tesla, AI, or automotive product? If so, that may save them from the above, and simply mean they need to revise their tests for branding a chip as a Titan V. Also, I didn't notice whether these runs were using the CUDA cores or the tensor cores, but since it was said that previous generations weren't exhibiting this, I'll assume CUDA...
 
I was always under the impression this was common knowledge, and thus why we have both CPUs and GPUs.

CPUs do the hard math that needs reliability, while GPUs crunch the numbers that don't have to be so repeatable and precise.

I always thought GPUs were better at rainbow tables and hashing just because those problems were a better fit for how they do their calculations. They were great at crunching large data sets, but not so great at getting repeatable and super precise results.

I know that sounds odd; I have trouble understanding why, since numbers are numbers to me. But apparently in the computer world there are shortcuts, and GPUs shortcut themselves around complex problems while CPUs brute-force their way through them.

I'm sure there is a better explanation.

But I would think these scientists would know this stuff before choosing this hardware for such problems.
 
I do scientific GPU compute as a part of my job, and yes, even in double precision mode GPU and CPU code will give you different answers; anecdotally I'd say it's because CPUs are actually working at slightly higher precision behind the scenes, so the numerical error in GPU compute is going to be a bit higher. That means that if your algorithm is susceptible to numerical error you'll run into problems sooner on a GPU than a CPU - but if that's a problem it's mostly on you for having a bad algorithm. That said, any GPU calculation should be just as repeatable as any CPU calculation. If it's not, you might as well throw the GPU in the trash and just ask a random number generator to give you an answer.

Given the error rate I would speculate that this will probably end up being fixable in software, though. Likely something akin to an operation being done in single precision mode instead of double precision mode under some rare circumstances/edge cases that haven't been noticed yet.
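
To make that concrete, here's a minimal host-side sketch (my own example in plain C++, not anything from the post above) of why single precision accumulates rounding error faster than double precision. The exact numbers will vary by platform, but the gap is the point:

```cpp
// Minimal sketch: accumulate the same series in float and double.
// Each addition is correctly rounded, but float only carries ~7 significant
// digits, so its error grows far faster than double's (~16 digits).
#include <cstdio>

int main() {
    const int n = 10000000;        // ten million terms of 0.1
    float  sum_f = 0.0f;
    double sum_d = 0.0;
    for (int i = 0; i < n; ++i) {
        sum_f += 0.1f;
        sum_d += 0.1;
    }
    // The exact answer is 1,000,000; the float sum drifts off by a large
    // margin, while the double sum stays close.
    std::printf("float sum:  %.6f\n", sum_f);
    std::printf("double sum: %.6f\n", sum_d);
    return 0;
}
```

The same code, run twice on the same hardware, should print the same (slightly wrong) numbers both times; that repeatability, not the rounding itself, is what the Titan V reports call into question.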
 
Yep, GPUs aren't CPUs. Whereas you can generally trust a CPU to give a correct result, a GPU may not under certain circumstances. You should code for the possibility of such an error in either case (unless you really don't care, as is the case in many but not all graphics operations), just in case the cosmic rays don't shine in your favor, or you overflow somewhere, or some other unexpected situation arises.
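
A bare-bones version of that defensive pattern might look like the sketch below (simulate_step is a hypothetical stand-in for the real workload, not anything from this thread): run the same deterministic computation twice and refuse to trust the output if the two runs disagree.

```cpp
// Minimal sketch of a run-twice-and-compare sanity check.
// simulate_step is a placeholder for the real (GPU or CPU) computation.
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> simulate_step(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::sin(in[i]) * 2.0 + 1.0;   // stand-in arithmetic
    return out;
}

int main() {
    std::vector<double> input(1000, 0.5);
    auto a = simulate_step(input);
    auto b = simulate_step(input);              // same input, same code path
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {                     // deterministic code should match bit-for-bit
            std::fprintf(stderr, "mismatch at %zu: %.17g vs %.17g\n", i, a[i], b[i]);
            return 1;                           // don't trust this result
        }
    }
    std::puts("runs agree bit-for-bit");
    return 0;
}
```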
 
Premium price without the expected and paid-for premium quality.
But seriously, a professional-level graphics card that (from the sounds of it) screws up basic calculations? I thought ECC and the like were designed to prevent exactly this, and pro-spec cards are supposed to have much, much higher fault prevention in every way possible (it will cost you, of course).

Bet you Nvidia will find a way to blame it on the users themselves instead of directly admitting fault, or find what they consider a clever way to "spin it" to avoid admitting fault in their product (the bumpgate/solder issues they blamed on everyone but themselves, and the 4 GB that is actually 3.5 GB, to name two of the many notable examples I can think of).

Their "performance" obviously comes at a basic-functionality cost. Simple math should not require any special grunt work; constant, repeatable number crunching is supposed to be a GPU's forte, and if you cannot trust the output to be a consistent, "trusty" result, the cards are effectively useless.

Nvidia "it's in the way you've just been played"
 
I can't help but wonder how enjoyable it was to write this headline after the recent article about partner programs ;)

But on topic, I reserve judgement until NV responds. Perhaps it's just a compiler issue, or a driver issue. That's not unheard of in the CPU world (the compiler issue, at least).
 
But hey, it's NV; the results are probably right anyway, it's just our science that's bad. No way it's the GPU's fault.

(GPUs only give approximations of results, just as mentioned above a couple of times. Still bad news for scientific calculation on NV GPUs.)
 
I hold my reservations since it's a sample size of one engineer, so either something has gone bonkers with Volta or it's a PEBKAC issue. I'm curious why the reporter didn't go interview more engineers who have hands-on experience with Volta to see if this is a real issue.

Anyway, I'd never heard of The Register before; how is their reporting normally?
 
For a gaming card, I'd say this wouldn't be a big deal. For a scientific card, though, precision and reliability matter. Sadly, there's a lot of stuff that's been moved to the GPU to be calculated (not only in science, but in finance and other fields that deal with vectors and matrices). While I can understand the basics of a typical PC CPU, video cards are extremely confusing to me in how they operate.
 
That makes me feel bad for the guy with the four-Titan V build he posted that made the front page. That's a $20k investment in errors. UGH.
 
Impossible, Nvidia are gods. Haven't you seen the benchmarks? The best, da best. Clearly the math is wrong.

/sarcasm
 
I do scientific GPU compute as a part of my job, and yes, even in double precision mode GPU and CPU code will give you different answers; anecdotally I'd say it's because CPUs are actually working at slightly higher precision behind the scenes, so the numerical error in GPU compute is going to be a bit higher. That means that if your algorithm is susceptible to numerical error you'll run into problems sooner on a GPU than a CPU - but if that's a problem it's mostly on you for having a bad algorithm. That said, any GPU calculation should be just as repeatable as any CPU calculation. If it's not, you might as well throw the GPU in the trash and just ask a random number generator to give you an answer.

Given the error rate I would speculate that this will probably end up being fixable in software, though. Likely something akin to an operation being done in single precision mode instead of double precision mode under some rare circumstances/edge cases that haven't been noticed yet.

Yeah, this looks like something that can be handled, although it'll probably kill throughput for the specific instruction(s) involved. Precision (internal and external) should be standardized per IEEE 754 (I think this would show up most in a multiply-add instruction), but I'm wondering more whether what you're seeing is differences in order of operations.
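
For anyone curious what "order of operations" and multiply-add contraction actually do to a result, here's a small self-contained C++ sketch (my own illustration, not tied to the Titan V hardware) showing both effects on ordinary IEEE 754 doubles and floats:

```cpp
// Two ways the "same" math gives different bits:
//  1) a fused multiply-add rounds a*b + c once, the unfused form rounds twice;
//  2) floating-point addition is not associative, so reordering a sum changes it.
#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
    volatile double prod = a * b;                 // force the product to be rounded on its own
    double two_roundings = prod + c;              // round the product, then round the sum
    double one_rounding  = std::fma(a, b, c);     // fused: a*b + c rounded once
    std::printf("unfused: %.17g\n", two_roundings);
    std::printf("fused:   %.17g\n", one_rounding);

    float x = 1.0e8f, y = -1.0e8f, z = 1.0f;
    std::printf("(x + y) + z = %g\n", (x + y) + z);   // 1: x and y cancel first
    std::printf("x + (y + z) = %g\n", x + (y + z));   // 0: z is absorbed into y
    return 0;
}
```

A parallel GPU reduction reorders additions like the second pair above, which is why bit-exact agreement with a serial CPU sum isn't guaranteed even when both are IEEE-compliant.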
 
How long after it's released until the fire sale? I'd pay a couple...nay, a few...hundred dollars for a pair of them.
 
Could this be ignorance about Floating Point math?
Also, it doesn't matter whether it's a CPU or GPU; the FP standard is the same.

From Wikipedia:
"Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors due to round-off."

I recommend reading some of this: http://docs.nvidia.com/cuda/floating-point/index.html#floating-point

Of course, the majority of "tech enthusiasts" here don't actually give a shit and just follow the manufactured outrage...
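
The Wikipedia line above is easy to demonstrate. Here's a short sketch (my own example, standard C++, nothing Titan V-specific) where every individual operation is correctly rounded to within half a ULP, yet the naive formula still loses a big chunk of its accuracy to cancellation:

```cpp
// Catastrophic cancellation: solving x^2 - 1e8*x + 1 = 0 for its small root.
// Every operation below is correctly rounded, but subtracting two nearly
// equal numbers (b - disc) wipes out most of the significant digits.
#include <cmath>
#include <cstdio>

int main() {
    double b = 1.0e8, c = 1.0;
    double disc = std::sqrt(b * b - 4.0 * c);
    double naive  = (b - disc) / 2.0;        // cancellation: b and disc are nearly equal
    double stable = (2.0 * c) / (b + disc);  // algebraically the same root, no cancellation
    std::printf("naive root:  %.17g\n", naive);   // noticeably off from ~1e-8
    std::printf("stable root: %.17g\n", stable);  // close to the true value
    return 0;
}
```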
 

Of course, the majority of "flippant commenters", such as yourself, aren't aware that you don't get IEEE 754-compliant calculations unless you compile for them. And even if you do, there's an "Inexact" exception that has to be handled.
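
For what it's worth, the "Inexact" flag is something you can poke at from standard C++ on the host. The sketch below is my own illustration of that point; the compiler-flag caveat in the comments (gcc's -ffast-math, nvcc's --fmad) is a general observation about FP strictness, not something confirmed about this Titan V case.

```cpp
// Checking the IEEE 754 "inexact" flag after an operation that has to round.
// Note: how strictly the compiled code follows IEEE 754 also depends on build
// options (e.g. -ffast-math on gcc/clang, --fmad on nvcc relax the rules), and
// some compilers want #pragma STDC FENV_ACCESS ON for strict flag handling.
#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double denom = 3.0;        // volatile keeps the division at run time
    double third = 1.0 / denom;         // 1/3 is not representable, so the result is rounded
    (void)third;
    if (std::fetestexcept(FE_INEXACT))
        std::puts("FE_INEXACT raised: the result had to be rounded");
    else
        std::puts("result was exact");
    return 0;
}
```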
 

So instead of having a discussion about the technical aspects of the hardware and what may be wrong, "the enthusiasts" devolve into fanboyism, baseless accusations/assumptions, and claiming that math should be done on CPUs -- all fueled by "news" articles intended to create outrage instead of fostering discussion.

Also, whether you're compiling for IEEE 754 or not, FP math has inherent inaccuracies which need to be handled and accounted for.
But I'm not here to prove who knows more about FP math (I'm certainly no expert; I've just personally seen what happens when you ignore the limitations). I'm just highlighting what I think is a problem with how "news" is presented here.
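
On the "handled and accounted for" point, the usual minimal precaution is to stop comparing floating-point results for exact equality. A small sketch of that (my own, with placeholder tolerances you'd tune per problem):

```cpp
// Compare floating-point values with a relative/absolute tolerance instead of ==.
// The tolerance values here are placeholders; pick them for your own problem.
#include <cmath>
#include <cstdio>

bool approx_equal(double a, double b,
                  double rel_tol = 1e-12, double abs_tol = 1e-15) {
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::fmax(abs_tol, rel_tol * scale);
}

int main() {
    double x = 0.1 + 0.2;                                            // 0.30000000000000004
    std::printf("x == 0.3     -> %d\n", (int)(x == 0.3));            // 0: exact comparison fails
    std::printf("approx_equal -> %d\n", (int)approx_equal(x, 0.3));  // 1: tolerant comparison passes
    return 0;
}
```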
 