What's 2 + 2? Don't Ask the Titan V

Discussion in '[H]ard|OCP Front Page News' started by rgMekanic, Mar 21, 2018.

  1. rgMekanic

    rgMekanic [H]ard|News Staff Member

    Messages:
    5,059
    Joined:
    May 13, 2013
    The Register is reporting that Nvidia's Volta-based flagship, the Titan V, may have some hardware gremlins that cause it to give different answers to repeated scientific calculations. One engineer told The Register that when running identical simulations of an interaction between a protein and an enzyme, the results varied. After repeated tests on four Titan Vs, he found that two of the cards gave numerical errors about 10% of the time.

    Shockingly, when The Register repeatedly asked Nvidia for an explanation, they declined to comment. Not something you want from a card marketed for such tasks, but hey, at least it plays games really well.

    All in all, it is bad news for boffins as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions with the virtual world, not rare glitches in the underlying hardware.
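    For the curious, here is a minimal sketch of the kind of repeatability test being described: launch an identical, fixed-order kernel on identical input many times and count bitwise mismatches. The kernel and sizes are stand-ins, not the actual protein/enzyme simulation.

        #include <cstdio>
        #include <cstring>
        #include <vector>
        #include <cuda_runtime.h>

        // Stand-in for one deterministic simulation step (fixed operation
        // order per element, so every run should be bit-identical).
        __global__ void step(const double *in, double *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i] * 1.000001 + 0.5;
        }

        int main() {
            const int n = 1 << 20, runs = 100;
            std::vector<double> input(n), first(n), current(n);
            for (int i = 0; i < n; ++i) input[i] = 1.0 / (i + 1);

            double *d_in, *d_out;
            cudaMalloc(&d_in,  n * sizeof(double));
            cudaMalloc(&d_out, n * sizeof(double));
            cudaMemcpy(d_in, input.data(), n * sizeof(double), cudaMemcpyHostToDevice);

            int mismatches = 0;
            for (int r = 0; r < runs; ++r) {
                step<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
                cudaMemcpy(current.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
                if (r == 0)
                    first = current;  // reference result from the first run
                else if (std::memcmp(first.data(), current.data(), n * sizeof(double)))
                    ++mismatches;     // identical input, different bits
            }
            printf("%d of %d repeat runs differed from the first\n", mismatches, runs - 1);
            cudaFree(d_in);
            cudaFree(d_out);
        }

    Note that kernels which reduce with floating-point atomics can legitimately vary run to run because the addition order changes; a test like this only makes sense for a fixed-order computation like the one above.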
     
  2. sadsteve

    sadsteve Limp Gawd

    Messages:
    421
    Joined:
    Oct 1, 2010
    Heh, it's as if the GPUs are using slide rules to do their calculations! :)
     
    Chupachup likes this.
  3. WhoMe

    WhoMe Gawd

    Messages:
    737
    Joined:
    Jan 3, 2018
    Then they should just call it a quantum card and say it's a feature.
     
  4. Aireoth

    Aireoth [H]ard|Gawd

    Messages:
    1,208
    Joined:
    Oct 12, 2005
    nVidia's response is right in line so far.

     
    tetris42, N4CR, lostin3d and 5 others like this.
  5. ir0nw0lf

    ir0nw0lf [H]ardness Supreme

    Messages:
    6,009
    Joined:
    Feb 7, 2003
    Which will be quickly followed with "We have moved on from this story." :LOL:
     
    horskh, lostin3d, Willypants and 14 others like this.
  6. Sparky

    Sparky 2[H]4U

    Messages:
    3,126
    Joined:
    Mar 9, 2000
    I knew it wasn't me when I got killed.
     
  7. jpm100

    jpm100 [H]ardness Supreme

    Messages:
    6,970
    Joined:
    Oct 31, 2004
    graphics cards are for graphics.
     
    Stryker7314 likes this.
  8. Dekar12

    Dekar12 Gawd

    Messages:
    734
    Joined:
    Oct 2, 2003
    Wouldn't this also affect mining?
     
    Stryker7314, motomonkey and rgMekanic like this.
  9. cageymaru

    cageymaru [H]ard|News

    Messages:
    17,938
    Joined:
    Apr 10, 2003
  10. rgMekanic

    rgMekanic [H]ard|News Staff Member

    Messages:
    5,059
    Joined:
    May 13, 2013
  11. Krazy925

    Krazy925 epeen +10

    Messages:
    2,520
    Joined:
    Sep 29, 2012
    Maybe AMD can capitalize on this.

    They could always use a win.
     
  12. bugleyman

    bugleyman [H]ard|Gawd

    Messages:
    1,041
    Joined:
    Oct 27, 2010
    Good. At this point NVidia needs to be knocked down a peg. They've gotten too big for their britches.
     
  13. thesmokingman

    thesmokingman [H]ardness Supreme

    Messages:
    4,838
    Joined:
    Nov 22, 2008
    Well that's going to sting a bit.
     
    Stryker7314 likes this.
  14. Formula.350

    Formula.350 Gawd

    Messages:
    910
    Joined:
    Sep 30, 2011
    Now if only one could determine whether this was somehow a 'feature' of nV's recent architecture that plays a role in their ability to churn out faster framerates. I'd love nothing more than to spin this into a tale of how they are once more misleading everyone with nefarious tactics, but I can't spin it that way, since in all honesty... to me that would mean they'd also need to be outputting visuals that didn't match what the devs created, which would clearly show up in the testing phase. Certainly some devs are using these cards in their game creation work.

    Or perhaps nV knew this all along, and that's why the only Volta card that can output graphically (as I understand it, Tesla cards lack any video-out ports) is the TITAN V, priced to deter gamers from using it? Now then... where's my tinfoil hat! :bag:

    That being said, it WILL be quite interesting to see if this does somehow play a role in that accident. Also, exactly how far back in the architecture this 'glitch' stems, as it could spell tons of trouble for any projects running on Volta-based hardware...
    On the flip side, perhaps that's why there are Volta-based TITAN cards in the first place? Faulty chips that didn't make the grade to become a Tesla, AI, or automotive product? If so, that may save them from the above, and simply mean they need to revise the tests that qualify a chip as a TITAN V. Also, I didn't notice whether these were running on the CUDA or Tensor cores, but since it was said previous generations weren't exhibiting this, I'll assume CUDA...
     
    dragonstongue likes this.
  15. BloodyIron

    BloodyIron 2[H]4U

    Messages:
    2,437
    Joined:
    Jul 11, 2005
    How to kill confidence in your expensive products.
     
  16. chaos4u

    chaos4u Limp Gawd

    Messages:
    274
    Joined:
    Dec 1, 2004
    I was always under the impression this was common knowledge; it's why we have CPUs and GPUs.

    CPUs do the hard math that needs reliability, while GPUs crunch the numbers that don't have to be so repeatable and precise.

    I always thought GPUs were better at rainbow tables and hashing just because those problems were a better fit for how they do their calculations. They were great at crunching large data sets, but not that great at getting repeatable and super-precise results.

    I know that sounds odd; I have trouble understanding why that is, since numbers are numbers to me. But apparently in the computer world there are shortcuts, and GPUs shortcut themselves around complex problems while CPUs brute-force their way through them.

    I'm sure there is a better explanation.

    But I would think these scientists would know this stuff before choosing this hardware for such problems.
     
  17. CuriousGeorge

    CuriousGeorge n00bie

    Messages:
    33
    Joined:
    Apr 8, 2012
    I do scientific GPU compute as a part of my job, and yes, even in double precision mode GPU and CPU code will give you different answers; anecdotally I'd say it's because CPUs are actually working at slightly higher precision behind the scenes, so the numerical error in GPU compute is going to be a bit higher. That means that if your algorithm is susceptible to numerical error you'll run into problems sooner on a GPU than a CPU - but if that's a problem it's mostly on you for having a bad algorithm. That said, any GPU calculation should be just as repeatable as any CPU calculation. If it's not, you might as well throw the GPU in the trash and just ask a random number generator to give you an answer.

    Given the error rate I would speculate that this will probably end up being fixable in software, though. Likely something akin to an operation being done in single precision mode instead of double precision mode under some rare circumstances/edge cases that haven't been noticed yet.
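    To illustrate the precision gap (and what a silent drop to single precision would look like), here's a host-side sketch with an arbitrarily chosen loop count: the same naive accumulation in float and double, each perfectly repeatable on its own, drift well apart from each other.

        #include <cstdio>

        int main() {
            float  sum_f = 0.0f;
            double sum_d = 0.0;
            for (int i = 1; i <= 10000000; ++i) {
                sum_f += 1.0f / i;  // single precision: round-off piles up fast
                sum_d += 1.0  / i;  // same algorithm in double: far less error
            }
            // Each is deterministic run to run, but they disagree with each
            // other; an op silently done in single precision looks like this.
            printf("float : %.9g\n",  sum_f);
            printf("double: %.17g\n", sum_d);
        }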
     
    trandoanhung1991, bbf, mashie and 5 others like this.
  18. Azrak

    Azrak Gawd

    Messages:
    626
    Joined:
    Oct 4, 2015
    Where are CTS Labs and Viceroy when you need them?
     
  19. Nobu

    Nobu [H]ard|Gawd

    Messages:
    1,716
    Joined:
    Jun 7, 2007
    Yep, GPUs aren't CPUs. Whereas you can generally trust a CPU to give a correct result, a GPU may not under certain circumstances. You should code for the possibility of such an error in either case (unless you really don't care, as is the case in many but not all graphics operations), just in case the cosmic rays don't shine in your favor, or you overflow somewhere, or some other unexpected situation arises.
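    In that spirit, a minimal sketch of the defensive pattern (hypothetical kernel, sample stride, and tolerance): scan a GPU run for non-finite values and spot-check a sample against a CPU recomputation before trusting it.

        #include <cmath>
        #include <cstdio>
        #include <vector>
        #include <cuda_runtime.h>

        __global__ void square(const float *in, float *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i] * in[i];
        }

        // Cheap plausibility check: no NaN/Inf anywhere, and a sample of
        // outputs agrees with a CPU recomputation to a loose tolerance.
        bool plausible(const std::vector<float> &in, const std::vector<float> &out) {
            for (float v : out)
                if (!std::isfinite(v)) return false;
            for (size_t i = 0; i < in.size(); i += 4096) {
                float ref = in[i] * in[i];
                if (std::fabs(out[i] - ref) > 1e-5f * std::fabs(ref)) return false;
            }
            return true;
        }

        int main() {
            const int n = 1 << 16;
            std::vector<float> in(n), out(n);
            for (int i = 0; i < n; ++i) in[i] = 0.001f * i;

            float *d_in, *d_out;
            cudaMalloc(&d_in,  n * sizeof(float));
            cudaMalloc(&d_out, n * sizeof(float));
            cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
            square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
            cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

            printf(plausible(in, out) ? "run looks sane\n" : "run failed sanity check\n");
            cudaFree(d_in);
            cudaFree(d_out);
        }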
     
  20. dragonstongue

    dragonstongue 2[H]4U

    Messages:
    2,843
    Joined:
    Nov 18, 2008
    premium price without the expected, and paid-for, premium quality.
    But seriously, a professional-level graphics card that screws up (from the sounds of it) basic calculations? I thought ECC etc. was designed to prevent exactly such things, let alone that pro-spec cards are supposed to have much, much higher fault prevention in every way possible (it will cost you, of course).

    Bet you Nv will find a way to blame it on the users themselves instead of directly admitting fault, or find what they consider a clever way to "spin it" to avoid admitting fault in their product (bumpgate/solder issues they blamed on everyone but themselves, 4GB that is actually 3.5GB, just to name 2 of the many notable examples I can think of).

    Their "performance" obviously comes at a basic-functionality cost. Simple math should not require any special grunt work, as simple, constantly replicated tasks are a GPU's forte; if you cannot count on the output being a constant, "trusty" result, the cards are effectively useless.

    Nvidia: "it's in the way you've just been played"
     
    Darth Kyrie likes this.
  21. lollerwaffle

    lollerwaffle Gawd

    Messages:
    543
    Joined:
    Feb 3, 2008
    I can't help but wonder how enjoyable it was to write this headline after the recent article about partner programs ;)

    But on topic, I reserve judgement until NV responds. Perhaps it's just a compiler issue, or a driver issue. That's not unheard of in the CPU world (the compiler issue.)
     
  22. cyklondx

    cyklondx n00bie

    Messages:
    13
    Joined:
    Mar 19, 2018
    But hey, it's nV; the results are likely right anyway - it's just our science that's bad. No way it's the fault of the GPU.

    (GPUs only give approximations of results, as mentioned above a couple of times - still bad news for scientific calc on nV GPUs.)
     
  23. travm

    travm Limp Gawd

    Messages:
    174
    Joined:
    Feb 26, 2016
    You heard it here first folks, Nvidia kills people!
     
    Stryker7314, Darth Kyrie and Krazy925 like this.
  24. BSmith

    BSmith Gawd

    Messages:
    957
    Joined:
    Nov 9, 2017
    Dayum... what the heck is going on over at NVidia?
     
    Stryker7314 likes this.
  25. Chimpee

    Chimpee Gawd

    Messages:
    956
    Joined:
    Jul 6, 2015
    I hold my reservations since it is only a sample size of one engineer, so either something has gone bonkers with Volta or it is a PEBKAC issue. I am curious why the reporter doesn't go interview more engineers who have hands-on experience with Volta to see if this is a widespread issue.

    Anyway, never heard of The Register before; how is their reporting normally?
     
  26. Nytegard

    Nytegard 2[H]4U

    Messages:
    3,060
    Joined:
    Jan 8, 2004
    For a gaming card, I'd say this wouldn't be a big deal. For a scientific card though, precision and reliability matter. Sadly, there's a lot of stuff that's been moved to the GPU to be calculated (not only in science, but in finance and other fields that deal with vectors and matrices). While I can understand the basics of an ordinary PC CPU, video cards are extremely confusing to me in how they operate.
     
  27. Anarchist4000

    Anarchist4000 [H]ard|Gawd

    Messages:
    1,651
    Joined:
    Jun 10, 2001
    Poor Volta, not quite Turing, but maybe with a few more Amperes they'll get the correct results.
     
    Darth Kyrie, JDiesel, N4CR and 6 others like this.
  28. Gideon

    Gideon [H]ard|Gawd

    Messages:
    1,553
    Joined:
    Apr 13, 2006
    Oh man, that's a serious issue if it's repeatable in other workloads. There will be a lot of pissed-off researchers as well.
     
    Stryker7314 and Darth Kyrie like this.
  29. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,106
    Joined:
    May 9, 2006
    That makes me feel bad for the guy with the 4x Nvidia Titan V build he posted that made the front page. That's a $20k investment in errors. UGH.
     
  30. Azphira

    Azphira [H]ard|Gawd

    Messages:
    1,656
    Joined:
    Aug 18, 2003
  31. DukenukemX

    DukenukemX 2[H]4U

    Messages:
    3,682
    Joined:
    Jan 30, 2005
    Impossible, Nvidia are Gods. Haven't you seen the benchmarks? The best, da best. Clearly the math is wrong.

    /sarcasm
     
    Darth Kyrie likes this.
  32. Derfnofred

    Derfnofred Gawd

    Messages:
    544
    Joined:
    Dec 11, 2009
    Yeah, this looks like something that can be handled, albeit it'll probably kill throughput for the specific instruction(s). Precision (internal and external) should be standardized per IEEE 754 (I think this would show up most in a multiply-add instruction), but I'm wondering more if what you're seeing is down to differences in order of operations.
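    Both effects are easy to show host-side; a sketch with values chosen to make the rounding visible (note that nvcc contracts a*b + c into an FMA on the device by default, steerable with --fmad=false):

        #include <cmath>
        #include <cstdio>

        // Build with FP contraction disabled (e.g. g++ -ffp-contract=off)
        // so a*b + c really is two separately rounded operations.
        int main() {
            // Fused multiply-add: one rounding instead of two.
            double e = std::ldexp(1.0, -30);  // 2^-30, exactly representable
            double a = 1.0 + e, b = 1.0 - e, c = -1.0;
            printf("a*b + c      : %.17g\n", a * b + c);          // a*b rounds to 1.0 -> 0
            printf("fma(a, b, c) : %.17g\n", std::fma(a, b, c));  // exact -2^-60 survives

            // Order of operations: FP addition is not associative.
            double big = 1e16, small = 1.0;
            printf("(big + small) - big : %g\n", (big + small) - big);  // 2
            printf("(big - big) + small : %g\n", (big - big) + small);  // 1
        }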
     
  33. theBrownLlama

    theBrownLlama Limp Gawd

    Messages:
    343
    Joined:
    Aug 3, 2017
    Don't junk it, I will buy that defective card for 90% off.
     
    Stryker7314 likes this.
  34. Luke Wells

    Luke Wells n00bie

    Messages:
    23
    Joined:
    Jan 24, 2018
    Well... at least this card is cheap. :shrug:
     
    Stryker7314 likes this.
  35. DejaWiz

    DejaWiz Oracle of Unfortunate Truths

    Messages:
    18,853
    Joined:
    Apr 15, 2005
    How long after it's released until the fire sale? I'd pay a couple...nay, a few...hundred dollars for a pair of them.
     
    Stryker7314 likes this.
  36. N4CR

    N4CR 2[H]4U

    Messages:
    2,548
    Joined:
    Oct 17, 2011
    Nvidia: "You are installing it wrong."
     
  37. schoolslave

    schoolslave Limp Gawd

    Messages:
    492
    Joined:
    Dec 7, 2010
    Could this be ignorance about floating-point math?
    Also, it doesn't matter whether it's a CPU or GPU; the FP standard is the same.

    From Wikipedia:
    "Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors due to round-off."

    I recommend reading some of this: http://docs.nvidia.com/cuda/floating-point/index.html#floating-point
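    That quoted sentence is easy to demonstrate; a small sketch (loop count arbitrary): every single add below is correctly rounded to within half a ULP, yet the long chain still drifts.

        #include <cstdio>

        int main() {
            float x = 0.0f;
            for (int i = 0; i < 1000000; ++i)
                x += 0.1f;  // 0.1 has no exact binary representation
            // Every add above is correctly rounded, yet the total drifts
            // visibly from the true value of 100000.
            printf("sum      : %f\n", x);
            printf("expected : %f\n", 0.1 * 1000000);
        }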

    Of course, majority of "tech enthusiasts" here don't actually give a shit and just follow the manufactured outrage...
     
    GoldenTiger, Formula.350 and NoOther like this.
  38. Todd Walter

    Todd Walter Limp Gawd

    Messages:
    432
    Joined:
    May 10, 2016
    Of course, majority of "flippant commenters", such as yourself, aren't aware that you don't have to use IEEE 754 compliant calculations unless you compile for them. And even if you do, there's an "Inexact" exception that has to be handled.
     
    motomonkey likes this.
  39. Arcygenical

    Arcygenical Will Watercool for Crack

    Messages:
    25,682
    Joined:
    Jun 10, 2005
    Sooo does Nvidia have exclusive rights to the "can't math" moniker under GPP?
     
    Last edited: Mar 22, 2018
    Stryker7314 and Darth Kyrie like this.
  40. schoolslave

    schoolslave Limp Gawd

    Messages:
    492
    Joined:
    Dec 7, 2010
    So instead of having a discussion about the technical aspects of the hardware and what may be wrong, "the enthusiasts" devolve into fanboyism, baseless accusations/assumptions, and claiming that math should be done on CPUs -- all fueled by "news" articles intended to create outrage instead of fostering discussion.

    Also whether you're compiling for IEEE 754 or not, FP math has inherent inaccuracies which need to be handled/accounted for.
    But I'm not here to prove who knows more about FP math (I'm certainly no expert; I've just personally seen what happens when you ignore the limitations). I'm just highlighting what I think is a problem with how "news" is presented here.