GPU QRB analysis

Discussion in 'Distributed Computing' started by tear, Oct 30, 2012.

  1. tear

    tear [H]ard|DCer of the Year 2011

    I've promised an analysis but got carried away a bit in a conversation with a fellow team member,
    so that will need to wait... (sorry).

    I did get permission to quote our conversation, so please enjoy.

    A word of caution: I will ask the moderators to intervene should the discussion descend, even in
    the slightest, to the level of an e-peen contest.

    Code:
    04:32 <&tear> kendrak, to put it short
    04:32 <&tear> http://www.scalalife.eu/content/gpu-only-execution-gromacs
    04:32 <&tear> open this.
    04:32 <&tear> there's a chart close to the bottom of the page
    04:33 <@kendrak> Ok.....
    04:33 <&tear> now, parse this paragraph:
    04:34 <&tear> MD algorithms are complex, and although the Gromacs code is highly tuned for them, they often do not translate very well onto the streaming architectures. 
                  Realistic expectations about the achievable speed-up from tests with GTX280: for small protein systems in implicit solvent using all-vs-all kernels the 
                  acceleration can be as high as 20 times, but in most other setups involving cutoffs and PME the acceleration is usually only about 5 times relative to a 3GHz CPU 
                  core. 
    04:35 <@kendrak> Core as in singular
    04:35 <@kendrak> Not a 48 core boxen
    04:35 <&tear> yes
    04:35 <&tear> so the chart illustrates these two types of simulations
    04:36 <&tear> first three simulations illustrate the '20x' from the paragraph
    04:36 <@kendrak> Yes, get that
    04:36 <&tear> the last two -- the '5x'
    04:36 <@kendrak> I see that
    04:36 <&tear> now
    04:37 <&tear> all GPU projects out there are of 20x type.
    04:37 <&tear> there is no GPU unit that I know of that is of 5x type
    04:37 <&tear> (simplifying)
    04:37 <@kendrak> Not a surprise
    04:37 <&tear> at least not in circulation at this time
    04:38 <&tear> now
    04:40 <&tear> similarly, all SMP projects I've dealt with are of '5x' type
    04:41 <&tear> (that will become interesting later)
    04:41 <@kendrak> Ok.....
    04:41 <&tear> but for now, let's assume that we have a mix of 5x and 20x projects
    04:41 <&tear> and that both types are being served to both GPUs and CPUs
    04:41 <&tear> ==
    04:41 <@kendrak> Talk about a PPD rollercoaster
    04:41 <&tear> if PG uses SMP as a 'global' benchmark
    04:42 <@kendrak> Ok....
    04:42 <&tear> that means the SMP rigs get relatively flat PPD output across the board
    04:42 <&tear> (for both '5x' and '20x')
    04:42 <&tear> (a consequence of SMP being a benchmark)
    04:42 <&tear> but... GPUs get ridiculously high points for 20x units
    04:42 <@kendrak> Understand
    04:42 <&tear> because they calculate them so well
    04:43 <&tear> that creates an incentive for GPU users to cherry-pick WUs
    04:43 <&tear> =====
    04:43 <@kendrak> That will get messy quick
    04:44 <&tear> now, imagine they do the opposite
    04:44 <&tear> use GPU as a baseline
    04:44 <&tear> GPU PPD output becomes ~flat
    04:44 <&tear> but some units (20x units) get penalized on SMP
    04:45 <&tear> == create incentive for SMP to only fold 5x units [!]
    04:46 <&tear> with these two types of units
    04:46 <&tear> it's impossible to satisfy these two at the same time:
    04:46 <&tear> 1. Equal points for equal work
    04:46 <&tear> 2. ~Flat PPD output for a given hardware
    04:46 <&tear> ====
    04:46 <&tear> I would honestly recommend keeping '20x' units to GPUs
    04:47 <&tear> and keeping 5x to both SMP and GPUs
    04:47 <&tear> ===
    04:48 <@kendrak> So a 48 core system at 2.0ghz.....
    04:48 <&tear> 20x units scale very badly on SMP, too.
    04:49 <&tear> 2.4 GHz 12 core == TPF of 15m and some seconds (on an 8057 -- '20x' unit)
    04:49 <&tear> 3.0 GHz 48 core == TPF of 8m 
    04:49 <&tear> (on the same unit)
    04:49 <&tear> (even less reason to keep them on CPUs)
    04:51 <@kendrak> I understand the points tear
    04:52 <&tear> kendrak, now, what's even more funny
    04:52 <@kendrak> However.... the community will only want to do 20x GPU
    04:52 <&tear> kendrak, not if points are flat on the GPU
    04:53 <&tear> kendrak, but wait for this...
    04:53  * kendrak waits for it
    04:53 <&tear> kendrak, I'm getting 6k PPD on 8057 on my desktop
    04:53 <&tear> kendrak, regular SMP units are getting 26k PPD
    04:54 <&tear> imagine what PPD GPUs would be getting now
    04:54 <&tear> if points were such that I'd be getting 26k PPD on 8057, not 6k :)
    04:54 <&tear> the gap is so BIG that it just doesn't make sense to feed these 20x units to SMP _at_ _all_.
    04:56 <&tear> points would be even higher than the 50-250k they're getting now [!]
    04:57 <@kendrak> Or do they just hand out points like penny candy?
    04:57 <@kendrak> If they keep the WU on "optimized" hardware....
    04:57 <&tear> I've no idea what their intents are
    04:57 <&tear> or if they realize consequences..
    04:58 <&tear> hell, they are researchers so they sure should know what the performance difference is on '20x' type
    04:58 <@kendrak> Only GPUs running 20x WU will be the focus
    04:59 <&tear> the fact that we only have '20x' type units on the GPU (only one with bonus)
    04:59 <@kendrak> Not sure they comprehend what giving 2/3rds the ppd to a gtx 580 vs a 4P does to their SMP potential
    04:59 <&tear> could lead some to conspiracy theories :)
    04:59 <@kendrak> Could.....
    04:59 <&tear> ah, yes, that's another interesting consequence!
    04:59 <@kendrak> Very diplomatic of you
    04:59 <&tear> I'm trying :)
    05:00 <&tear> now, look at this
    05:00 <&tear> top dawg folding 8057 gets 250k PPD
    05:00 <@kendrak> Mmm hum
    05:00 <&tear> with power consumption of about 250W card + 100W system -- 350W
    05:00 <&tear> +/-
    05:01 <&tear> we're talking ballpark figures so we don't need to be super accurate
    05:01 <&tear> that's 714 PPD/Watt
    05:01 <@kendrak> About where a good 4P sits
    05:01 <&tear> not exactly
    05:01 <&tear> a _very_ good 4P gets about 500k PPD with 8101 (and 800W of power)
    05:02 <&tear> that's 625 PPD/Watt [!]
    05:03 <@kendrak> So we have GPUs as a new PPD/w king (in general)
    05:03 <&tear> so 4P 8101 folders (read: majority) are getting screwed by lucky/cherry-picking GPU folders
    05:03 <&tear> 8102s are better but only a little bit.
    05:04 <@kendrak> Then the luck/cherry goes away and it becomes the norm
    05:04 <@kendrak> If only 20x WU are given to GPUs
    05:04 <&tear> correct.
    05:05 <&tear> ==
    05:05 <@kendrak> And SMP will go dark (for the most part)
    05:05 <&tear> yup, completely dark.
    05:05 <&tear> there will be no incentive for "regular SMP" any more
    05:05 <&tear> forget all your 2700k, 3960X ...
    05:06 <@kendrak> And the science is limited, because only one type of math is worth doing
    05:06 <&tear> kendrak, not really
    05:06 <&tear> kendrak, the rest is '5x'
    05:06 <&tear> kendrak, so it's still very good on GPU
    05:07 <&tear> kendrak, *if* they introduce 5x to the GPU
    05:07 <&tear> (5x units)
    05:07 <@kendrak> I don't think they will....
    05:07 <&tear> that I do not know
    05:07 <&tear> but I think they should be made aware of potential implications :)
    05:07 <&tear> (if they aren't already)
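
    For anyone who wants to sanity-check the efficiency figures quoted above, here's a minimal Python sketch of the same ballpark arithmetic (the PPD and wattage numbers are the rough estimates from the conversation, not measurements):

    Code:
    # Ballpark PPD-per-watt comparison using the rough estimates quoted in the chat above.
    def ppd_per_watt(ppd, watts):
        return ppd / watts

    # Top GPU folding 8057: ~250k PPD on ~350 W (250 W card + 100 W system)
    print(round(ppd_per_watt(250_000, 350)))  # -> 714 PPD/W

    # A very good 4P on 8101: ~500k PPD on ~800 W
    print(round(ppd_per_watt(500_000, 800)))  # -> 625 PPD/W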
     
  2. Untitledone

    Untitledone [H]ard|DCer of the Month - April 2012

    I will look forward to further analysis of this topic. It does seem like a conundrum.

    Link to article Tear is referencing: http://www.scalalife.eu/content/gpu-only-execution-gromacs

    Also, direct link to the chart in question:

    [Chart from the ScalaLife article: GPU speed-up for five Gromacs simulations relative to a 3 GHz CPU core]


    Untitledone's Synopsis at this point:

    If the points currently assigned to this new GPU unit become the norm, the points incentive will direct us toward GPU folding. I am not against this; I will go where the science is needed. However, if these new numbers become the norm, the points disparity between GPU and SMP will be so great that all but a few of the systems currently running SMP will go dark. Current SMP projects that are capable of running on GPU can be ported over at the developers' choosing. It must be noted, though, that not every project can be ported to GPU, because not all functions are supported on GPU. The bigadv work units will remain exclusive to SMP for the time being due to limitations of the GPU hardware.
     
    Last edited: Oct 30, 2012
  3. 402blownstroker

    402blownstroker [H]ard|DCer of the Month - Nov. 2012

    Hip boots on and bucket of popcorn ready.
     
  4. Kendrak

    Kendrak [H]ard|DCer of the Year 2009

    It is an interesting situation.
     
  5. rhavern

    rhavern [H]ard|DCer of the Month - Apr. 2013/Oct. 2014

    If do
    then damned;
    If don't
    then damned;
    Whinge;
    Rinse and repeat
     
  6. Amaruk

    Amaruk n00bie

    Untitledone, your synopsis is spot on.

    At the current value for 8057, if I add a second GTX 680 to my current 3770K+680 rig, it would generate 350,000+ PPD on about 400 watts.

    That's 875 PPD/W.

    Four Keplers (660 Ti, 670 or 680) in one system (i.e., a K9A2) should break 1000 PPD/W quite easily.


    tear and Kendrak, thanks for the transcript. :)

    I've done some analysis of the stats for 8057 which appears to mirror your own findings.

    Current stats for 8057:

    Project 8057
    Points: 2549
    Preferred: 6.86 days
    Final: 10.00 days
    k-factor: 0.75


    I used the report of the GTX580@865/1000 referenced in another thread (TPF 00:01:23, 230,716 PPD) as a baseline.

    Please note: The following examples all follow published SMP benchmarking practices as provided by PG.


    First, let's assume the k-factor and deadline are correct and adjust base points accordingly.
    Resulting base points are 282.5 and the GTX 580 credit would be 2496.128 for 25,983.794 PPD.


    Now, let's assume both the points and k-factor are off and base them on just the deadline.
    The result is a k-factor of 3 and 1130 base points.
    GTX 580 credit would be 19,969.027 for 207,870.352 PPD.


    Next, let's assume base points and deadline are correct and adjust the k-factor accordingly.
    Resulting k-factor is 6.767 and the GTX 580 credit would be 67,652.796 for 704,241.16 PPD.


    Finally, if we base the stats solely on the base points, the k-factor would be 3 and the final deadline becomes 22.56 days.
    GTX 580 credit would be 67,657.795 for 704,293.193 PPD.
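
    For reference, all four scenarios above are consistent with the commonly published QRB relationship, credit = base_points * sqrt(k * final_deadline / elapsed), with PPD scaled from the WU turnaround time. A minimal Python sketch (assuming that relationship and a 100-frame WU; the function and parameter names are mine) reproduces the figures to within rounding:

    Code:
    import math

    def qrb(base_points, k, deadline_days, tpf_sec, frames=100):
        """Sketch of the QRB math: credit = base * sqrt(k * deadline / elapsed)."""
        elapsed_days = tpf_sec * frames / 86400.0        # WU turnaround time in days
        credit = base_points * math.sqrt(k * deadline_days / elapsed_days)
        ppd = credit * 86400.0 / (tpf_sec * frames)      # credits earned per day at this pace
        return credit, ppd

    tpf = 83  # GTX 580 @ 865/1000, TPF 00:01:23

    print(qrb(282.5, 0.75, 10.00, tpf))  # scenario 1: ~(2496, 25984)
    print(qrb(1130, 3, 10.00, tpf))      # scenario 2: ~(19969, 207870)
    print(qrb(2549, 6.767, 10.00, tpf))  # scenario 3: ~(67653, 704241)
    print(qrb(2549, 3, 22.56, tpf))      # scenario 4: ~(67658, 704294)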


    It's interesting that the difference between the current and projected PPD from the last two calculations mirrors tear's experience running SMP.



    At this point I would be fairly comfortable guesstimating an SMP bench TPF of 00:32:30 for 8057, generating the following specs:

    Project: 8057
    Base points: 2550
    Deadline: 22.57 days
    Preferred deadline: 13.54 days
    k-factor: 3


    For the GTX 580 running TPF of 00:01:23, this yields 67,699.34 credit per WU, for 704,725.63 PPD
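
    Plugging those guesstimated specs into the same sketch above gives essentially the same result:

    Code:
    print(qrb(2550, 3, 22.57, 83))  # ~(67699 credit, 704726 PPD)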


    Ultimately it appears that of the four key components, at least the base points are accurate.

    On the bright side, this also means the entire thing wasn't pulled from the south end of a northbound donkey... ;)


    Another interesting tidbit - GPU benchmarking has had similar problems in the past. Because of the differences in architecture, ATI/AMD has historically done better than nvidia as the WUs got bigger. Since the original GPU benchmark was based on an ATI card, every time larger projects were released nvidia PPD would drop and folders running nvidia GPUs would flock to FF to complain. This happened repeatedly until PG started benching on a Fermi.


    Looks like that issue is back, and in a BIG way...
     
  7. pjkenned

    pjkenned [H]ard|Gawd

    What's really interesting is what will happen with Xeon Phi... you get rid of all the CUDA effort and just use OpenMP, which works on both CPUs and Xeon Phi cards.
     
  8. 402blownstroker

    402blownstroker [H]ard|DCer of the Month - Nov. 2012

    MPI was the previous implementation for running multiple processes on a WU. Memory access across processes is a huge ding; that is why it was abandoned. I do not see them going back to it.
     
  9. Skripka

    Skripka [H]ardForum Junkie

    Math analysis like a bawss! Well explained gents!
     
  10. jebo_4jc

    jebo_4jc [H]ard|DCer of the Month - April 2011

    This looks good. Subbing so I can read it later
     
  11. jebo_4jc

    jebo_4jc [H]ard|DCer of the Month - April 2011

    I really need to read this thread!
     
    Last edited: Nov 20, 2012
  12. tear

    tear [H]ard|DCer of the Year 2011

  13. jebo_4jc

    jebo_4jc [H]ard|DCer of the Month - April 2011

    Lol. Inadvertent question mark. I fixed it now...