I promised analysis but got a bit carried away in a conversation with a fellow team member, so that will have to wait (sorry). I did get permission to quote our conversation, so please enjoy. A word of caution: I will ask moderators to intervene should the discussion descend into an e-peen contest even in the slightest.

Code:
04:32 <&tear> kendrak, to put it short
04:32 <&tear> http://www.scalalife.eu/content/gpu-only-execution-gromacs
04:32 <&tear> open this.
04:32 <&tear> there's a chart close to the bottom of the page
04:33 <@kendrak> Ok.....
04:33 <&tear> now, parse this paragraph:
04:34 <&tear> MD algorithms are complex, and although the Gromacs code is highly tuned for them, they often do not translate very well onto the streaming architectures. Realistic expectations about the achievable speed-up from tests with GTX280: for small protein systems in implicit solvent using all-vs-all kernels the acceleration can be as high as 20 times, but in most other setups involving cutoffs and PME the acceleration is usually only about 5 times relative to a 3GHz CPU core.
04:35 <@kendrak> Core as in singular
04:35 <@kendrak> Not a 48 core boxen
04:35 <&tear> yes
04:35 <&tear> so the chart illustrates these two types of simulations
04:36 <&tear> first three simulations illustrate the '20x' from the paragraph
04:36 <@kendrak> Yes, get that
04:36 <&tear> the last two -- the '5x'
04:36 <@kendrak> I see that
04:36 <&tear> now
04:37 <&tear> all GPU projects out there are of 20x type.
04:37 <&tear> there is no GPU unit that I know of that is of 5x type
04:37 <&tear> (simplifying)
04:37 <@kendrak> Not a surprise
04:37 <&tear> at least not in circulation at this time
04:38 <&tear> now
04:40 <&tear> similarly, all SMP projects I've dealt with are of '5x' type
04:41 <&tear> (that will become interesting later)
04:41 <@kendrak> Ok.....
04:41 <&tear> but for now, let's assume that we have a mix of 5x and 20x projects
04:41 <&tear> and that both types are being served to both GPUs and CPUs
04:41 <&tear> ==
04:41 <@kendrak> Talk about a PPD rollercoaster
04:41 <&tear> if PG uses SMP as a 'global' benchmark
04:42 <@kendrak> Ok....
04:42 <&tear> that means the SMP rigs get relatively flat PPD output across the board
04:42 <&tear> (for both '5x' and '20x')
04:42 <&tear> (a consequence of SMP being a benchmark)
04:42 <&tear> but... GPUs get ridiculously high points for 20x units
04:42 <@kendrak> Understand
04:42 <&tear> because they calculate them so well
04:43 <&tear> that creates an incentive for GPU users to cherry-pick WUs
04:43 <&tear> =====
04:43 <@kendrak> That will get messy quick
04:44 <&tear> now, imagine they do the opposite
04:44 <&tear> use GPU as a baseline
04:44 <&tear> GPU PPD output becomes ~flat
04:44 <&tear> but some units (20x units) get penalized on SMP
04:45 <&tear> == create incentive for SMP to only fold 5x units [!]
04:46 <&tear> with these two types of units
04:46 <&tear> it's impossible to satisfy these two at the same time:
04:46 <&tear> 1. Equal points for equal work
04:46 <&tear> 2. ~Flat PPD output for a given hardware
04:46 <&tear> ====
04:46 <&tear> I would honestly recommend keeping '20x' units to GPUs
04:47 <&tear> and keeping 5x to both SMP and GPUs
04:47 <&tear> ===
04:48 <@kendrak> So a 48 core system at 2.0ghz.....
04:48 <&tear> 20x units scale very bad on SMP, too.
04:49 <&tear> 2.4 GHz 12 core == TPF of 15m and some seconds (on an 8057 -- '20x' unit)
04:49 <&tear> 3.0 GHz 48 core == TPF of 8m
04:49 <&tear> (on the same unit)
04:49 <&tear> (even less reason to keep them on CPUs)
04:51 <@kendrak> I understand the points tear
04:52 <&tear> kendrak, now, what's even more funny
04:52 <@kendrak> However.... the community will only want to do 20x GPU
04:52 <&tear> kendrak, not if points are flat on the GPU
04:53 <&tear> kendrak, but wait for this...
04:53 * kendrak waits for it
04:53 <&tear> kendrak, I'm getting 6k PPD on 8057 on my desktop
04:53 <&tear> kendrak, regular SMP units are getting 26k PPD
04:54 <&tear> imagine what PPD GPUs would be getting now
04:54 <&tear> if points were such that I'd be getting 26k PPD on 8057, not 6k :)
04:54 <&tear> the gap is so BIG that it just doesn't make sense to feed these 20x units to SMP _at_ _all_.
04:56 <&tear> points would be even higher than the 50-250k they're getting now [!]
04:57 <@kendrak> Or do they just hand out points like penny candy?
04:57 <@kendrak> If they keep the WU on "optimized" hardware....
04:57 <&tear> I've no idea what their intents are
04:57 <&tear> or if they realize consequences..
04:58 <&tear> hell, they are researchers so they sure should know what performance difference is on '20x' type
04:58 <@kendrak> Only GPUs running 20x WU will be the focus
04:59 <&tear> the fact that we only have '20x' type units on the GPU (only one with bonus)
04:59 <@kendrak> Not sure they comprehend what giving 2/3rds the ppd to a gtx 580 vs a 4P does to their SMP potential
04:59 <&tear> could lead some to conspiracy theories :)
04:59 <@kendrak> Could.....
04:59 <&tear> ah, yes, that's another interesting consequence!
04:59 <@kendrak> Very diplomatic of you
04:59 <&tear> I'm trying :)
05:00 <&tear> now, look at this
05:00 <&tear> top dawg folding 8057 gets 250k PPD
05:00 <@kendrak> Mmm hum
05:00 <&tear> with power consumption of about 250W card + 100W system -- 350W
05:00 <&tear> +/-
05:01 <&tear> we're talking ballpark figures so we don't need to be super accurate
05:01 <&tear> that's 714 PPD/Watt
05:01 <@kendrak> About where a good 4P sits
05:01 <&tear> not exactly
05:01 <&tear> a _very_ good 4P gets about 500k PPD with 8101 (and 800W of power)
05:02 <&tear> that's 625 PPD/Watt [!]
05:03 <@kendrak> So we have GPUs as a new PPD/W king (in general)
05:03 <&tear> so 4P 8101 folders (read: majority) are getting screwed by lucky/cherry-picking GPU folders
05:03 <&tear> 8102s are better but only a little bit.
05:04 <@kendrak> Then the luck/cherry goes away and it becomes the norm
05:04 <@kendrak> If only 20x WU are given to GPUs
05:04 <&tear> correct.
05:05 <&tear> ==
05:05 <@kendrak> And SMP will go dark (for the most part)
05:05 <&tear> yup, completely dark.
05:05 <&tear> there will be no incentive for "regular SMP" any more
05:05 <&tear> forget all your 2700k, 3960X ...
05:06 <@kendrak> And the science is limited, because only one type of math is worth doing
05:06 <&tear> kendrak, not really
05:06 <&tear> kendrak, the rest is '5x'
05:06 <&tear> kendrak, so it's still very good on GPU
05:07 <&tear> kendrak, *if* they introduce 5x to the GPU
05:07 <&tear> (5x units)
05:07 <@kendrak> I don't think they will....
05:07 <&tear> that I do not know
05:07 <&tear> but I think they should be made aware of potential implications :)
05:07 <&tear> (if they aren't already)
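tear's benchmark argument can be sketched as a toy model. The 5x/20x GPU speedups and the poor SMP scaling on 20x units follow the chat; the 8-core SMP rig and the 1000-PPD target are arbitrary placeholders of mine, not real project numbers:

```python
# Relative throughput (WUs per day), normalized so one 3 GHz CPU core = 1.
SPEEDUP = {
    "smp": {"5x": 8, "20x": 4},    # placeholder 8-core rig; 20x units scale poorly on SMP
    "gpu": {"5x": 5, "20x": 20},   # the two GPU regimes from the chat
}

def benchmark_points(bench_hw, target_ppd=1000):
    """Per-WU points chosen so the benchmark hardware earns target_ppd on every unit type."""
    return {u: target_ppd / SPEEDUP[bench_hw][u] for u in ("5x", "20x")}

def ppd(points, hw, unit):
    """Points per day = points per WU times WUs completed per day."""
    return points[unit] * SPEEDUP[hw][unit]

for bench in ("smp", "gpu"):
    pts = benchmark_points(bench)
    for hw in ("smp", "gpu"):
        print(bench, hw, {u: ppd(pts, hw, u) for u in ("5x", "20x")})
```

Benchmarking on SMP leaves SMP flat (1000/1000) but pays the GPU 5000 PPD on 20x units versus 625 on 5x, so GPU folders cherry-pick; benchmarking on the GPU leaves the GPU flat but pays SMP only 200 PPD on 20x versus 1600 on 5x, so SMP folders shun 20x. That is exactly the two-sided bind described in the transcript.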
I will look forward to further analysis of this topic. It does seem like a conundrum.

Link to the article tear is referencing: http://www.scalalife.eu/content/gpu-only-execution-gromacs

Also, direct link to the chart in question:

Highlights:

Untitledone's synopsis at this point: If the points currently assigned to this new GPU unit become the norm, the points incentive will direct us towards GPU folding. I am not against this; I will go where the science is needed. However, the points disparity between GPU and SMP, if these new numbers become the norm, will be great, and all but a few systems running SMP will go dark. Current SMP projects that are capable of running on GPU can be ported over at the developers' choosing. However, it must be noted that not all projects can be ported to GPU, because not all functions are supported on GPU. The bigadv work units will be exclusive to SMP for the time being due to limitations of the GPU hardware.
Untitledone, your synopsis is spot on. At the current value for 8057, if I add a second GTX 680 to my current 3770K+680 rig it would generate 350,000+ PPD on about 400 watts. That's 875 PPD/W. Four Keplers (660 Ti, 670 or 680) in one system (e.g. a K9A2) should break 1000 PPD/W quite easily.

tear and Kendrak, thanks for the transcript. I've done some analysis of the stats for 8057 which appears to mirror your own findings.

Current stats for 8057:
Project: 8057
Points: 2549
Preferred: 6.86 days
Final: 10.00 days
k-factor: 0.75

I used the report of the GTX 580 @ 865/1000 referenced in another thread (TPF 00:01:23, 230,716 PPD) as a baseline. Please note: the following examples all follow published SMP benchmarking practices as provided by PG.

First, let's assume the k-factor and deadline are correct and adjust base points accordingly. Resulting base points are 282.5, and the GTX 580 credit would be 2,496.128 for 25,983.794 PPD.

Now, let's assume both the points and k-factor are off and base them on just the deadline. The result is a k-factor of 3 and 1130 base points. GTX 580 credit would be 19,969.027 for 207,870.352 PPD.

Next, let's assume base points and deadline are correct and adjust the k-factor accordingly. The resulting k-factor is 6.767, and the GTX 580 credit would be 67,652.796 for 704,241.16 PPD.

Finally, if we base the stats solely on the base points, the k-factor would be 3 and the final deadline becomes 22.56 days. GTX 580 credit would be 67,657.795 for 704,293.193 PPD.

It's interesting that the difference between the current and projected PPD from the last two calculations mirrors tear's experience running SMP.
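For anyone who wants to check the scenarios above, they are consistent with the standard k-factor bonus formula, credit = base_points × max(1, sqrt(k × deadline / wu_time)), assuming the usual 100 frames per WU. A minimal sketch reproducing two of the figures:

```python
import math

FRAMES = 100            # a WU reports progress in 100 frames (percent steps)
SECONDS_PER_DAY = 86400

def wu_credit(base_points, k, deadline_days, tpf_seconds):
    """k-factor bonus credit: base * max(1, sqrt(k * deadline / wu_days))."""
    wu_days = tpf_seconds * FRAMES / SECONDS_PER_DAY
    return base_points * max(1.0, math.sqrt(k * deadline_days / wu_days))

def ppd(base_points, k, deadline_days, tpf_seconds):
    """Points per day: credit per WU divided by days per WU."""
    wu_days = tpf_seconds * FRAMES / SECONDS_PER_DAY
    return wu_credit(base_points, k, deadline_days, tpf_seconds) / wu_days

tpf = 1 * 60 + 23  # GTX 580 at TPF 00:01:23, in seconds

# First scenario: base 282.5, current k-factor and deadline.
print(round(wu_credit(282.5, 0.75, 10.0, tpf), 1))  # ~2496 credit
print(round(ppd(282.5, 0.75, 10.0, tpf)))           # ~25,984 PPD

# Third scenario: current base and deadline, adjusted k-factor.
print(round(wu_credit(2549, 6.767, 10.0, tpf), 1))  # ~67,653 credit
```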
At this point I would be fairly comfortable guesstimating an SMP bench TPF of 00:32:30 for 8057, generating the following specs:

Project: 8057
Base points: 2550
Deadline: 22.57 days
Preferred deadline: 13.54 days
k-factor: 3

For the GTX 580 running a TPF of 00:01:23, this yields 67,699.34 credit per WU, for 704,725.63 PPD.

Ultimately, it appears that of the four key components, at least the base points are accurate. On the bright side, this also means the entire thing wasn't pulled from the south end of a northbound donkey...

Another interesting tidbit: GPU benchmarking has had similar problems in the past. Because of the differences in architecture, ATI/AMD has historically done better than nvidia as the WUs got bigger. Since the original GPU benchmark was based on an ATI card, every time larger projects were released nvidia PPD would drop and folders running nvidia GPUs would flock to FF to complain. This happened repeatedly until PG started benching on a Fermi. Looks like that issue is back, and in a BIG way...
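The guesstimated spec hangs together if the final and preferred deadlines are taken as 10x and 6x the bench machine's run time (the 10x/6x multipliers are my inference from the numbers, not a published PG figure). A quick consistency check, again assuming 100 frames per WU:

```python
import math

FRAMES = 100
SECONDS_PER_DAY = 86400

def wu_credit(base_points, k, deadline_days, tpf_seconds):
    """k-factor bonus credit: base * max(1, sqrt(k * deadline / wu_days))."""
    wu_days = tpf_seconds * FRAMES / SECONDS_PER_DAY
    return base_points * max(1.0, math.sqrt(k * deadline_days / wu_days))

# Guesstimated SMP bench TPF of 00:32:30 implies the quoted deadlines.
bench_wu_days = (32 * 60 + 30) * FRAMES / SECONDS_PER_DAY
print(round(10 * bench_wu_days, 2))  # final deadline: 22.57 days
print(round(6 * bench_wu_days, 2))   # preferred deadline: 13.54 days

# And the GTX 580 (TPF 00:01:23) under that spec lands on the quoted credit.
gtx580_tpf = 1 * 60 + 23
c = wu_credit(2550, 3, 22.57, gtx580_tpf)
print(round(c, 1))                                       # ~67,699 credit per WU
print(round(c / (gtx580_tpf * FRAMES / SECONDS_PER_DAY)))  # ~705k PPD
```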
What's really interesting is what will happen with Xeon Phi... you get rid of all the CUDA effort and just use OpenMP, which works on both CPUs and Xeon Phi cards.
MPI was the previous implementation for running a WU across multiple processes. Memory access across processes is a huge ding; that is why it was abandoned. I do not see them going back to it.