GPU QRB analysis

tear

I've promised analysis but got carried away a bit in a conversation with a fellow team member, so that will need to wait... (sorry).

I did get permission to quote our conversation so, please, enjoy.

A word of caution: I will ask the moderators to intervene should the discussion descend, even in the slightest, into an e-peen contest.

Code:
04:32 <&tear> kendrak, to put it short
04:32 <&tear> http://www.scalalife.eu/content/gpu-only-execution-gromacs
04:32 <&tear> open this.
04:32 <&tear> there's a chart close to the bottom of the page
04:33 <@kendrak> Ok.....
04:33 <&tear> now, parse this paragraph:
04:34 <&tear> MD algorithms are complex, and although the Gromacs code is highly tuned for them, they often do not translate very well onto the streaming architectures.
              Realistic expectations about the achievable speed-up from tests with GTX280: for small protein systems in implicit solvent using all-vs-all kernels the 
              acceleration can be as high as 20 times, but in most other setups involving cutoffs and PME the acceleration is usually only about 5 times relative to a 3GHz CPU 
              core. 
04:35 <@kendrak> Core as in singular
04:35 <@kendrak> Not a 48 core boxen
04:35 <&tear> yes
04:35 <&tear> so the chart illustrates these two types of simulations
04:36 <&tear> first three simulations illustrate the '20x' from the paragraph
04:36 <@kendrak> Yes, get that
04:36 <&tear> the last two -- the '5x'
04:36 <@kendrak> I see that
04:36 <&tear> now
04:37 <&tear> all GPU projects out there are of 20x type.
04:37 <&tear> there is no GPU unit that I know of that is of 5x type
04:37 <&tear> (simplifying)
04:37 <@kendrak> Not a surprise
04:37 <&tear> at least not in circulation at this time
04:38 <&tear> now
04:40 <&tear> similarly, all SMP projects I've dealt with are of '5x' type
04:41 <&tear> (that will become interesting later)
04:41 <@kendrak> Ok.....
04:41 <&tear> but for now, let's assume that we have a mix of 5x and 20x projects
04:41 <&tear> and that both types are being served to both GPUs and CPUs
04:41 <&tear> ==
04:41 <@kendrak> Talk about a PPD rollercoaster
04:41 <&tear> if PG uses SMP as a 'global' benchmark
04:42 <@kendrak> Ok....
04:42 <&tear> that means the SMP rigs get relatively flat PPD output across the board
04:42 <&tear> (for both '5x' and '20x')
04:42 <&tear> (a consequence of SMP being a benchmark)
04:42 <&tear> but... GPUs get ridiculously high points for 20x units
04:42 <@kendrak> Understand
04:42 <&tear> because they calculate them so well
04:43 <&tear> that creates an incentive for GPU users to cherry-pick WUs
04:43 <&tear> =====
04:43 <@kendrak> That will get messy quick
04:44 <&tear> now, imagine they do the opposite
04:44 <&tear> use GPU as a baseline
04:44 <&tear> GPU PPD output becomes ~flat
04:44 <&tear> but some units (20x units) get penalized on SMP
04:45 <&tear> == create incentive for SMP to only fold 5x units [!]
04:46 <&tear> with these two types of units
04:46 <&tear> it's impossible to satisfy these two at the same time:
04:46 <&tear> 1. Equal points for equal work
04:46 <&tear> 2. ~Flat PPD output for a given hardware
04:46 <&tear> ====
04:46 <&tear> I would honestly recommend keeping '20x' units to GPUs
04:47 <&tear> and keeping 5x to both SMP and GPUs
04:47 <&tear> ===
04:48 <@kendrak> So a 48 core system at 2.0ghz.....
04:48 <&tear> 20x units scale very bad on SMP, too.
04:49 <&tear> 2.4 GHz 12 core == TPF of 15m and some seconds (on an 8057 -- '20x' unit)
04:49 <&tear> 3.0 GHz 48 core == TPF of 8m 
04:49 <&tear> (on the same unit)
04:49 <&tear> (even less reason to keep them on CPUs)
04:51 <@kendrak> I understand the points tear
04:52 <&tear> kendrak, now, what's even more funny
04:52 <@kendrak> However.... the community will only want to do 20x GPU
04:52 <&tear> kendrak, not if points are flat on the GPU
04:53 <&tear> kendrak, but wait for this...
04:53  * kendrak waits for it
04:53 <&tear> kendrak, I'm getting 6k PPD on 8057 on my desktop
04:53 <&tear> kendrak, regular SMP units are getting 26k PPD
04:54 <&tear> imagine what PPD GPUs would be getting now
04:54 <&tear> if points were such that I'd be getting 26k PPD on 8057, not 6k :)
04:54 <&tear> the gap is so BIG that it just doesn't make sense to feed these 20x units to SMP _at_ _all_.
04:56 <&tear> points would be even higher than the 50-250k they're getting now [!]
04:57 <@kendrak> Or do they just hand out points like penny candy?
04:57 <@kendrak> If they keep the WU on "optimized" hardware....
04:57 <&tear> I've no idea what their intents are
04:57 <&tear> or if they realize consequences..
04:58 <&tear> hell, they are researchers so they sure should know what performance difference is on  '20x' type
04:58 <@kendrak> Only GPUs running 20x WU will be the focus
04:59 <&tear> the fact that we only have '20x' type units on the GPU (only one with bonus)
04:59 <@kendrak> Not sure they comprehend what giving 2/3rds the ppd to a gtx 580 vs a 4P does to their SMP potential
04:59 <&tear> could lead some to conspiracy theories :)
04:59 <@kendrak> Could.....
04:59 <&tear> ah, yes, that's another interesting consequence!
04:59 <@kendrak> Very diplomatic of you
04:59 <&tear> I'm trying :)
05:00 <&tear> now, look at this
05:00 <&tear> top dawg folding 8057 gets 250k PPD
05:00 <@kendrak> Mmm hum
05:00 <&tear> with power consumption of about 250W card + 100W system -- 350W
05:00 <&tear> +/-
05:01 <&tear> we're talking ballpark figures so we don't need to be super accurate
05:01 <&tear> that's 714 PPD/Watt
05:01 <@kendrak> About where a good 4P sits
05:01 <&tear> not exactly
05:01 <&tear> a _very_ good 4P gets about 500k PPD with 8101 (and 800W of power)
05:02 <&tear> that's 625 PPD/Watt [!]
05:03 <@kendrak> So we have GPUs as a new PPD/w king (in general)
05:03 <&tear> so 4P 8101 folders (read: majority) are getting screwed by lucky/cherry-picking GPU folders
05:03 <&tear> 8102s are better but only a little bit.
05:04 <@kendrak> Then the luck/cherry goes away and it becomes the norm
05:04 <@kendrak> If only 20x WU are given to GPUs
05:04 <&tear> correct.
05:05 <&tear> ==
05:05 <@kendrak> And SMP will go dark (for the most part)
05:05 <&tear> yup, completely dark.
05:05 <&tear> there will be no incentive for "regular SMP" any more
05:05 <&tear> forget all your 2700k, 3960X ...
05:06 <@kendrak> And the science is limited, because only one type of math is worth doing
05:06 <&tear> kendrak, not really
05:06 <&tear> kendrak, the rest is '5x'
05:06 <&tear> kendrak, so it's still very good on GPU
05:07 <&tear> kendrak, *if* they introduce 5x to the GPU
05:07 <&tear> (5x units)
05:07 <@kendrak> I don't think they will....
05:07 <&tear> that I do not know
05:07 <&tear> but I think they should be made aware of potential implications :)
05:07 <&tear> (if they aren't already)
 
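For readers who prefer numbers to prose, here is a minimal sketch (Python) of the asymmetry tear describes above. The 5x/20x speedups come straight from the Gromacs article; the per-unit rates and the 10,000 PPD target are made-up, illustrative values, not real project numbers.

Code:
# Toy model of the '5x' vs '20x' problem from the transcript.
# Speedups are relative to a single 3 GHz CPU core (per the Gromacs article);
# the WU rates and PPD target are illustrative only.

SPEEDUP = {                               # WUs completed per day (illustrative)
    "SMP": {"5x": 5.0, "20x": 5.0},       # SMP behaves like the '5x' case on both
    "GPU": {"5x": 5.0, "20x": 20.0},      # GPU flies on '20x' (implicit-solvent) units
}
TARGET_PPD = 10_000                       # PPD the benchmark machine is meant to earn

def points_per_wu(benchmark, unit):
    # Points are set so the benchmark machine earns TARGET_PPD on every unit type.
    return TARGET_PPD / SPEEDUP[benchmark][unit]

for benchmark in ("SMP", "GPU"):
    print(f"Benchmark machine: {benchmark}")
    for unit in ("5x", "20x"):
        pts = points_per_wu(benchmark, unit)
        for hw in ("SMP", "GPU"):
            ppd = pts * SPEEDUP[hw][unit]
            print(f"  {unit:>3} unit on {hw}: {ppd:8,.0f} PPD")

With SMP as the benchmark, the GPU earns 40,000 PPD on '20x' units versus 10,000 on '5x' (the cherry-picking incentive); with the GPU as the benchmark, SMP drops to 2,500 PPD on '20x' units (the "only fold 5x" incentive). A single benchmark cannot give both equal points for equal work and flat PPD for a given piece of hardware.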
I will look forward to further analysis of this topic. It does seem like a conundrum.

Link to article Tear is referencing: http://www.scalalife.eu/content/gpu-only-execution-gromacs

Also, direct link to the chart in question:

[Image: gmx-4.5_DHFR_benchmark.png]


Highlights:

Limitations:
The following should be noted before using the GPU accelerated mdrun-gpu:

  • The current release runs only on modern nVidia GPU hardware with CUDA support. Make sure that the necessary CUDA drivers and libraries for your operating system are already installed.
  • Multiple GPU cards are not supported.
  • Only a fairly small subset of the GROMACS features and options are supported on the GPUs. See below for a detailed list.
  • Consumer level GPU cards are known to often have problems with faulty memory. It is recommended that a full memory check of the cards is done at least once (for example, using the memtest=full option). A partial memory check (for example, memtest=15) before and after the simulation run would help spot problems resulting from overheating of the graphics card.
  • The maximum size of the simulated systems depends on the available GPU memory, for example, a GTX280 with 1GB memory has been tested with systems of up to about 100,000 atoms.
  • In order to take full advantage of the GPU platform features, many algorithms have been implemented in a very different way than they are on the CPUs. Therefore numerical correspondence between some properties of the systems' state should not be expected. Moreover, the values will likely vary when simulations are done on different GPU hardware. However, sufficiently long trajectories should produce comparable statistical averages.
  • Frequent retrieval of system state information such as trajectory coordinates and energies can greatly influence the performance of the program due to slow CPU<->GPU memory transfer speed.
  • MD algorithms are complex, and although the Gromacs code is highly tuned for them, they often do not translate very well onto the streaming architectures. Realistic expectations about the achievable speed-up from tests with GTX280: for small protein systems in implicit solvent using all-vs-all kernels the acceleration can be as high as 20 times, but in most other setups involving cutoffs and PME the acceleration is usually only about 5 times relative to a 3GHz CPU core.
General advantages of the GPU version:

  • The algorithms used on the GPU will automatically guarantee that all interactions inside the cutoff are calculated every step, which effectively is equivalent to a slightly longer cutoff.
  • The GPU kernels excel at floating-point intensive calculations, in particular OBC implicit-solvent calculations.
  • Due to the higher nonbonded kernel performance, it is quite efficient to use longer cutoffs (which is also useful for implicit solvent)
  • The accuracy of the PME solver is slightly higher than the default Gromacs values. The kernels are quite conservative in this regard, and never resort to linear interpolation or other lower-accuracy alternatives.
  • It beats the CPU version in most cases, even when compared to using 8 cores on a cluster node (the CPU version is automatically threaded from v. 4.5).
General disadvantages of the current GPU version:

  • Parallel runs don't work yet. We're still working on this, but to make a long story short it's very challenging to achieve performance that actually beats multiple nodes using CPUs. This will be supported in a future version, though.
  • Not all Gromacs features are supported yet, such as triclinic unit cells or virtual interaction sites required for virtual hydrogens.
  • Forcefields that do not use combination rules for Lennard-Jones interactions are not supported yet.
  • You cannot yet run multi-million atom simulations on present GPU hardware.
  • File I/O is more expensive relative to the CPU version, so be careful not to write coordinates every 100 steps!
GPU Benchmarks:
Apart from interest in new technology and algorithms, the obvious reason to do simulations on GPUs is to improve performance, and you are most likely interested in speedup relative to the CPU version. This is of course our target too, but it is important to understand that the heavily accelerated/tuned assembly kernels we have developed for x86 over the last decade make this relative speedup quite a difficult challenge! Thus, rather than looking at relative speedup you should compare raw absolute performance for matching settings. Relative speedup is meaningless unless you use the same comparison baseline!
In general, the first important point for achieving good performance is to understand that GPUs are different from CPUs. While you can simply try to run your present simulation, you might get significantly better performance with slightly different settings, and if you are willing to make more fundamental trade-offs you can sometimes even get order-of-magnitude speedups.
Due to the different algorithms used, some of the parameters in the input mdp files are interpreted slightly differently for the GPU, so below we have created a set of benchmarks with settings that are as close to equivalent as possible for a CPU and a GPU. This is not to say they will be ideal for you, but by explaining some of the differences we hope to help you make an informed decision, and hopefully use the hardware in a better way.
Recommendations:
It is ultimately up to you as a user to decide what simulation setups to use, but we would like to emphasize the simply amazing implicit solvent performance provided by GPUs. If money and resources are completely unimportant you can always get quite good parallel performance from a CPU cluster too, but by adding a GPU to a workstation it is suddenly possible to reach the microsecond/day range, e.g. for protein model refinement. Similarly, GPUs excel at throughput-style computations even in clusters, including many explicit-solvent simulations (and here we always compare to using all x86 cores on the cluster nodes). We're certainly interested in your own experiences and recommendations/tips/tricks!
Untitledone's Synopsis at this point:

If the points currently assigned to this new GPU unit become the norm, the points incentive will direct us towards GPU folding. I am not against this; I will go where the science is needed. However, if these new numbers become the norm, the points disparity between GPU and SMP will be great, and all but a few of the systems running SMP will go dark. Current SMP projects that are capable of running on GPU can be ported over at the developers' discretion. It must be noted, though, that not every project can be ported to GPU, because not all functions are supported there. The bigadv work units will remain exclusive to SMP for the time being due to limitations of the GPU hardware.
 
If do
then damned;
If don't
then damned;
Whinge;
Rinse and repeat
 
Untitledone, your synopsis is spot on.

At current value for 8057, if I add a second GTX 680 to my current 3770K+680 rig it would generate 350,000+ PPD on about 400 watts.

That's 875 PPD/W.

Four Keplers (660 Ti, 670, or 680) in one system (e.g., a K9A2) should break 1000 PPD/W quite easily.
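Dividing out the ballpark efficiency figures quoted in the transcript and in this post (all numbers come from the thread, none are new measurements):

Code:
# PPD-per-watt, using the ballpark numbers quoted earlier in this thread.
configs = {
    "top GPU folding 8057 (250 W card + 100 W system)": (250_000, 350),
    "very good 4P on 8101":                             (500_000, 800),
    "3770K + two GTX 680 on 8057 (projected)":          (350_000, 400),
}
for name, (ppd, watts) in configs.items():
    print(f"{name}: {ppd / watts:4.0f} PPD/W")

which gives roughly 714, 625, and 875 PPD/W respectively, matching the figures above.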


tear and Kendrak, thanks for the transcript. :)

I've done some analysis of the stats for 8057 which appears to mirror your own findings.

Current stats for 8057:

Project 8057
Points: 2549
Preferred: 6.86 days
Final: 10.00 days
k-factor: 0.75


I used the report of the GTX580@865/1000 referenced in another thread (TPF 00:01:23, 230,716 PPD) as a baseline.

Please note: The following examples all follow published SMP benchmarking practices as provided by PG.


First, let's assume the k-factor and deadline are correct and adjust the base points accordingly.
Resulting base points are 282.5 and the GTX 580 credit would be 2496.128 for 25,983.794 PPD.


Now, let's assume both the points and k-factor are off and base them on just the deadline.
The result is a k-factor of 3 and 1130 base points.
GTX 580 credit would be 19,969.027 for 207,870.352 PPD.


Next, let's assume base points and deadline are correct and adjust the k-factor accordingly.
Resulting k-factor is 6.767 and the GTX 580 credit would be 67,652.796 for 704,241.16 PPD.


Finally, if we base the stats solely on the base points the k-factor would be 3 and final deadline becomes 22.56 days.
GTX 580 credit would be 67,657.795 for 704,293.193 PPD.


It's interesting that the difference between the current and projected PPD from the last two calculations mirrors tear's experience running SMP.



At this point I would be fairly comfortable guesstimating an SMP bench TPF of 00:32:30 for 8057, generating the following specs:

Project: 8057
Base points: 2550
Deadline: 22.57 days
Preferred deadline: 13.54 days
k-factor: 3


For the GTX 580 running TPF of 00:01:23, this yields 67,699.34 credit per WU, for 704,725.63 PPD
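For anyone who wants to check the arithmetic, the scenarios above all follow from the published quick-return-bonus formula, credit = base_points × sqrt(k × deadline / WU_time), with PPD = credit / WU_time (in days). A minimal Python sketch, assuming the usual 100 frames per WU (so the GTX 580's 00:01:23 TPF corresponds to an 8,300-second WU), reproduces the figures to within rounding:

Code:
from math import sqrt

def qrb(base_points, k, deadline_days, tpf_seconds, frames=100):
    """Quick-return-bonus credit and PPD for one WU."""
    wu_days = tpf_seconds * frames / 86400.0              # time to complete the WU
    bonus = max(1.0, sqrt(k * deadline_days / wu_days))   # QRB multiplier
    credit = base_points * bonus
    return credit, credit / wu_days                       # (credit per WU, PPD)

tpf = 83  # GTX 580 @ 865/1000, TPF 00:01:23
scenarios = {
    "1: adjust base points":      (282.5, 0.75, 10.00),
    "2: base + k from deadline":  (1130, 3, 10.00),
    "3: adjust k-factor":         (2549, 6.767, 10.00),
    "4: adjust final deadline":   (2549, 3, 22.56),
    "guesstimated 8057 spec":     (2550, 3, 22.57),
}
for label, (base, k, deadline) in scenarios.items():
    credit, ppd = qrb(base, k, deadline, tpf)
    print(f"{label:28s} credit {credit:10,.1f}   PPD {ppd:12,.1f}")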


Ultimately it appears that of the four key components, at least the base points are accurate.

On the bright side, this also means the entire thing wasn't pulled from the south end of a northbound donkey... ;)


Another interesting tidbit - GPU benchmarking has had similar problems in the past. Because of the differences in architecture, ATI/AMD has historically done better than nvidia as the WUs got bigger. Since the original GPU benchmark was based on an ATI card, every time larger projects were released nvidia PPD would drop and folders running nvidia GPUs would flock to FF to complain. This happened repeatedly until PG started benching on a Fermi.


Looks like that issue is back, and in a BIG way...
 
What's really interesting is what will happen with Xeon Phi... you get rid of all the CUDA effort and just use OpenMP, which works on both CPUs and Xeon Phi cards.
 
MPI was the previous implementation for running multiple processes on a single WU. Memory access across processes is a huge ding, which is why it was abandoned. I do not see them going back to it.
 
I really need to read this thread!
 
Lol. Inadvertent question mark. I fixed it now...
 