GTX680 CUDA Performance?

exelias0

Has anyone found a good review comparing CUDA performance to the GTX580/590? Once I get rolling on CUDA development I'd like to upgrade to something beefier than my 8800.
 
Rumors are that it will perform worse (perhaps much worse) on GPGPU stuff because it was originally intended to be mid-range and nVidia stripped out all the GPGPU to give it better gaming performance.

I haven't seen comprehensive data to back this up yet though. If you're looking for CUDA and only CUDA I would wait until the 580 prices drop substantially and then get one of those.
 
I myself am biting my nails waiting for the CUDA tests to come, especially to see the folding difference between the GTX580 and the GTX680. It's lame, but damn it, folding is addictive lol.
 
Those rumors are based on one crappy OpenCL benchmark done by someone known to be an idiot. OpenCL is a totally different animal from CUDA; one of them NVIDIA actually cares about, the other not so much. Would like to see some proper benchmarks too! It's very, very unlikely they've stripped out all the GPGPU parts, because otherwise it would have no PhysX/CUDA at all.

Would also like to see some folding@home benchmarks whenever possible. Though I don't think the fermi core would work on it yet. :(
 
The AnandTech review has more on the GPGPU side, although no CUDA tests.

I believe Anand pointed out that the 680 wouldn't be used as the basis for the Quadro professional-type cards, so it makes sense that this is not the card targeted at CUDA.

nVidia made a big deal out of CUDA performance on the 580. I wonder if it generated the sales volume they were hoping for. It's the main reason I haven't bought an AMD/ATI card in a long time, since I use a lot of Adobe products.

I'll be tempted by the 680 if CUDA does get a good bump; otherwise I'm probably going to wait for the xx110 version.
 
Has anyone found a good review comparing CUDA performance to the GTX580/590? Once I get rolling on CUDA development I'd like to upgrade to something beefier than my 8800.

We should know shortly, once one of the Blender/Cycles crowd gets hold of a GTX 680:
http://blenderartists.org/forum/showthread.php?239480-2.61-Cycles-render-benchmark

Those rumors are based on one crappy OpenCL benchmark done by someone known to be an idiot. OpenCL is a totally different animal from CUDA; one of them NVIDIA actually cares about, the other not so much. Would like to see some proper benchmarks too! It's very, very unlikely they've stripped out all the GPGPU parts, because otherwise it would have no PhysX/CUDA at all.

Would also like to see some folding@home benchmarks whenever possible. Though I don't think the fermi core would work on it yet. :(
some kind of a fan-boi, much?

A few quotes from Anand:

"What is clear at this time though is that NVIDIA is pitching GTX 680 specifically for consumer graphics while downplaying compute, which says a lot right there. Given their call for efficiency and how some of Fermi’s compute capabilities were already stripped for GF114, this does read like an attempt to further strip compute capabilities from their consumer GPUs in order to boost efficiency. Amusingly, whereas AMD seems to have moved closer to Fermi (Nvidia 580) with GCN by adding compute performance, NVIDIA seems to have moved closer to Cayman (AMD 69xx) with Kepler by taking it away."

"So NVIDIA has replaced Fermi’s complex scheduler with a far more simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling. Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN."

"Redemption at last? In our final compute benchmark the GTX 680 finally shows that it can still succeed in some compute scenarios, taking a rather impressive lead over both the 7970 and the GTX 580. At this point it’s not particularly clear why the GTX 680 does so well here and only here, but the fact that this is a compute shader program as opposed to an OpenCL program may have something to do with it. NVIDIA needs solid compute shader performance for the games that use it; OpenCL performance however can take a backseat."

"What makes this launch particularly interesting if not amusing though is how we’ve ended up here. Since Cypress and Fermi NVIDIA and AMD have effectively swapped positions. It’s now AMD who has produced a higher TDP video card that is strong in both compute and gaming, while NVIDIA has produced the lower TDP part that is similar to the Radeon HD 5870 right down to the display outputs."

Kepler is great............ at games!
 
Sorry to report, but this is the one area where the 680 GTX can't keep up with the 7970.

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/17

Its OpenCL/compute results are FAR behind those of the 7970, and its CUDA results are even worse than the 580's, but as a gaming card it's tops. You also have to remember this is GK104, or what was to be Nvidia's mid-range part. So there's still hope that GK110 will change this, but we may pay a premium for it too. I may still get the 7970 if GK110 doesn't come through, as I'll need a good balance of GPGPU performance and gaming in my upcoming build.
 
Nvidia has never paid attention to OpenCL; if they did, their implementation wouldn't be a half-assed CUDA wrapper.
 
Rumors are that it will perform worse (perhaps much worse) on GPGPU stuff because it was originally intended to be mid-range and nVidia stripped out all the GPGPU to give it better gaming performance.

I haven't seen comprehensive data to back this up yet though. If you're looking for CUDA and only CUDA I would wait until the 580 prices drop substantially and then get one of those.

First off, what rumours, and based on what data? I hate when people post things like this, because more or less it is "I am just making something up" or "I heard it from some dude on some forum." As Wikipedia loves to say: [citation needed].

Second, it doesn't work like that. The GPGPU stuff (32-bit at least) is the same shit as the gaming stuff. It is all just the shaders. That is what led to GPGPU in the first place: GPUs had increasingly become stream processors, but the setup was inefficient in both API and layout. So with DX10 things were respec'd as unified stream processors, aka shaders, that could be used by any stage. That also made them excellent GPGPU units.

It isn't as though they are magically good at running game code but bad at other code. They are what they are.
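To make that concrete, here's about the simplest CUDA kernel sketch there is (a generic SAXPY, nothing GTX 680 specific and not tied to anything in this thread): it runs on exactly the same unified shader cores that pixel and vertex work does.

Code:
// saxpy.cu -- minimal CUDA example: y = a*x + y on the GPU's shader cores.
// Build (assuming a CUDA toolkit is installed): nvcc -o saxpy saxpy.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *x, *y;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    // (skipping host init/copies for brevity; real code would cudaMemcpy data in and out)

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x);
    cudaFree(y);
    return 0;
}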

In terms of CUDA benchmarks, I don't know of any good ones. You can try CUDA-Z if you want; here are the results from my 680:

Performance Information
-----------------------
Memory Copy
Host Pinned to Device: 6231.78 MB/s
Host Pageable to Device: 4091.38 MB/s
Device to Host Pinned: 6257.27 MB/s
Device to Host Pageable: 3997.35 MB/s
Device to Device: 72118.9 MB/s
GPU Core Performance
Single-precision Float: 2.01397e+06 Mflop/s
Double-precision Float: 143681 Mflop/s
32-bit Integer: 573612 Miop/s
24-bit Integer: 572383 Miop/s

Of course, who knows how accurate it is; it does get some things wrong in terms of specs:

Core Information
----------------
Name: GeForce GTX 680
Compute Capability: 3.0
Clock Rate: 705.5 MHz
Multiprocessors: 8
Warp Size: 32
Regs Per Block: 65536
Threads Per Block: 1024
Watchdog Enabled: Yes
Threads Dimentions: 1024 x 1024 x 64
Grid Dimentions: 2147483647 x 65535 x 65535
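If anyone wants to sanity-check the memory copy numbers on their own card, here's a rough sketch of the kind of timing loop a tool like CUDA-Z presumably runs (my own simplified version, not CUDA-Z's actual code):

Code:
// bw_test.cu -- rough host-to-device bandwidth check, pinned vs pageable memory.
// Simplified sketch only. Build: nvcc -o bw_test bw_test.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static float copy_mb_per_s(void *host, void *dev, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) * reps / (ms / 1000.0f);  // MB/s
}

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB per transfer
    const int reps = 10;

    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);
    cudaMallocHost(&pinned, bytes);  // pinned (page-locked) host memory

    printf("Host Pageable to Device: %.1f MB/s\n",
           copy_mb_per_s(pageable, dev, bytes, reps));
    printf("Host Pinned to Device:   %.1f MB/s\n",
           copy_mb_per_s(pinned, dev, bytes, reps));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}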


In terms of OpenCL performance, that isn't something you want to use to compare GPUs, because AMD embraces and uses OpenCL while for nVidia it is an afterthought, since they'd rather people use CUDA (and thus not support AMD). Now, if the program you are using is OpenCL, then it is relevant; but if you are asking how things compare at the low level, it isn't so useful.

Ultimately I can't imagine it is a big enough increase to justify buying new hardware, if it is an increase at all, unless lower power consumption is what you're after. Also, for real heavy GPGPU stuff you want a Quadro or Tesla because of the memory. You can get those with 6GB on them.

In terms of use for regular desktop CUDA stuff, like Sony Vegas or the like, I imagine it'll be fine.
 
Nvidia has a high-end GPGPU market to protect... AMD doesn't... I think that pretty much explains why they did what they did with the 680 as far as compute goes.
 
Memory Copy
Host Pinned to Device: 6231.78 MB/s
Host Pageable to Device: 4091.38 MB/s
Device to Host Pinned: 6257.27 MB/s
Device to Host Pageable: 3997.35 MB/s
Device to Device: 72118.9 MB/s
GPU Core Performance
Single-precision Float: 2.01397e+06 Mflop/s
Double-precision Float: 143681 Mflop/s
32-bit Integer: 573612 Miop/s
24-bit Integer: 572383 Miop/s
Thank you for sharing, I was looking for this information.
Impressive specs on the single-precision; too bad NVIDIA disables full double-precision throughput on their GeForce line.
 
Google "GTX 680 double precision" if you want to see some thread and articles about this.

https://www.google.com/search?q=nvidia+gtx+680+double+precision

Kepler 2 for servers...
http://www.theregister.co.uk/2012/03/22/nvidia_kepler_gpu_preview/

It's not an artificial cap either. It's the way the hardware was designed for better performance per watt. I honestly think the 7970 is a better overall buy if compute is important to you. I'm still waiting for prices and stock availability before I really decide.

If CUDA is your main use, a less expensive 580 would be the better buy.
 
It's not an artificial cap either. It's the way the hardware was designed for better performance per watt.
Then why do Tesla GPUs have a higher double-precision rate when they use the exact same GPU? (speaking from the GTX580 era, that is)
The reason I say this is that a Tesla GPU's double-precision is exactly 1/2 of its single-precision, not having an artificial cap.
A GeForce GPU's double-precision is exactly 1/8 of its single-precision, hence why it is called an artificial cap.

Not saying you're wrong about the power design, I'd just like to learn more.
I read through the article, but it didn't explain much on that.
 
Then why do Tesla GPUs have a higher double-precision rate when they use the exact same GPU? (speaking from the GTX580 era, that is)
The reason I say this is that a Tesla GPU's double-precision is exactly 1/2 of its single-precision, not having an artificial cap.
A GeForce GPU's double-precision is exactly 1/8 of its single-precision, hence why it is called an artificial cap.

Not saying you're wrong about the power design, I'd just like to learn more.
I read through the article, but it didn't explain much on that.

GeForce GTX 680 Whitepaper
https://www.google.com/search?q=GeForce-GTX-680-Whitepaper-FINAL.pdf

Fermi Whitepapers
http://www.nvidia.com/content/PDF/f...DIA_Fermi_Compute_Architecture_Whitepaper.pdf


http://www.siliconmechanics.com/files/C2050Benchmarks.pdf
 
http://forums.nvidia.com/index.php?showtopic=165055
Double precision is 1/2 of single precision for Tesla 20-series, whereas double precision is 1/8th of single precision for GeForce GTX 470/480

http://www.nvidia.com/object/why-choose-tesla.html
Full double precision performance

I think we're saying the same thing at this point.
I do understand why the limit was included, though: increasing the DP rate would also mean increasing the power requirements, which most desktop users do not need, and it also reserves that capability for the Tesla line so they can sell it at a premium price.

Also see these:
http://hexus.net/tech/tech-explained/graphics/29922-tesla-vs-geforce/
http://www.nvidia.com/object/workstation-solutions-tesla.html
 
Artificial how? It was designed that way for a reason, not just to screw with us. It's more streamlined for gaming at the hardware level. Nothing wrong with that unless compute/CUDA is important to you. By using less power and not carrying unnecessary transistors, it makes less heat and its turbo boost can kick in for higher clocks at a given TDP. That simplifies it quite a bit, but I'm sure these things are well thought out in advance, whether we understand their reasoning or not.
 
I think we are on the same page for the most part but my lack of sleep is derailing my train of thought. I wouldn't mind having "Kepler 2.0" if they make a consumer version of it. I'm not going to wait for it though. I'm just looking for more mature drivers, SLI scaling and custom cooling solutions. I am assuming GPU boost may work better under more favorable cooling solutions.
 
Artificial how?
Because Tesla and GeForce are built on exactly the same GPU, though it's not really "artificial" so much as it is just disabled.
As listed in the links above, Tesla has some obvious benefits such as increased testing, ECC memory, etc.

But Tesla's DP is 1/2 that of its SP.
GeForce's DP is 1/8 that of its SP.

All of the original transistors are still there; they are just disabled and impossible to re-enable, so GeForce users are stuck with 1/4 of the DP throughput that a Tesla offers.
Granted, as you stated, it is like that for a reason, with power consumption and heat near the top of the list, and I agree.

I think we are on the same page for the most part
I think so; we're just saying it two different ways, but mean the same thing. :p
 
Two paragraphs that sound rather similar, do they not:

For GF104 (read: GF114), NVIDIA removed FP64 from only 2 of the 3 blocks of CUDA cores. As a result 1 block of 16 CUDA cores is FP64 capable, while the other 2 are not. This gives NVIDIA the advantage of being able to employ smaller CUDA cores for 32 of the 48 CUDA cores in each SM while not removing FP64 entirely. Because only 1 block of CUDA cores has FP64 capabilities and in turn executes FP64 instructions at 1/4 FP32 performance (handicapped from a native 1/2), GF104 will not be a FP64 monster. But the effective execution rate of 1/12th FP32 performance will be enough to effectively program in FP64 and debug as necessary.

The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do and are only used for FP64 math. What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32. With only 8 CUDA cores in this block it takes NVIDIA 4 cycles to execute a whole warp, but each quarter of the warp is done at full speed as opposed to ½, ¼, or any other fractional speed that previous architectures have operated at. Altogether GK104’s FP64 performance is very low at only 1/24 FP32 (1/6 * ¼).... it’s the very first time we’ve seen 1/1 FP32 execution speed.
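Running the numbers from that second quote against the CUDA-Z results posted earlier in the thread (clock figures are approximate, and boost varies card to card):

Code:
// gk104_peaks.cpp -- back-of-the-envelope peak FLOPS for GTX 680 (GK104).
// Clock figures are approximate; boost varies per card.
#include <cstdio>

int main()
{
    const double base_ghz  = 1.006;   // GTX 680 base clock
    const double boost_ghz = 1.10;    // typical observed boost, roughly
    const int    sp_cores  = 1536;    // 8 SMX * 192 FP32 cores
    const int    dp_cores  = 8 * 8;   // 8 SMX * 8 dedicated FP64 cores (per the quote above)

    // 2 FLOP per core per clock (fused multiply-add)
    printf("SP peak @ base : %.0f GFLOPS\n", sp_cores * 2 * base_ghz);   // ~3090
    printf("DP peak @ base : %.0f GFLOPS\n", dp_cores * 2 * base_ghz);   // ~129 (= SP/24)
    printf("DP peak @ boost: %.0f GFLOPS\n", dp_cores * 2 * boost_ghz);  // ~141
    return 0;
}

That ~141 GFLOPS at a typical boost clock lands right on the 143.7 GFLOPS CUDA-Z reported above, so the low DP number looks like the hardware's native 1/24 rate rather than some extra driver cap on top of it.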
 
So with 1/4 the DP, you would need to run 4 times as long, and the smaller memory would restrict the problem size. Doesn't sound too bad to me...
 
So with 1/4 the DP, you would need to run 4 times as long, and the smaller memory would restrict the problem size. Doesn't sound too bad to me...

That's why Tesla GPUs exist and are sold the way they are.
GeForce cards since the Series 8 have been neutered in DP FLOPS.

It's not bad if you are running SP, but for serious DP, Tesla should be used.
 
A lot of this thread could be misconstrued as the script of a new porn movie....or is that just me?

In any case, I know that a lot of Nvidia programmers write CUDA to get approximate solutions. These are then fed into regular cpus for the double precision calculations. For example, in any optimization algorithm for a function with a decent sized domain space, you can chop the domain into a lot of little sections and run a steepest descent in sp on the gpu. Then take those solutions and run them again on the cpu to get a higher level of precision. You still get a nice performance boost because most of the work is done in eliminating large sections of the domain as possible solutions.

A slightly more concrete example where this is used is with molecular modeling simulations, where certain force calculations need to be done in dp, but are first approximated in sp on the gpu and then quickly refined.

I guess, my point is that crippled dp is not the end of the story for gpgpu programming.
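Here's a toy sketch of that pattern, with a completely made-up objective function just to show the shape of it: a brute-force single-precision scan on the gpu picks a rough candidate, then the cpu polishes it in double precision.

Code:
// mixed_precision.cu -- toy sketch of the "coarse SP on GPU, DP polish on CPU" pattern.
// Hypothetical example: minimize f(x) = (x - pi)^2 with a float scan, then refine in double.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void eval_f(const float *x, float *fx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float d = x[i] - 3.14159265f;   // single-precision objective
        fx[i] = d * d;
    }
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> hx(n), hf(n);
    for (int i = 0; i < n; ++i)          // coarse grid over [0, 10)
        hx[i] = 10.0f * i / n;

    float *dx, *df;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&df, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    eval_f<<<(n + 255) / 256, 256>>>(dx, df, n);    // cheap SP pass on the GPU
    cudaMemcpy(hf.data(), df, n * sizeof(float), cudaMemcpyDeviceToHost);

    int best = 0;                                   // pick the best coarse candidate
    for (int i = 1; i < n; ++i)
        if (hf[i] < hf[best]) best = i;

    // DP refinement on the CPU: Newton's method on f'(x) = 2(x - pi)
    const double PI = 3.14159265358979323846;
    double x = hx[best];
    for (int it = 0; it < 5; ++it)
        x -= (2.0 * (x - PI)) / 2.0;                // Newton step: f'(x) / f''(x)

    printf("SP candidate: %f, DP refined: %.15f\n", hx[best], x);
    cudaFree(dx);
    cudaFree(df);
    return 0;
}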
 
A lot of this thread could be misconstrued as the script of a new porn movie....or is that just me?
If you want hot NVIDIA pr0n, go with Tesla, it can handle any steamy DP you throw at it. :p
 
I guess, my point is that crippled dp is not the end of the story for gpgpu programming.

That's why I'd like to see some benchmarks before drawing conclusions. In some heavy CUDA usage scenarios you'll max out the memory (the transfer speed, not the total RAM) way before you get to 50% on the core (this happens to me a lot). So DP/SP performance isn't really conclusive. The memory transfer speeds are higher than the cards I'm using now; not sure about the 500 series.
 
That's why I'd like to see some benchmarks before drawing conclusions. In some heavy CUDA usage scenarios you'll max out the memory (the transfer speed, not the total RAM) way before you get to 50% on the core (this happens to me a lot). So DP/SP performance isn't really conclusive. The memory transfer speeds are higher than the cards I'm using now; not sure about the 500 series.

What CUDA/HPC apps are you running that max out your VRAM transfer rates?
 
I would just wait for the GK110 or Radeon 89xx

Why? NVIDIA isn't going to up the performance of DP on their GeForce line anytime soon.
Instead of waiting forever, just go with one of the Tesla GPUs if you need DP performance.
 
What CUDA/HPC apps are you running that max out your VRAM transfer rates?

Realtime video editing stuff seems to do this the most. Most of the actual processing tasks are light (AA/saturation); a few use a bit more grunt (mattes), but as more and more layers of stuff/processes pile up, the VRAM transfer speed/throughput (or amount) seems to become the limiting factor while the GPU only gets to about 25%.
 
Realtime video editing stuff seems to do this the most. Most of the actual processing tasks are light (AA/saturation); a few use a bit more grunt (mattes), but as more and more layers of stuff/processes pile up, the VRAM transfer speed/throughput (or amount) seems to become the limiting factor while the GPU only gets to about 25%.

Yep, that would totally do it, especially with AA features/additions.
Thanks for letting me know. :)
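For what it's worth, when a workload is transfer-bound like that, the usual trick on the CUDA side is pinned host memory plus async copies on streams, so the copies for one chunk overlap with the kernel working on another. Whether the editing apps actually do this is up to their devs, but here's a rough sketch of the pattern (the process() kernel is just a made-up stand-in, not anything from a real app):

Code:
// overlap.cu -- sketch of hiding transfer time behind compute with pinned memory + streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 1.1f;   // placeholder for the real per-pixel work
}

int main()
{
    const int frames = 8;
    const int n = 1920 * 1080;            // one "frame" worth of floats
    const size_t bytes = n * sizeof(float);

    float *host[2], *dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost(&host[b], bytes);  // pinned memory: faster copies, allows async
        cudaMalloc(&dev[b], bytes);
        cudaStreamCreate(&stream[b]);
    }
    // (host buffers left uninitialized here; a real app would fill them with frame data)

    for (int f = 0; f < frames; ++f) {
        int b = f % 2;                    // ping-pong between two buffers/streams
        // upload, process, and download frame f, all queued on stream b;
        // while stream b copies, the other stream's kernel can be running
        cudaMemcpyAsync(dev[b], host[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        process<<<(n + 255) / 256, 256, 0, stream[b]>>>(dev[b], n);
        cudaMemcpyAsync(host[b], dev[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(host[b]);
        cudaFree(dev[b]);
        cudaStreamDestroy(stream[b]);
    }
    printf("done\n");
    return 0;
}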
 