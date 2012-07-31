Why dont they just offload all floating point operations to the GPU in APU's?

leeleatherwood

[H]ard|Gawd
Joined
Sep 6, 2011
Messages
1,582
I am by no means a CPU architect or expert, but I have a question about the architecture of APUs.

Since GPUs are orders of magnitude better at floating point operations, and GPUs are now integrated into CPUs, why not just eliminate the floating point parts of the CPU completely and offload all floating point operations to the GPU? Bulldozer already cut them in half, but it did not offload any of that work to the GPU. I have a feeling this would be the next logical step in CPU/APU evolution.

Again, I am by no means an expert or anything.
 
Tsumi

[H]F Junkie
Joined
Mar 18, 2010
Messages
13,228
To do that with current APU technology means completely rewriting software, which most software developers are unwilling to do because of the time and money it would take. The CPU and iGPU still appear as two separate entities. For offloading to happen on a large scale, AMD has to, and is going to, make the CPU and iGPU appear as a single entity that automatically routes code along the correct pathways with little or no effort on the software side. AMD's roadmap has this happening ~2-3 years from now.
 
Zarathustra[H]

Official Forum Curmudgeon
Joined
Oct 29, 2000
Messages
29,563
Could it have something to do with the instruction set?

I'd imagine that GPUs are not compatible with the x87 subset of the x86 instructions used for FPUs, and would thus have to run some sort of abstraction layer, virtual machine or translator between the two in order to function properly.

It's probably not out of the question to equip GPUs with x87 instruction set compatibility, but it would be a brand new design, and probably a step backwards in efficiency from current GPU designs.

It's an interesting concept though. I'd be curious to hear what people on here who know more about this type of thing than I do have to say about it.
 
pelo

2[H]4U
Joined
Apr 23, 2011
Messages
2,911
leeleatherwood said:
Since GPUs are orders of magnitude better at floating point operations

Not all operations. But theoretical throughput is much better, yeah.

leeleatherwood said:
and they now have GPUs integrated into CPUs, why not just eliminate the floating point parts of the CPU completely and offload all floating point operations to the GPU? Bulldozer already cut them in half, but they did not offload any of it to the GPU. I have a feeling this would be the next logical step in CPU/APU evolution.

You'd need to recompile, because GPUs don't speak x86 (or x87 :D), Larrabee 1.0 and 2.0 excepted. You need a programming language that can span both sides of the APU, like OpenCL. The two halves of the APU must also share resources on the chip itself, to cut out the detour through RAM, or be able to access the same storage (cache or DDR). This is what's called HSA, or Heterogeneous System Architecture. Its purpose is to allow the CPU and GPU on the same chip to essentially talk to each other in the same underlying language (OpenCL) and work together in executing tasks (with shared resources).

AMD is taking large leaps in this respect with Steamroller/Kaveri, though Trinity already shows good OpenCL performance.

Hope that helps ;)

edit - the purpose of this is to take advantage of an ever-growing GPU. We've already seen that software is ages behind the hardware, and the software optimizations just aren't coming along at the rate Intel/AMD had hoped. We've also grown more dependent on the visual aspects of computing, whether gaming or sparkly UIs and increasing resolutions, so a bigger on-die GPU just fits better for most consumers. The rise of GPU computing in HPC has also spurred this along. So you might as well start using that big GPU to help out with some of the computational tasks that people see on a daily basis, rather than have it just draw triangles and sit idle 95% of the time taking up space. Intel hasn't shown what they'll do yet, but my guess is that if MIC takes off then we'll see x86 GPU cores inside Intel's chips.
 
leeleatherwood

[H]ard|Gawd
Joined
Sep 6, 2011
Messages
1,582
Very nice, so AMD already has this in the works. I thought so; it just makes more sense. Hopefully in those 2-3 years AMD can drastically improve its IPC to be comparable to Intel's. I believe they would have one hell of a monster on their hands at that point.
 
Arcygenical

Will Watercool for Crack
Joined
Jun 10, 2005
Messages
24,912
I don't doubt that this may be a future strategy... But FP calculations on the GPU are handled by an API for the routing of data... We need to get beneath this, necessitating the re-writing of software, or low level logic (and quite possibly both).
 
Zarathustra[H]

Official Forum Curmudgeon
Joined
Oct 29, 2000
Messages
29,563
Arcygenical said:
I don't doubt that this may be a future strategy... But FP calculations on the GPU are handled by an API for the routing of data... We need to get beneath this, necessitating the re-writing of software, or low level logic (and quite possibly both).

If we are going to do that, we might as well retire x86 and move towards a new - more efficient - instruction set altogether.

I'd rather rip the bandaid off all at once than bit by bit (or like Apple did it, by ripping it off, putting it back on again, and then ripping it off again several times in a few years' time) :p

They certainly made their developers good at porting by going Motorola 68k -> PowerPC -> OSX -> x86 in not THAT long a time period :p
 
pelo

2[H]4U
Joined
Apr 23, 2011
Messages
2,911
One word: legacy.

Outside of the desktop/server that doesn't matter much, seeing as how everything is essentially ARM anyway.
 
Zarathustra[H]

Official Forum Curmudgeon
Joined
Oct 29, 2000
Messages
29,563
I also wonder if this would be a problem for gaming.

3D game engines are typically pretty FPU-intensive, if I am not mistaken. So if you were playing a game that was also rendering on the GPU portion, and that portion were shared between rendering and the CPU's floating point work, you'd think the shared FPU would quickly become a bottleneck, no?

That being said, even if you can't cut down on the number of transistors (and thus the chip space used, power consumption and heat generation), there are still benefits to be had from sharing the FPU function between the GPU and the CPU. If the units are shared, then even if the combined total is the same size as before, the CPU gets a lot more FPU capability than before whenever the GPU is not rendering, and vice versa. It would be more efficient that way, even if you couldn't take full advantage of it when both need the resource.
 
Digital Viper-X-

[H]F Junkie
Joined
Dec 9, 2000
Messages
13,821
Zarathustra[H] said:
If we are going to do that, we might as well retire x86 and move towards a new - more efficient - instruction set altogether.

I'd rather rip the bandaid off all at once than bit by bit (or like Apple did it, by ripping it off, putting it back on again, and then ripping it off again several times in a few years' time) :p

They certainly made their developers good at porting by going Motorola 68k -> PowerPC -> OSX -> x86 in not THAT long a time period :p

OSX is not an instruction set :)
 
pelo

2[H]4U
Joined
Apr 23, 2011
Messages
2,911
Zarathustra[H] said:
I also wonder if this would be a problem for gaming.

3D game engines are typically pretty FPU-intensive, if I am not mistaken. So if you were playing a game that was also rendering on the GPU portion, and that portion were shared between rendering and the CPU's floating point work, you'd think the shared FPU would quickly become a bottleneck, no?

You'd have access to the entire on-die GPU, though.

I guess what you're asking is: what about the extra FP operations that don't go to GPUs and are instead handled by the CPU's FPUs in your average game?

Nothing :p You can tack ISAs onto the "GPU cores" of the CPU as well. In fact, we already do this with GPUs. Those operations will just be handled by X "GPU core." There's a transitional period, like we're seeing with Trinity and will see with Kaveri, where the chip has traditional FPUs + GPU cores/shader/vector units, but the goal is seamless integration. Bear in mind you can also throw in a GPU to help with the computation.
 
David_CAN

Limp Gawd
Joined
Aug 15, 2011
Messages
186
There are two flaws with this plan.

The first is that the GPU is a very long way away from the CPU in computing terms. If I want to perform a floating point calculation, say adding two vectors together to see if a bullet has hit a character, I don't want to send that data all the way to the GPU and then wait for the result to be sent back. In computing terms that might as well be a lifetime. Even the latency between the CPU and RAM is so high that the CPU pipeline stalls if it has to wait for something that wasn't already in the CPU's on-die cache. Now, in an APU where the CPU and GPU share the same cache you might get past that problem, but you'd still be wasting time communicating between two cores just to perform a simple calculation.

And that gets into the second flaw: "GPUs are orders of magnitude better at floating point operations" is both correct and incorrect. If I want to perform the exact same floating point operation on 100 numbers, then the GPU is faster than the CPU. If I want to perform a specific calculation (like the bullet detection) and then act on it, then the CPU is faster. CPUs are very fast at general purpose calculations, and that includes floating point math. For the moment GPUs are slower than CPUs by a big margin on any single operation, but they make up for it by executing the same instruction across many pieces of data - it takes twice as long before it is done, but produces 4 times the final output. Some things work well for this, many others don't.
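
A rough way to picture both flaws is a toy cost model. Every constant below is an assumption picked purely for illustration (a fixed hand-off cost to reach the GPU and a per-element cost for each device); none of them is a measurement of a real chip.

```python
# Toy cost model: every constant here is a made-up assumption, not a measurement.
CPU_NS_PER_ELEMENT = 1.0    # hypothetical: the CPU finishes one FP op per ns
GPU_NS_PER_ELEMENT = 0.05   # hypothetical: the GPU is 20x faster per element in bulk
GPU_HANDOFF_NS = 5000.0     # hypothetical: fixed cost to ship work over and back


def cpu_time_ns(n):
    """Time for the CPU to process n elements itself."""
    return n * CPU_NS_PER_ELEMENT


def gpu_time_ns(n):
    """Time for the GPU path: fixed hand-off cost plus bulk processing."""
    return GPU_HANDOFF_NS + n * GPU_NS_PER_ELEMENT


def break_even_elements():
    """Smallest n at which the GPU path is no slower than the CPU path."""
    n = 1
    while gpu_time_ns(n) > cpu_time_ns(n):
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 100, 10_000, 1_000_000):
        winner = "GPU" if gpu_time_ns(n) < cpu_time_ns(n) else "CPU"
        print(f"{n:>9} elements: CPU {cpu_time_ns(n):.0f} ns, "
              f"GPU {gpu_time_ns(n):.0f} ns -> {winner}")
```

With these invented numbers a single bullet check loses badly on the GPU path, while a large batch wins. The same arithmetic covers the figures above: twice the time for four times the output is 4 / 2 = 2x the throughput, which only pays off if there is enough identical work to fill the batch.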
 
pelo

2[H]4U
Joined
Apr 23, 2011
Messages
2,911
David, the goal isn't to replace the FPUs but rather seamless integration between the GPU cores, FPUs and ALUs on a single die.

If it's a "dumb" FP task, the GPU will get it.

If it's a "complex" FP task, then it's the FPU that comes to bat.

It's essentially how GPGPU is used in HPC, but instead of an off-die GPU you'd be leveraging the on-die GPU.
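
As a purely illustrative sketch of that split: the `Task` fields, the threshold and the `pick_unit` function below are all hypothetical, and no real scheduler, driver or HSA API is being shown.

```python
from dataclasses import dataclass

# Hypothetical threshold: below this many independent elements, hand-off
# overhead is assumed to outweigh any GPU throughput advantage.
MIN_ELEMENTS_FOR_GPU = 4096


@dataclass
class Task:
    name: str
    elements: int             # how many independent data items the task touches
    branches_on_result: bool  # does the program need the answer right away to decide what to do?


def pick_unit(task):
    """Return "GPU" for wide, uniform work and "FPU" for everything else."""
    if task.branches_on_result:
        return "FPU"  # latency-sensitive: keep it on the CPU core
    if task.elements >= MIN_ELEMENTS_FOR_GPU:
        return "GPU"  # "dumb" bulk math: same operation over lots of data
    return "FPU"      # too small to be worth shipping anywhere


if __name__ == "__main__":
    for t in (Task("invert image", 1_000_000, False),
              Task("bullet hit test", 3, True)):
        print(f"{t.name}: {pick_unit(t)}")
```

In a seamlessly integrated design this decision would presumably be made by hardware or the runtime rather than by application code; the sketch only shows the kind of rule being described.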
 
Zarathustra[H]

Official Forum Curmudgeon
Joined
Oct 29, 2000
Messages
29,563
pelo said:
David, the goal isn't to replace the FPUs but rather seamless integration between the GPU cores, FPUs and ALUs on a single die.

If it's a "dumb" FP task, the GPU will get it.

If it's a "complex" FP task, then it's the FPU that comes to bat.

It's essentially how GPGPU is used in HPC, but instead of an off-die GPU you'd be leveraging the on-die GPU.

Exactly.

Right now, APUs are simply a CPU and a GPU on the same die.

The idea is that they get integrated further than this, and are seen by the OS as the same device.

Beyond this I can't speak to how that works, as I am not a chip guy, but presumably there are quite a few efficiency improvements to be had by doing something like this.
 
Cheetoz

[H]ard|Gawd
Joined
Mar 3, 2003
Messages
1,972
Why would a program need to know how a multiplication between two FP values is done in hardware? It just calls, say, MULPS, and the result gets stored where it's told to.

You would need to connect the GPU to the instruction and data buses, and have the instruction decoder use the same opcode to enable the GPU instead of an ALU.
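
For reference, here is a minimal software model of what MULPS computes from the program's point of view: a lane-by-lane multiply of four packed single-precision values. The helper names are made up for this sketch; the real instruction works on 128-bit SSE registers, not Python lists.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mulps(a, b):
    """Model of SSE MULPS: multiply four packed single-precision lanes."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("MULPS operates on exactly four 32-bit lanes")
    # The product of two float32 values is exact in a double, so rounding the
    # result once gives the correctly rounded single-precision product.
    return [to_float32(to_float32(x) * to_float32(y)) for x, y in zip(a, b)]


if __name__ == "__main__":
    print(mulps([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]))
```

The program sees only inputs and outputs like these; which piece of silicon fills the four lanes is the part being debated.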
 
defaultluser

[H]F Junkie
Joined
Jan 14, 2006
Messages
13,131
Cheetoz said:
You would need to connect the GPU to the instruction and data buses, and have the instruction decoder use the same opcode to enable the GPU instead of an ALU.

And that's the big problem really: finding a way to marry two disparate types of schedulers and memory architectures without killing the performance of both.

The CPU is heavily optimized for high ILP and has a low-latency memory architecture. The GPU is optimized for high DLP via massive SIMD units, and has a high-latency, high-throughput memory architecture. It's much easier to keep them separate but equal.

Cheetoz said:
Why would a program need to know how a multiplication between two FP values is done in hardware? It just calls, say, MULPS, and the result gets stored where it's told to.

Because most GPUs would be HEAVILY underutilized if you just performed 4 parallel multiplications per clock. To get the most out of an integrated GPU without adding tons of latency waiting for the results, you would need to implement a GPU scheduler in hardware, and that's no easy task. Right now companies can get away with implementing this complex beast in software because a few ms of latency for rendering is no big deal, but it's an eternity in CPU time (a software scheduler would be the equivalent of running an interrupt every time you want FP operations).

Or, you can just add new instructions like they did to make SSE/3DNow! work. But that will forever force your GPU architectures to maintain hardware backward compatibility, and that's a restriction the industry has always avoided (ditching compatibility in favor of maximum performance).
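
To put a rough number on the underutilization point above, here is a back-of-the-envelope sketch. The 384-lane figure is an assumption chosen only for illustration, not a spec quoted anywhere in this thread.

```python
def lane_utilization(lanes_used_per_clock, total_lanes):
    """Fraction of the GPU's arithmetic lanes doing useful work each clock."""
    if total_lanes <= 0:
        raise ValueError("total_lanes must be positive")
    return min(lanes_used_per_clock, total_lanes) / total_lanes


# Hypothetical integrated GPU with 384 FP lanes (an illustrative assumption).
HYPOTHETICAL_GPU_LANES = 384

if __name__ == "__main__":
    u = lane_utilization(4, HYPOTHETICAL_GPU_LANES)
    print(f"4 of {HYPOTHETICAL_GPU_LANES} lanes busy: {u:.1%} utilization")
```

Feeding such a part one 4-wide multiply at a time would leave roughly 99% of it idle, which is why the scheduling problem matters more than the opcode.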
 
Pieter3dnow

Supreme [H]ardness
Joined
Jul 29, 2009
Messages
6,789
OpenCL :) it is already here :).

Doing it at the operating system level will need more work, but basically you should be able to write any code you want and abuse the video card for it (DirectCompute under DX).

But if you are expecting any OS to support this directly soon, you will be disappointed; these things take time.
Check how many programs are optimized for video card usage: just a few, and not for that many years.
 
YeuEmMaiMai

Death Incarnate
Joined
Jun 11, 2004
Messages
17,149
I'm gonna make this really simple to understand.

If your graphics card is busy doing other things like CPU math, it has less time to actually do its main job, and that's to process graphics.

The only time you won't see a performance drop is when the GFX card has time to spare because the CPU is weak and isn't feeding it geometry data fast enough.

Since APUs are already memory constrained as it is, you are NOT going to see any improvement by offloading math to them.

Take a look at the AMD E series: you get more performance simply by overclocking the memory bus when you install faster RAM (1666MHz for example).
 
Tsumi

[H]F Junkie
Joined
Mar 18, 2010
Messages
13,228
You're missing the point. While that is true for a system that only has an APU installed, what about a system that has an APU in addition to discrete GPUs?

Also, a very important aspect is balancing CPU and iGPU power. Turboboost helps achieve this: if the iGPU core is idling, the CPU is boosted to higher frequencies, and vice versa.
 
JMccovery

2[H]4U
Joined
Feb 9, 2006
Messages
2,211
Tsumi said:
You're missing the point. While that is true for a system that only has an APU installed, what about a system that has an APU in addition to discrete GPUs?

Thinking about this, if Nvidia were to fully port PhysX to OpenCL, an APU could help somewhat with those calculations. Even in full OpenCL applications, the APU can provide a pretty good performance boost.

I wouldn't mind a FX-level 6-12 core APU with 7750-level graphics (even if it has a TDP of 150W).
 
drescherjm

[H]F Junkie
Joined
Nov 19, 2008
Messages
14,677
JMccovery said:
I wouldn't mind a FX-level 6-12 core APU with 7750-level graphics (even if it has a TDP of 150W).

I would like to see the TDP extended to ~200W for a 10-core APU at > 4 GHz stock (I'm talking about all cores, not only turbo). If that can't be air cooled, include a water cooling unit as standard. This would be excellent for the type of work I do (medical imaging research). That is, if they can sell the chip for ~$400. I know I am probably dreaming here...
 
InternationalHat

[H]ard|Gawd
Joined
Aug 13, 2004
Messages
1,481
Tsumi said:
You're missing the point. While that is true for a system that only has an APU installed, what about a system that has an APU in addition to discrete GPUs?

Also, a very important aspect is balancing CPU and iGPU power. Turboboost helps achieve this: if the iGPU core is idling, the CPU is boosted to higher frequencies, and vice versa.

Yeah, this would be killer.

Also
http://en.wikipedia.org/wiki/WebCL
Client-side JS with WebCL bindings. This, combined with web workers, client-side MVC/MVVM and websockets, will allow you to perform heavy FP operations on the client side without the typical slowness of JavaScript and without locking the page. Basically, imagine better games and real-time trading platforms with analytics and/or better business intelligence software, all in the browser space. whaaaaaaaaat.

If AMD can get more buy-in and better integration, this shit's crazy.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
Since you cannot build a GPU out of a Bulldozer-type FlexFPU, AMD is forever destined to go without a flexible-pipeline GPU.

I saw a graphics demo of Larrabee showing a landscape with mirrors facing each other (at head height and 2 meters apart, like the scene in the movie "Inception" on the bridge over the Seine in Paris), and they said a non-x86 GPU (never mind RISC architectures) will never be able to simulate these "ad infinitum" effects, to give one example. Perhaps Intel architectures will be suited to creative work and AMD architectures to entertainment markets? I am an AMD fan, but as a creative designer I always end up buying Intel, so that would frustrate me.
 
Spazturtle

[H]ard|Gawd
Joined
Jan 4, 2013
Messages
1,526
What you have suggested is the end goal of HSA.

AMD wants GPUs and CPUs to merge; the latest APUs are just the first step.

Intel will start making APUs with HSA at some point. They are investing a lot in their Iris GPUs at the moment, and the next logical step is APUs.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
@Spazturtle: I think AMD APUs can never be made to do x86 graphics.
If a Larrabee-style x86 GPU, consisting of a MIC array of size-reduced integer cores with shortened integer pipelines (4 to 8 stages long), replaced the GPU cores in Intel's architecture, then a GPU that does floating point math does not do x86, because it does not consist of integer cores. In other words:
Future AMD APUs will be more area-efficient but will only do floating point/tessellation graphics (for entertainment purposes only?), while Larrabee-style Intel GPUs will do MIC/x86 graphics with reduced GFLOPS performance. So the new AMD architecture will be unable to do x86 graphics, but for classic CPU tasks (classic x86 integer work) it will offer vastly improved performance per chip area, with additional integer cores where the FPUs once took up space. So AMD will be the more powerful classical APU, and Intel, while having less raw performance, will be completely x86, even in the GPU.
That's why I said AMD for "consumers/gamers" and Intel for "creatives/work".

The danger (which I perhaps needlessly fear) is that a day will come when even gaming GPUs are required to have x86. In that scenario, perhaps the moment AMD saw it coming was when AMD made that statement about no longer competing with Intel and going mobile/ARM/RISC. So that would be Intel's endgame: to become the monopolist on x86 graphics, with AMD unable to do anything about it. (I hate to say it.)
 
haz_mat

Limp Gawd
Joined
Sep 14, 2012
Messages
324
To try and address the OP in another way: not all floating point math is the same. GPUs excel at vector math, meaning that they can take a single instruction and execute that operation on multiple (read: many, many) elements. A rough example might be telling the GPU to invert the colors of a pixel, but then pointing it at a large array of pixels to work on all at once.

CPUs have some of this functionality, but excel at doing a smaller amount of work in less time. This allows CPUs to churn through single-threaded workloads very quickly, where the result of a single math operation might decide what the entire program does next.

The APUs of the future must combine the strengths of both architectures. No one wants their GPU to choke on a very linear control thread, nor do we want the CPU to try and do a real-time render in software.
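
A small sketch of the two kinds of work described above. It runs on the CPU either way; the function names are made up here, and the point is only the shape of the work, not where it executes.

```python
def invert_pixel(rgb):
    """Invert one 8-bit RGB pixel."""
    r, g, b = rgb
    return (255 - r, 255 - g, 255 - b)


def invert_image(pixels):
    """GPU-style work: the same operation on every pixel, none depending on another."""
    return [invert_pixel(p) for p in pixels]


def running_decision(values, limit):
    """CPU-style work: each step depends on the previous result and may stop early."""
    total = 0.0
    for steps, v in enumerate(values, start=1):
        total += v
        if total > limit:  # the next action hinges on this one result
            return steps
    return len(values)


if __name__ == "__main__":
    print(invert_image([(0, 0, 0), (255, 128, 1)]))
    print(running_decision([1.0, 2.0, 3.0, 4.0], 5.0))
```

Every call inside `invert_image` could run at the same time on separate lanes, while `running_decision` cannot start step N until step N-1 has finished and been checked.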
 
OmgitsSexyChase

Limp Gawd
Joined
May 1, 2013
Messages
194
Technology progression, my friend. If they did that now, they would have nothing left to blow your mind with in 5 years.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
OmgitsSexyChase

I think you rather mean that we still cannot calculate perfection and must evolve it instead, but that is OK as long as the process (that we are) is sustainable.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
As Xeon Phi (the Larrabee successor) proves, it is possible to use an x86 GPU as a 32-bit x86 CPU, because it is an array of x86 cores with short integer pipelines (including die-space-consuming FPUs). So Intel could make a unified (homogeneous) 32-bit x86 core-array architecture capable of both graphics and central processing, while AMD will make a more traditional (heterogeneous) split between 64-bit integer units and GPU/floating point units. But Intel will at first keep 64-bit integer cores alongside the x86 GPU, to counter AMD at the wide-integer-register ("bitness") end of central processing. So the Intel Larrabee iGPU with a programmable flexible pipeline will probably be only a half-step in that direction.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
Another point is that future graphics improvements will have less and less effect on looks, so x86 programmability of the GPU is a new feature with new possibilities, and it has more effect than more of the same. x86 GPUs aren't as limited as "CUDA" either. As I already mentioned, I guess the only way for AMD to beat Intel's homogeneous system architecture is the broader integer register sizes (bitness) of HSA (= Heterogeneous System Architecture), because homogeneous system architectures are still 32-bit.
 
luminousone

Weaksauce
Joined
Feb 13, 2011
Messages
124
Larrabee is garbage. There might be a few use cases where it makes sense purely due to its full x86 cores, but they are limited (such as primecoin mining *giggles*).

The problem comes down to how microthreading works, and Larrabee is utter bollocks in this area.

Software has to be written around wide vector operations (512 bits wide), heavy use of fused instructions, and instruction pairing rules that will lead to a large number of pipeline misses.

Now add in context switching times, which are really bad on x86_64 + AVX2, and made worse on a memory interface shared by 80 or more other cores. They are made even worse by the nature of the work this thing is intended for.

Common code compiled for one of these things might get 20% of peak performance, heavily hand-coded work might get upwards of 60% in real-world software, and heavily hand-coded benchmarks might pass 80%.

This loses one of the great benefits that GPU cores have (and that HSA improves on further): on the AMD/Nvidia platforms the code can be pretty ignorant and still be fast, as the GPU will simply schedule masses of microthreads right next to each other and keep the pipeline nearly full every cycle, with very, very few stalls.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
@Jimmyb: As I am not a circuit designer but a product designer, would you please correct my statements instead of just dismissing them globally as "very bad information"? Otherwise you are just being lazy (and perhaps insulting). @luminousone, you were much more helpful, thanks.
 
pxc

Extremely [H]
Joined
Oct 22, 2000
Messages
33,064
It's certainly possible that future x86 ISA extensions will be able to integrate arrays of ALUs which are already on most cores nowadays. There are just several problems standing in the way...

Would the operation be any more efficient being routed to the GPU? A stream of dependent instructions works well in the CPU paradigm. There are execution ports where micro-ops can be scheduled and queued, often out of order, and untangled afterwards. That is very much unlike how efficiency is extracted from GPUs, whether on die or off. For example, if memory access depends on the FP result to have valid data, or flow control depends on the outcome, the overhead of configuring the GPU to perform calculations and receiving results back can be large (full cache coherency could help if bundles were sent on locked cache lines, but that limits data size and can have other effects on performance). Making a higher-level set of instructions perform those operations asynchronously on the GPU is something that's already available via different GPGPU APIs, which has additional benefits.

Also, the results of GPU operations aren't necessarily as precise as those of CPU SIMD and FPU operations. The behavior on particular errors is also much different between GPUs and CPUs. IOW, the ALUs in a GPU aren't direct replacements for the CPU's SIMD unit(s) and FPU. It's not insurmountable, but it requires a somewhat different programming style.
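
One general reason results can differ, offered as an illustration rather than as the specific hardware behavior meant above: floating point addition is not associative, so a parallel reduction that groups terms differently from a sequential loop can round differently. The two helpers below are hypothetical names for this sketch.

```python
def sequential_sum(values):
    """Add left to right, the way a simple scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add in a tree, the way a parallel reduction typically groups terms."""
    if len(values) == 0:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3]
    print(sequential_sum(data))  # (0.1 + 0.2) + 0.3
    print(pairwise_sum(data))    # 0.1 + (0.2 + 0.3)
```

Both answers are valid roundings of the same mathematical sum, yet they are not bit-identical, so code that compares results exactly or depends on a fixed evaluation order needs care when work is spread across many lanes.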

It's hard to imagine how huge arrays of ALUs would be effectively integrated into a CPU at a low ISA level in any way that would let them be general/efficient across architectures. GPUs thrive when given lots of non-dependent work, with many threads at once to hide memory latency, and (modern x86) CPUs thrive on extracting parallelism from general code, but still have relatively low IPC.

Xeon Phi (Larrabee family incl. actual products Knights Ferry/Landing/etc) uses multiple wide SIMD units per core (each 512b wide with 32 vector registers), so it skips many of the problems an uncore GPU faces. It's similar to AVX/other SIMD programming. It seems to be doing quite well in HPC and rendering farms, unsurprisingly exactly the markets it targets. :p

It would surprise me more to see GPUs getting a low level x86 extension on CPUs, than to see mainstream x86 CPUs getting multiple parallel SIMD units (possibly even based on a stripped down GPU SM unit) a la Xeon Phi. Why? Because GPU code, in order to run efficiently on multiple generation/architectures, is abstracted and handling that at the API level is better for the types of workloads where it excels. Note that we don't have an x86 3D rendering instruction extension. (Sometimes bad convergence is bad.)

tl;dr
It's more likely that CPUs will get more/wider SIMD units than iGPUs getting a low level x86 extension.
 
Pieter3dnow

Supreme [H]ardness
Joined
Jul 29, 2009
Messages
6,789
I would say that at kernel level you can do a lot of things, but since we're stuck with Windows you cannot hope to get anything more than what we're seeing now, where HSA features have to be explicitly programmed for.

In the end the software has to make the difference either way. Like pxc said, some workloads do not make sense on the GPU.
 
Bennewitz

n00b
Joined
Feb 18, 2014
Messages
9
Skylake is almost here and I cannot see a revolutionary x86-programmable flexpipe GPU implemented in it. You guys misinformed me about the Skylake GPU, which led me to warn against a "non-x86-programmable floating point GPU APU" as no longer being competitive. x86 GPUs could still be 10 years out, how would I know.
The lower table here just states that it is a 74-execution-unit Iris GT4e iGPU (1 teraflop), which sounds very familiar/common/not special, and I am sorry, but there is not a single word about x86 graphics.
 
Lunas

[H]F Junkie
Joined
Jul 22, 2001
Messages
9,866
I thought this was the point of HSA and everything that AMD introduced with Kaveri, and essentially AMD's plan from the very start. They don't concentrate on making a faster FMAC, and instead focus on multithreaded performance.

The answer to the original question is that software is not programmed to utilize it, and changing things takes time...
 