Why don't they just offload all floating point operations to the GPU in APUs?

GPUs are in no way "orders of magnitude more efficient" at FP math than CPUs; they just have orders of magnitude more cores.

TL;DR: The FPU designs in a GPU and a CPU are two entirely different beasts: a CPU needs a small number of problems solved in a short time, while a GPU needs a large number of independent problems solved, with less of a time constraint on each individual problem.

I agree with almost all of your post except the first part (and only when framed in a different light than perhaps the direction you're taking it): FLOPS/watt, one very useful measure of efficiency, is generally higher on a GPGPU than on a CPU because of the simplified architecture, seeing as GPGPUs are designed expressly around SIMD. Likewise, FLOPS per cm^2 of die area favors the GPGPU for the same reason--the hardware is far less flexible.

Amdahl's Law strikes again when you look at the computational needs of a consumer-centric processor: floating point operations tend to be either sequential in nature, or media/graphics-based and massively parallel, which is where we already sit (by and large). The places where we'd likely see the greatest acceleration from a bunch of slightly-more-flexible-yet-still-massively-parallel FPUs would be content creation software (think Adobe's suites) and spreadsheet/database applications. But your scheduler and/or compilers are going to get a LOT more complex to do it.
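To put rough numbers on the Amdahl's Law point, here is a minimal sketch (in C++, with made-up workload fractions rather than measurements) of how quickly the serial portion caps the benefit of piling on parallel FPUs:

Code:
#include <cstdio>

// Amdahl's Law: overall speedup when a fraction p of the work can be
// spread across n parallel units and the remainder stays serial.
double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double units = 512;  // hypothetical count of parallel FPUs
    printf("90%% parallel: %.2fx\n", amdahl(0.90, units));  // ~9.8x
    printf("50%% parallel: %.2fx\n", amdahl(0.50, units));  // ~2.0x
    printf("10%% parallel: %.2fx\n", amdahl(0.10, units));  // ~1.1x
    return 0;
}

Even with hundreds of units, a workload that is only half parallelizable tops out around 2x, which is why the extra scheduler and compiler complexity has to pay for itself.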

A poster mentioned we're more likely to see more SIMD optimizations in our CPUs than GP optimizations in our GPUs--I agree, and even more so, we might start seeing more big.LITTLE type solutions if the schedulers can handle it efficiently.
 

In addition to FLOPs/watt, don't forget about FLOPs/transistor and FLOPs/mm^2 as well. A GPU core is much smaller than a CPU core.
 
Yeah, I think you missed the sentence where I put it in terms of flops/cm^2 (which I'm 99% sure ISN'T the unit used, but whatever), but worth reiterating. :)
 
http://www.techpowerup.com/img/15-08-18/77a.jpg

Skylake does not seem to have an x86 MIC GPU.
I must have missed the rumor that Skylake was getting x86-based EUs. I would have dismissed it anyways, since it wouldn't make much sense for something that's primarily a GPU for games. ;)

Xeon Phi uses x86 MIC (Airmont Atom with wide SIMD units added), and if there's MIC convergence with integrated graphics it's more likely to happen at the 10nm node. Other than an ARM company Intel bought a while ago (ZiiLABS) and the recent Altera purchase, the only MIC architecture Intel makes and sells is Xeon Phi.
 
Because there's a ton of latency incurred by dispatching work to the GPU: the FPUs sit right next to the cores on the die, while a discrete GPU is completely off the CPU and somewhere else on the board.

You can still do it, though - anyone who wants to do massively parallel floating point calculations can already offload them to the GPU using CUDA/OpenCL kernels or compute shaders.
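For anyone who hasn't seen what that looks like in practice, here is a minimal CUDA sketch (hypothetical sizes, error checking omitted) of offloading a big batch of independent floating point adds - the kind of workload where the dispatch cost is amortized over millions of elements:

Code:
#include <cuda_runtime.h>
#include <vector>

// Each GPU thread handles one element: millions of independent FP adds.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 24;  // ~16M elements, chosen arbitrarily
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The copies and the launch below are the "dispatch cost"; they are only
    // worth paying because the kernel does a huge amount of independent work.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}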
 
There are two flaws with this plan.

The first is that the GPU is a very long way away from the CPU in computing terms. If I want to perform a small floating point calculation - say, add up a vector to see if a bullet has hit a character - I don't want to send that data all the way to the GPU and then wait for the result to be sent back; in computing terms that might as well be a lifetime. Even the latency between the CPU and RAM is high enough that the CPU pipeline stalls if it has to wait for something that wasn't already in the CPU's on-die cache. In an APU where the CPU and GPU share the same cache you might get past that problem, but you'd still be wasting time communicating between two cores just to perform a simple calculation.
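As a rough sketch of the kind of calculation in question (the types and the latency figures in the comments are illustrative assumptions, not measurements): the whole test is a handful of FPU instructions, finished long before a GPU round trip could even begin.

Code:
#include <cstdio>

struct Vec3 { float x, y, z; };

// Bullet-vs-character proximity test: a few subtracts, multiplies and a
// compare - nanoseconds of work on the CPU's own FPU/SIMD units.
bool bulletHit(const Vec3& bullet, const Vec3& target, float radius) {
    float dx = bullet.x - target.x;
    float dy = bullet.y - target.y;
    float dz = bullet.z - target.z;
    // Compare squared distance against squared radius to avoid a sqrt.
    return dx * dx + dy * dy + dz * dz <= radius * radius;
}

int main() {
    Vec3 bullet{1.0f, 2.0f, 0.5f}, enemy{1.2f, 2.1f, 0.5f};
    // Round-tripping this one result through a discrete GPU (queue the work,
    // cross the bus, run the kernel, copy the answer back) is typically on
    // the order of microseconds - thousands of times longer than doing it here.
    printf("hit: %d\n", bulletHit(bullet, enemy, 0.5f));
    return 0;
}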

How did we manage to do this back in the 1980s and 1990s, when FPUs were a separate chip from the CPU?
For example, the 80386 and 68030 had separate FPUs, the 80387 and 68881/68882 respectively.

We were able to use those back then, so why is this suddenly "a lifetime of waiting" for processing now?
Heck, even up until 2003, with the introduction of the AMD Athlon 64, the memory controller sat on a separate chip (normally the northbridge in the 1990s and 2000s), and with Intel it wasn't integrated until 2008.

Granted, latency would definitely be a factor, but there are ways around that, obviously, as we have had to work with this technology for decades now.


EDIT: Just realized how old this thread is and that his post was from 2012; still, the info above holds true.
 

For the original 8088/8087, the processors were actually independent. They each read in the instructions as they were fetched, and each would execute only its own instructions. Of course, this got complicated when an 8087 instruction needed data fetched from a memory address.

The processors were "independent," but between them they only handled a single instruction at a time. As far as I can tell, they waited for each instruction to be retired before a new one could be processed. Also, thanks to the 8-bit bus and dog-slow memory, the pair of processors was usually waiting on I/O rather than on each other.

80386: It worked because the processors were not pipelined, and had no on-chip cache. This meant that most operations were waiting on main memory, which was much slower than inter-processor communications. The external cache and main memory were available over the same shared bus.

The move to the 486 (pipelined, on-chip L1 cache) meant they had to put the FPU on-die, or let it slow down the rest of the system! And this still stands today.

Assuming you have a huge block of FPU instructions you could perform in one massively parallel run on the GPU: communication over an external bus is slow, the GPU is also running slower than the processor clock, and you'd have to verify that nothing depends on the results of this data for the hundreds of cycles you'd be waiting to get it back. In other words: highly unlikely.
 
The 386 DX and SX had data pipelining, used to prefetch data into a 16-byte buffer. I remember that because there was usually an option in the BIOS to turn "pipelining" off. In the other sense, like the 80286, the 80386 had 3 pipeline stages (fetch, decode, execute), but it was far more advanced, with look-ahead and execution queuing.

The interface between the 8086/8088 and 8087 was a bit strange. When the 8086 decoder detected an FPU instruction, the 8086 suspended fetch and allowed the 8087 to take over. The 8087 had a full set of address and data pins like the 8086, and passed back control to the main processor after it was finished. It was very slow and inefficient. Later x87 FPUs could better overlap with CPU code execution.

One oddball was the Weitek FPU line. Those operated much differently from x87 FPUs: they were memory-mapped, operated by writing operands to particular memory addresses and reading the results from others. The pinout was different from x87 processors, and you may see them on older motherboards as larger coprocessor sockets, or as what looks like an oversized 80387 coprocessor socket on some 486 boards.

Plenty of processors do efficiently use coprocessors as accelerators. If you have a phone, you have one right there. An on-package coprocessor or discrete DSP ASICs usually have special purposes. For general code execution, you have two models on the PC: latency-oriented (CPU) or throughput-oriented (GPU). There is no magic bullet for seamlessly extending CPU capabilities: either add a new CPU extension, which is possible as CPU and GPU integration becomes even tighter, or use other APIs to speed up applications that can be accelerated.
 
The 386 DX and SX had data pipelining, used to prefetch data into a 16-byte buffer. I remember that because there was usually an option in the BIOS to turn "pipelining" off. In the other sense, like the 80286, the 80386 had 3 pipeline stages (fetch, decode, execute), but it was far more advanced, with look-ahead and execution queuing.

Yeah, I was specifically referring to execution pipelining. Even the 8088 had a primitive 4-byte prefetch buffer, or it would have choked on that pathetic bus :D

The 486 was the first x86 CPU with fully-pipelined integer execution, and it needed that massive L1 cache to feed it.

EDIT: according to this they made the 80387 more independent:

https://books.google.com/books?id=Y...Qgkf#v=onepage&q=how to use the 80387&f=false

But that only holds if you don't need to interact with your data very often. Again, the dependency issue rears its ugly head, and you'd have to stall the 80386 in that case.

The 486 simplified this by only executing one instruction at a time. And even though its FPU execution wasn't pipelined, it was pretty fast compared to the 80387. This meant that if you had written code optimized for the 80386/80387 (long swaths of integer code interspersed with FPU), it would still perform brilliantly: the pipelined integer unit would mow through the x86 code, and then the x87 instructions would take less time to execute than they did on the 80387. And you didn't have to stall on dependencies.
 
Yeah, I was specifically referring to execution pipelining.
The execution stage on both the 386 and 486 was only one pipeline stage. Both processors used queues to keep the ALU fed.

edit: just to clobber this... :p

Intel 386 hardware manual said:
Pipelined architecture enables the 80386 to perform instruction fetching, decoding, execution, and memory management functions in parallel. The six independent units that make up the 80386 pipeline are described in detail in Chapter 2. Because the 80386 prefetches instructions and queues them internally, instruction fetch and decode times are absorbed in the pipeline; the processor rarely has to wait for an instruction to execute.
 
The execution stage on both the 386 and 486 was only one pipeline stage. Both processors used queues to keep the ALU fed.

edit: just to clobber this... :p

That's the description of a loosely-pipelined architecture. Each stage works independently. Each stage can only do one thing at a time.

The tightly-pipelined execution stage of the 486 has higher throughput because (for simple instructions) you can execute a single instruction each clock cycle. Simple instructions took twice as long on the 386, since you could only operate on a single instruction at a time.

For example, a simple instruction like Add takes two clock cycles on the 386 (and that's all the execution unit can do):

http://pdos.csail.mit.edu/6.828/2012/readings/i386/ADD.htm

That's the lowest-latency instruction on the ALU.

But the 486 can perform pipelined execution, with a throughput of 1 instruction per clock most of the time. It can simultaneously execute multiple instructions in the 3-stage ALU pipeline. It also seems to have 2-stage decode, to speed up that portion of the chip.

See here:

https://www.cs.uaf.edu/2013/fall/cs441/Pres/Intel_80x86.pdf

As long as you have no dependencies, the 486 will execute x86 code up to twice as fast at the same clock speed. But again, the x87 unit is not pipelined.
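Just to make the "up to twice as fast" concrete, here is a back-of-the-envelope sketch using the cycle counts quoted above (1000 instructions is an arbitrary example; real code mixes instruction types, so treat this as illustrative only):

Code:
#include <cstdio>

int main() {
    const int ops = 1000;             // independent simple ALU instructions
    const double cyc386 = ops * 2.0;  // ~2 cycles each, one instruction at a time
    const double cyc486 = ops * 1.0;  // ~1 per clock once the pipeline is full
    printf("386: %.0f cycles, 486: %.0f cycles, ~%.1fx\n",
           cyc386, cyc486, cyc386 / cyc486);  // ~2.0x at the same clock
    return 0;
}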
 
That's the description of a loosely-pipelined architecture. Each stage works independently. Each stage can only do one thing at a time.

The tightly-pipelined execution stage of the 486 has higher throughput because (for simple instructions) you can execute a single instruction each clock cycle. Simple instructions took twice as long on the 386, since you could only operate on a single instruction at a time.
No, it's not. It's just plain pipelining, where one discrete stage feeds into the next one, with movement controlled by a clock.

Performance enhancements happen in each generation, and they are not the defining characteristic of whether a processor uses a pipeline or not. The 80286 had a 3-stage pipeline, just like the 80386, but the 80386 had much higher throughput due to prefetching and execution queuing. Instruction latency isn't what determines whether execution is pipelined or not, either. That's just grasping. :p
 
No, it's not. It's just plain pipelining, where one discrete stage feeds into the next one, with movement controlled by a clock.

Performance enhancements happen in each generation, and they are not the defining characteristic of whether a processor uses a pipeline or not. The 80286 had a 3-stage pipeline, just like the 80386, but the 80386 had much higher throughput due to prefetching and execution queuing. Instruction latency isn't what determines whether execution is pipelined or not, either. That's just grasping. :p

Never mind. We're arguing over nothing.
 
I am no chip designer, just a clueless layman, but when AMD went from Excavator to the Zen architecture they also shifted back from the flex-FPU to a traditional core-dedicated FPU design like Intel's, so I guess this philosophy shift from the experimental/alternative approach back to the classic one may also have killed the potential for the AMD "context-switching GPU/FPU combo unit" discussed/proposed in this thread.
I mean, it's still feasible at any time, but it probably no longer fits the company philosophy, out of fear it would run into similar computation-delay problems in coordinating between units - which seems to be the reason for killing the flex-FPU after Excavator. I always thought resonant hybrid clock-mesh technology would counteract that problem, but Excavator and Zen seem to have dropped it.
Being innovative and new was not paying the bills; they saw how sticking with the Bulldozer family was going to kill them.

The head of chip design said something to the effect that the Bulldozer architecture was a mistake that would take 5 years to fix. Well, guess what: 2016 marks the 5th year, so Zen is the fix that will hopefully get them back on track.

TBH, they need a parallel line of products to run experiments on, something they could drop if it doesn't work out - like a mobile SoC, or a coprocessor card that CPU and GPU tasks could be offloaded to. People could install it in any PCIe slot and it would work as if a second APU had been added to the machine, so if you have a machine with a soldered-in APU, a much more powerful card could be added that improves overall system performance...

Bulldozer > Piledriver > Steamroller > Excavator

They knew when Bulldozer was developed that it was not going to work out how they wanted, and the performance was not up to competing with Intel, who stuck with the tried-and-true technology...
 
GPUs are in no way "orders of magnitude more efficient" at FP math than CPUs; they just have orders of magnitude more cores. To take two extreme examples, the 5960X can do 187.27 GFLOPS in Whetstone, where a Fury X is rated at 8600 GFLOPS - but the 5960X only has 8 cores @ 3.5 GHz (turbo) against the Fury X's 4096 cores @ 1.05 GHz. Each Fury X core is only capable of doing ~2 floats per cycle with a 1 ns response time, where the 5960X can do ~6.7 floats per cycle with a 0.28 ns response time.

So even in a magical world where you have a Fury X as your iGPU and can instantly transmit the instruction from the CPU to the iGPU and back again with no penalty, it would still take over 3x as long to get the result from the iGPU as it would for the CPU to just do it itself. This only gets worse when the problem is a linear one where you can't calculate the next step until the first one is complete, which is where CPUs excel. Even if you're talking about an AMD CPU with its much poorer FPU performance, unless you dedicate a whole lot of silicon to letting each core instantly dump everything relevant from deep in its own pipeline straight into a floating point unit half an inch or more away on the chip, there's going to be huge latency involved in transmitting data for small batches of floating point math - and the silicon real estate required to make that not suck would be better spent just putting the floating point unit in the CPU itself.

TL;DR: The FPU designs in a GPU and a CPU are two entirely different beasts: a CPU needs a small number of problems solved in a short time, while a GPU needs a large number of independent problems solved, with less of a time constraint on each individual problem.

Just going to quote this, as it basically answers the question. GPUs and CPUs are entirely different beasts, and you shouldn't try to mix and match without a damn good reason.
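To show where the per-core figures in that quote come from, here is a quick sketch deriving them from the quoted peak GFLOPS, core counts and clocks (the spec numbers are taken from the quote above, not independently verified):

Code:
#include <cstdio>

int main() {
    // Figures as quoted above: peak GFLOPS, core count, clock in GHz.
    const double cpu_gflops = 187.27, cpu_cores = 8,    cpu_ghz = 3.5;
    const double gpu_gflops = 8600.0, gpu_cores = 4096, gpu_ghz = 1.05;

    // Per-core throughput (FLOPs per core per cycle) and cycle time (ns).
    printf("5960X : %.1f FLOPs/core/cycle, %.2f ns per cycle\n",
           cpu_gflops / (cpu_cores * cpu_ghz), 1.0 / cpu_ghz);  // ~6.7, ~0.29 ns
    printf("Fury X: %.1f FLOPs/core/cycle, %.2f ns per cycle\n",
           gpu_gflops / (gpu_cores * gpu_ghz), 1.0 / gpu_ghz);  // ~2.0, ~0.95 ns
    return 0;
}

The roughly 3x gap in cycle time is where the "over 3x as long per result" claim comes from for a single dependent chain of operations.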
 
You could use two coordinated clock meshes:
one for the integer-side cores and one for the FP GPU, separated in frequency by a factor of about 2 or 3.

But more importantly, the iFPGPU (and with it the whole CPU/APU) could maybe not use RAM at all, but HBM, for a higher clock frequency within the FP GPU.

But I have no idea about chip making, really. There will be errors in my thinking; I am just trying to inspire here somehow.

The CPU frequency-boost mode would have to be dropped completely to coordinate the CPU with its new floating point unit in the second clock mesh without producing any delays.

Low-power floating point operation would prefer GPU frequencies over CPU frequencies, and since GPUs have lower frequencies than CPUs, the chip wouldn't tend to be inefficient.

The current Zen architecture is very well suited to the APU architecture proposed in this thread. AMD traditionally combined many pipes of the same type in a row within one core (the typical AMD industrialisation/generalist powerhouse approach versus Intel's classic fine specialisation and customisation approach).
On die, the integer-side cores would have to form an outer area ("brain skin") around a central FP-GPU field, like the human cortex containing the logic layers - which is why our brains look folded from above, growing the logic surface (or the processor could use an interleaved-comb layout to lengthen the GPU pipes for more performance, but that is less analogous to our own evolution and tradition). You can see ourselves as impulse-based networks within a neural mesh: nearly digital, but using electric impulse timing instead of 1s and 0s.

Phenomenologically, the consciousness of such an architecture would be light-centred, just like ourselves.
 
Apologies, Serpentedsky.



Intel's and Nvidia's GPGPUs have not been selling well in the broad consumer market, unable to compete power-wise with Nvidia and Radeon graphics cards (on an opportunistic note, stop reminding me I fell for the x86 hype).


Back on the context-switching "iFPGPU" topic: the true purpose of integrating a bigger GPU in place of the APU's classic FPUs should be to allow more "solo" integer pipes on the same chip area.


To determine whether this thread's "iFPGPU" architecture increases or decreases chip surface compared to normal APUs, we have to know whether the chip-surface efficiency is:

- frequency dependent (advantage tending towards the high-frequency classic FPU), or

- FLOP-performance dependent (advantage tending towards the "iFPGPU", which is my guess).


- Then we still have to check whether there is any context-switching delay when using the new hUMA context-address memory system along with two coordinated resonant clock meshes for the integer CPU and the iFPGPU, separated in frequency by a single-digit factor.

(...killing any asymmetric boost function, but this is not an issue in low-power APUs; also, low-power 3D-transistor floating point operation would prefer GPU frequencies over CPU frequencies.)

The current Zen architecture is very well suited to this new APU architecture. AMD traditionally combined many pipes of the same type in a row within one core (the typical AMD industrialisation/versatility approach versus Intel's classic fine specialisation and customisation approach).
There is a very simple rule for this: if it were that easy, it would have been done already. ;)

There are some things that are obvious improvements, but that doesn't mean this is impossible - rather, it is not feasible at this point in time.
The closest existing approach is AMD's HSA, which requires companies to make their software HSA compatible - it has to be programmed for:
This groundbreaking technology allows CPU and GPU cores to speak the same language and share workloads and the same memory. With HSA, CPU and GPU cores are designed to work together in a single accelerated processing unit (APU), creating a faster, more efficient and seamless way to accelerate applications while delivering great performance and rich entertainment.

It does allow some benefits, but this is far from a hardware solution. You could wonder what can be done in the BIOS, but this is so far removed from a mainstream solution that even to date there are no HSA-compatible chips besides AMD's. Unless that takes off, the other solutions don't make sense.
 
If Windows RT were used, could we save the unacceptably slow "Knights Landing / Larrabee / MIC (many integrated core)" architecture by adapting the same concept to a faster RISC instruction set instead?
 
Just... no. If it were that simple, they would have done it already. Chip designers don't look at these forums for chip design ideas anyways, so any speculation you're making is entirely useless.
 
Holy Necro Batman.

I commented on this thread back in 2012.

Someone's social distancing must really have resulted in them finishing the internet.

This question has already been answered.

GPUs are really good at highly parallel workloads. FPU calculations CAN be parallel in nature, but don't have to be, just like integer workloads. There is nothing about general FPU workloads that makes them better suited to run on a GPU.

Down the road I wouldn't be surprised if compilers start automatically identifying portions of code that would be better suited for GPU execution and automatically offload those portions to the GPU, but this has nothing to do with FPU vs Integer math.
 
Down the road I wouldn't be surprised if compilers start automatically identifying portions of code that would be better suited for GPU execution and automatically offload those portions to the GPU, but this has nothing to do with FPU vs Integer math.

I think that won't do unless the compiler is doing its job on the client machine. It can't know in advance what kind of GPU, if any, the target machine has.
 