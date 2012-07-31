It's certainly possible that future x86 ISA extensions will be able to integrate the arrays of ALUs that are already on most cores nowadays. There are just several problems standing in the way.

Would the operation be any more efficient being routed to the GPU? A stream of dependent instructions works well in the CPU paradigm. There are execution ports where micro-ops can be queued and scheduled, often out of order, and untangled afterwards. That is very much unlike how efficiency is extracted from GPUs, whether on die or off. For example, if a memory access depends on the FP result to have valid data, or flow control depends on the outcome, the overhead of configuring the GPU to perform the calculation and of receiving the result back can be large (full cache coherency could help if bundles were sent on locked cache lines, but that limits data size and can have other effects on performance). A higher-level set of instructions that performs those operations asynchronously on the GPU is already available via the various GPGPU APIs, and that approach has additional benefits.

Also, the results of GPU operations aren't necessarily as precise as those of CPU SIMD and FPU operations. Error behavior also differs considerably between GPUs and CPUs. IOW, the ALUs in a GPU aren't direct replacements for the CPU's SIMD unit(s) and FPU. It's not insurmountable, but it requires a somewhat different programming style.

It's hard to imagine how huge arrays of ALUs could be effectively integrated into a CPU at a low ISA level in any way that would be general and efficient across architectures. GPUs thrive when given lots of non-dependent work, with many threads at once to hide memory latency, while (modern x86) CPUs thrive on extracting parallelism from general code, but still have relatively low IPC.

Xeon Phi (Larrabee family incl.
actual products Knights Ferry/Landing/etc.) uses multiple wide SIMD units per core (each 512 bits wide, with 32 vector registers), so it skips many of the problems an uncore GPU faces. Programming it is similar to AVX/other SIMD programming. It seems to be doing quite well in HPC and rendering farms, unsurprisingly exactly the markets it targets.

It would surprise me more to see GPUs getting a low-level x86 extension on CPUs than to see mainstream x86 CPUs getting multiple parallel SIMD units (possibly even based on a stripped-down GPU SM unit) a la Xeon Phi. Why? Because GPU code, in order to run efficiently across multiple generations/architectures, is abstracted, and handling that at the API level is better for the types of workloads where GPUs excel. Note that we don't have an x86 3D rendering instruction extension. (Sometimes bad convergence is bad.)

tl;dr: It's more likely that CPUs will get more/wider SIMD units than that iGPUs will get a low-level x86 extension.
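
To make the distinction between a dependent instruction stream and non-dependent work a bit more concrete, here is a minimal C sketch. The function names and the arithmetic are made up purely for illustration; nothing here is tied to a particular CPU, GPU, or API.

```c
#include <stddef.h>

/* Dependent chain (hypothetical example): every iteration needs the
 * previous result, so extra ALUs, SIMD lanes, or GPU threads cannot
 * speed it up. This is the kind of work a CPU core handles well. */
float dependent_chain(float x, size_t steps) {
    for (size_t i = 0; i < steps; i++) {
        x = x * 0.5f + 1.0f;
    }
    return x;
}

/* Non-dependent work (hypothetical example): each element is computed
 * independently of the others, so the loop maps naturally onto wide
 * SIMD units or onto many GPU threads. */
void independent_map(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * 0.5f + 1.0f;
    }
}
```

Both loops do the same arithmetic per step; only the second one exposes the kind of parallelism that wide SIMD units or a GPU can exploit, and a vectorizing compiler may well turn it into SIMD code on its own.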