floating point operations per cycle

How do I determine how many floating-point operations per cycle my CPU is able to do? I have the rest of the formula to determine the FLOPS.
sockets * (cores per socket) * (number of clock cycles per second) * (number of floating-point operations per cycle).
I am charting FLOPS from past supercomputers and seeing about when consumer-level computers (PC and smartphone) surpassed the performance of the supercomputers.
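If it helps, here's that formula as a quick Python sketch; the FLOPs-per-cycle factor is the part I'm asking about (the other numbers in the example are made up):

Code:
# Peak theoretical FLOPS = sockets * cores per socket * clock (Hz) * FLOPs per cycle
def peak_flops(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    return sockets * cores_per_socket * clock_hz * flops_per_cycle

# Made-up example: 1 socket, 4 cores, 3.0 GHz; flops_per_cycle is the unknown here
print(peak_flops(1, 4, 3.0e9, flops_per_cycle=8) / 1e9, "GFLOPS")  # 96.0 GFLOPS with 8/cycle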
 
AVX is what you need to dig into. You'll have to look at your processor's architecture to determine how many vector operations it can process in a single cycle (i.e., how many vector units it has) and how wide each one is.

It also depends on what you're doing. If you're doing multiply-accumulate, you get twice as many operations per clock (it's a trick most GPUs use).

But if you're not doing that, just assume a single operation per clock, per data slot in the AVX units. How many slots you get depends on the size of the data type.

Single-precision FP is 32 bits.

So for a processor with 2 AVX 256-bit units, you get 256 + 256 = 512 bits of total vector width; divide that by 32 to get the number of 32-bit slots, which is the peak operations per clock.

512 / 32 = 16 slots available = 16 SP FLOPS/cycle.
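To make the slot math concrete, here's a rough Python sketch of the same calculation (the 2 x 256-bit unit configuration is just the example above, not a statement about any particular CPU):

Code:
# Peak vector "slots" per cycle = total vector width / data type width
avx_units = 2           # example from above: two 256-bit AVX units
unit_width_bits = 256
sp_bits = 32            # single-precision float
dp_bits = 64            # double-precision float

total_width_bits = avx_units * unit_width_bits   # 512 bits
sp_slots = total_width_bits // sp_bits            # 16 -> 16 SP FLOPS/cycle without FMA
dp_slots = total_width_bits // dp_bits            # 8  -> 8 DP FLOPS/cycle without FMA
print(sp_slots, dp_slots)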
 

Don't forget FMA! Every AVX op counts as two FLOPS if the processor supports FMA.
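With the 2 x 256-bit example above, that works out to 16 SP slots x 2 FLOPS per FMA = 32 SP FLOPS/cycle peak (assuming every instruction issued is an FMA).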
 
I already talked about single-cycle multiply-accumulate. Read the whole post next time.

It's an operation mostly used by GPUs, but you can find some simulation workloads that use it as well - I wouldn't automatically assume it's in use on a server, though.

That's why I said a single FLOP/cycle per AVX slot.
 

Oops, didn't see that, sorry! I think single cycle MACs typically factor into supercomputer performance measurements, since dense matrix math makes great use of them.
I think the OP is mostly looking for FP performance data for historical purposes - in practice, peak vector FLOPS are somewhat tenuous, since a lot of applications have computation patterns that aren't readily vectorized or are completely bandwidth-bound.

To answer the OP's questions:

Sandy Bridge/Ivy Bridge: 8 DP FLOPS/cycle
Haswell/Zen2/Skylake client: 16 DP FLOPS/cycle via single cycle MAC
Skylake-SP: 32 DP FLOPS/cycle
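Plugging those per-cycle figures into the OP's formula, as a rough sketch (the socket count, core count, and clock speed below are made-up example values, not tied to real SKUs):

Code:
# DP FLOPS/cycle figures from the list above
dp_flops_per_cycle = {
    "Sandy Bridge/Ivy Bridge": 8,
    "Haswell/Zen2/Skylake client": 16,
    "Skylake-SP": 32,
}

# Hypothetical machine: 1 socket, 4 cores at 3.5 GHz
sockets, cores, clock_ghz = 1, 4, 3.5
for arch, fpc in dp_flops_per_cycle.items():
    print(f"{arch}: {sockets * cores * clock_ghz * fpc:.0f} peak DP GFLOPS")
# -> 112, 224, and 448 peak DP GFLOPS respectively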
 