floating point operations per cycle

How do I determine how many floating-point operations per cycle my CPU is able to do? I have the rest of the formula to determine the FLOPS.
sockets * (cores per socket) * (number of clock cycles per second) * (number of floating-point operations per cycle).
I am charting FLOPS from past supercomputers and seeing about when consumer-level computers (PC and smartphone) surpassed the performance of the supercomputers.
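If it helps, here's that formula as a quick Python sketch; the FLOPs-per-cycle factor is the part I'm asking about (the other numbers in the example are made up):

Code:
# Peak theoretical FLOPS = sockets * cores per socket * clock (Hz) * FLOPs per cycle
def peak_flops(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    return sockets * cores_per_socket * clock_hz * flops_per_cycle

# Made-up example: 1 socket, 4 cores, 3.0 GHz; flops_per_cycle is the unknown here
print(peak_flops(1, 4, 3.0e9, flops_per_cycle=8) / 1e9, "GFLOPS")  # 96.0 GFLOPS with 8/cycle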
 
AVX is what you need to dig into. You'll have to look at your processor's architecture to determine how many vector operations it can process in a single cycle (i.e., how many vector units it has) and how wide each one is.

It also depends on what you're doing. If you're doing multiply-accumulate, you get twice as many operations per clock (it's a trick most GPUs use).

But if you're not doing that, just assume a single operation per clock, per data slot in the AVX units. How many slots you get depends on the size of the data type.

Single-precision FP is 32 bits.

So for a processor with 2 AVX 256-bit units, you get 256 + 256 = 512 bits of total vector width; divide that by 32 to get the number of 32-bit slots, which is the peak operations per clock.

512 / 32 = 16 slots available = 16 SP FLOPS/cycle.
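To make the slot math concrete, here's a rough Python sketch of the same calculation (the 2 x 256-bit unit configuration is just the example above, not a statement about any particular CPU):

Code:
# Peak vector "slots" per cycle = total vector width / data type width
avx_units = 2           # example from above: two 256-bit AVX units
unit_width_bits = 256
sp_bits = 32            # single-precision float
dp_bits = 64            # double-precision float

total_width_bits = avx_units * unit_width_bits   # 512 bits
sp_slots = total_width_bits // sp_bits            # 16 -> 16 SP FLOPS/cycle without FMA
dp_slots = total_width_bits // dp_bits            # 8  -> 8 DP FLOPS/cycle without FMA
print(sp_slots, dp_slots)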
 

Don't forget FMA! Every AVX op counts as two FLOPS if the processor supports FMA.
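With the 2 x 256-bit example above, that works out to 16 SP slots x 2 FLOPS per FMA = 32 SP FLOPS/cycle peak (assuming every instruction issued is an FMA).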
 
I already talked about single-cycle multiply-accumulate. Read the whole post next time.

It's an operation mostly used by GPUs, but you can find some simulation workloads that use it as well - I wouldn't automatically assume it's in use on a server, though.

That's why I said a single FLOP/cycle per AVX slot.
 

Oops, didn't see that, sorry! I think single cycle MACs typically factor into supercomputer performance measurements, since dense matrix math makes great use of them.
I think the OP is mostly looking for FP performance data for historical purposes - in practice, peak vector FLOPS are somewhat tenuous, since a lot of applications have computation patterns that aren't readily vectorized or are completely bandwidth-bound.

To answer the OP's questions:

Sandy Bridge/Ivy Bridge: 8 DP FLOPS/cycle
Haswell/Zen2/Skylake client: 16 DP FLOPS/cycle via single cycle MAC
Skylake-SP: 32 DP FLOPS/cycle
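Plugging those per-cycle figures into the OP's formula, as a rough sketch (the socket count, core count, and clock speed below are made-up example values, not tied to real SKUs):

Code:
# DP FLOPS/cycle figures from the list above
dp_flops_per_cycle = {
    "Sandy Bridge/Ivy Bridge": 8,
    "Haswell/Zen2/Skylake client": 16,
    "Skylake-SP": 32,
}

# Hypothetical machine: 1 socket, 4 cores at 3.5 GHz
sockets, cores, clock_ghz = 1, 4, 3.5
for arch, fpc in dp_flops_per_cycle.items():
    print(f"{arch}: {sockets * cores * clock_ghz * fpc:.0f} peak DP GFLOPS")
# -> 112, 224, and 448 peak DP GFLOPS respectively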
 