AMD publishes Zen 2 compiler patch, exposing some new instructions

Anything that would be advantaged by AVX512 would be better served by being thrown at a GPU or dedicated hardware of some sort; registers that wide are just stupid on a CPU. I would much prefer that AMD spend their transistor budget on things that are actually useful.
 

AVX512 comes with Zen3 if the leaks are correct.
 

Still not anything that most will care about, though; AVX512 is just too niche, as very few programs are written to take advantage of it.
 
Yeah two whole programs? Lol Epyc is a lemon now guys.
 

AVX512 is a standard in the server/HPC space. It is also relatively important for workstations.
 
It's odd that they didn't add it, IMHO. Didn't they widen the FPU to 256 bits? Why not then implement AVX512 by fusing 2x 256-bit units? Seems like low-hanging fruit. Or am I missing something?
 

No driving need, as almost no programs utilize it except benchmarks. You've got to have demand to spend engineering resources on it, and I think the new Zen design ate most of their time. If it comes in Zen 3, then likely they just didn't feel they had adequate time to implement it.
 
The resources required would be comparatively small, unless I'm missing something. Even from a marketing angle, it's worth winning comparisons like this if it's not expensive to do so.
 

Might very well be the case, and for compatibility they probably should implement it. But outside of niche applications AVX512 is pretty useless: 16x16 (float) matrix math, "large" array summations. I mean, it's good for machine learning or scientific work, but a GPU would still be leaps and bounds better most of the time. And remember, with any SIMD it's easy to hit a wall with the memory bus: make the dataset too "large" and it spills out of cache, making the SIMD utterly pointless from a performance point of view. And that's before we even talk about the complications of rearranging data to fit into such large registers without taking a performance hit just from the prearranging, which negates the benefit of them in the first place.
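The memory-wall point can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the 3 GHz clock and the ~40 GB/s dual-channel DDR4 figure are illustrative assumptions, not measured numbers.

```python
# Back-of-the-envelope: can DRAM feed a 512-bit SIMD unit doing a
# streaming sum of 32-bit floats once the data spills out of cache?
clock_hz = 3.0e9            # assumed core clock
lanes = 512 // 32           # 16 float32 elements per 512-bit register
bytes_per_elem = 4          # sizeof(float32)

# One 512-bit add per cycle consumes 16 elements * 4 bytes = 64 bytes.
bytes_needed_per_sec = clock_hz * lanes * bytes_per_elem   # 192 GB/s

dram_bw = 40e9              # assumed dual-channel DDR4-class bandwidth, B/s
utilization = dram_bw / bytes_needed_per_sec

print(f"needed: {bytes_needed_per_sec / 1e9:.0f} GB/s, "
      f"DRAM can supply ~{utilization:.0%} of that")
```

Under these assumptions the unit wants 192 GB/s but DRAM delivers roughly a fifth of that, so the vector unit sits idle ~80% of the time on an out-of-cache stream, which is the "utterly pointless" case above.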
 
Not having AVX512 is a deal breaker for me. Will be buying Intel next gen or keeping my 2700. AMD screws up their launches every time.
 
A big reason why wider SIMD isn't always better: except for the cases where data is naturally in order, you have to swizzle it into order. For data wider than 128 bits (SSE), every 128 bits is a lane; AVX is 2 lanes and AVX512 is 4 lanes, respectively.

There are many instructions for manipulating data within a lane (shuffle, unpack, etc.), but that does you no good when you need to operate on a linear set of data. To do that you need to pre-lane your data, and there is really only one instruction, and one port, to do that: permute*. For an AVX256 2-lane reorder it usually takes one pass to order your data, but, as you might guess, it usually takes 4 passes to order an AVX512 data set.

Now if your data is long enough and the problem complicated enough, this is all worth it. But more often than not, port 5 pressure can erase much of the gains of a wide operation.

Even worse, many compilers (cough cough, gcc) will default to using vinsertf128 because, for pre-Haswell/Bulldozer chips, in many cases it may be faster. Someday the default will change, but for now this seems to be the case.

* Yes, there are 2 instructions, one floating point and one integer, but they do the same thing. Ironically, the FP one is in AVX and the int one is in AVX2. Someone in the know, please explain this to me.
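The lane restriction described above can be sketched in plain Python. The "register" here is just a list of 8 floats split into two 128-bit lanes of 4; the two functions model the AVX-style semantics, not any real intrinsic.

```python
# Model an AVX register of 8 float32 values as two 128-bit lanes of 4.
LANE = 4

def in_lane_shuffle(reg, idx):
    """In-lane shuffle (in the spirit of vshufps): each output element
    can only pick from the 4 elements of its OWN lane, so the same
    4-entry index pattern is applied to every lane independently."""
    out = []
    for lane_start in range(0, len(reg), LANE):
        lane = reg[lane_start:lane_start + LANE]
        out.extend(lane[i] for i in idx)      # idx entries are 0..3
    return out

def cross_lane_permute(reg, idx):
    """Full permute (in the spirit of vpermps): any output element can
    pick any input element -- this is the port-pressured operation."""
    return [reg[i] for i in idx]

reg = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Reversing within each lane is easy with the in-lane shuffle...
print(in_lane_shuffle(reg, [3, 2, 1, 0]))
# [3.0, 2.0, 1.0, 0.0, 7.0, 6.0, 5.0, 4.0]

# ...but a full reversal of the register needs the cross-lane permute.
print(cross_lane_permute(reg, [7, 6, 5, 4, 3, 2, 1, 0]))
# [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
```

No in-lane index pattern can produce the full reversal, because elements 0-3 and 4-7 would have to swap lanes; that cross-lane traffic is exactly what the permute instruction, and its single execution port, exist for.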
 
The interview with Mark Papermaster suggests that it is still possible:

IC: With the FP units now capable of doing 256-bit on their own, is there a frequency drop when 256-bit code is run, similar to when Intel runs AVX2?

MP: No, we don’t anticipate any frequency decrease. We leveraged 7nm. One of the things that 7nm enables for us is scale in terms of cores and FP execution. It is a true doubling, because we didn’t only double the pipeline width, but we also doubled the load-store and the data pipe into it.

IC: Now the Zen 2 core has two 256-bit FP pipes, can users perform AVX512-esque calculations?

MP: At the full launch we’ll share with you exact configurations and what customers want to deploy around that.
 

The float version of the permute likely normalizes the float before it stores it (which might reduce losses accumulated in future operations). And actually, Intel's docs show there being 7 opcodes, not just 2: additional versions for double precision and for different target registers. vinsertf128 is likely preferred by the compiler because it encodes into fewer bytes than other options for a similar operation.

Also, I would be surprised if current GCC still uses this with new CPU arch profiles on Intel chips; however, I can't confirm, as I haven't checked!

Honestly, if Intel wanted to give me something useful, they'd give me some packed decimal formats like you find in IBM Z-series chips. Decimal128 support would be awesome, right up there with the legalization/mandate of shock collars for programmers who use floats for money calcs, or for precision/accuracy-critical scientific data.
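The floats-for-money gripe is easy to demonstrate. The sketch below uses Python's decimal module, a software stand-in for the hardware decimal support being wished for here, to contrast binary and base-10 arithmetic:

```python
from decimal import Decimal

# Binary floats cannot represent 0.10 exactly, so ten dimes != a dollar.
total_float = sum([0.1] * 10)
print(total_float == 1.0)            # False: rounds to 0.9999999999999999

# Decimal arithmetic works in base 10, so currency amounts stay exact.
total_dec = sum([Decimal("0.10")] * 10)
print(total_dec == Decimal("1.00"))  # True
```

The difference is not a rounding bug in the language; it's inherent to binary floating point, which is exactly why decimal formats (in software today, ideally in hardware) matter for money.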

Or dump AVX512, stick with AVX256, and increase the number of threads per core and the number of FP execution units for the scheduler to fill out. That would give more throughput to the applications that need it, and it could be disabled for the gamers who want single-thread performance above all else; games generally don't use the full width of even AVX256 (some fractal equations in No Man's Sky can use AVX256, if that is yer thing). Unless games start using doubles instead of floats, of course; then it's all pretty useless.
 