I don't think Fury was actually inefficient; in DX11 the overhead held it back. In DX12 it performs just as it should in AotS (which is compute heavy).
Its geometry performance was abysmal compared to Maxwell, though, and in the absence of async compute it couldn't saturate the shader array because the geometry work was stalling the pipeline.
The latest Nvidia drivers upended all my results. Previously the Fury X and 980 Ti were performing identically in AotS (flop for flop) with async enabled on the Fury; now the 980 Ti appears to pull ahead by around 8%.
With async enabled, the 980 Ti now matches the Fury X, whereas before it had been around 10% slower.
That's what I meant. The shader power was going to waste in a lot of situations where DX12 actually pushed the card. But AMD themselves said that with GCN 4.0 their goal was to reduce waste and increase efficiency by upgrading their front end to better utilize the shaders.