The article stated: "As a result of this new partitioning, each Ampere SM partition can execute either 32 FP32 instructions per clock or 16 FP32 and 16 INT32 instructions per cycle. You’re essentially trading integer performance for twice the floating-point capability. Fortunately, as the majority of graphics workloads are FP32, this should work towards NVIDIA’s advantage." But I must still be misunderstanding something: if the majority of workloads are FP32-focused, then the sheer number of cores should be able to complete the workload much faster...
You still have to feed data to those execution units, and that takes a lot of bandwidth. Caches help, but only so much. For compute workloads whose data fits entirely in cache, you see GPUs getting close to their peak instruction throughput. That isn't the case for games.
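The interaction between compute peak and bandwidth is the classic roofline model: attainable throughput is the lower of the raw ALU peak and (bandwidth × arithmetic intensity). A minimal sketch, using assumed round numbers for an RTX 3080-class GPU (not measured figures):

```python
# Roofline-model sketch (hypothetical numbers, not measurements):
# attainable FP32 throughput is capped by whichever is lower --
# the ALU peak, or memory bandwidth times arithmetic intensity.

PEAK_FP32_TFLOPS = 29.8   # assumed FP32 peak, roughly RTX 3080-class
BANDWIDTH_GBPS = 760.0    # assumed memory bandwidth in GB/s

def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline: min(compute peak, bandwidth * arithmetic intensity)."""
    # GB/s * FLOP/byte -> GFLOPS; divide by 1000 for TFLOPS
    bandwidth_limited = BANDWIDTH_GBPS * flops_per_byte / 1000.0
    return min(PEAK_FP32_TFLOPS, bandwidth_limited)

# A streaming workload (low arithmetic intensity) is bandwidth-bound:
print(attainable_tflops(4.0))    # 3.04 TFLOPS -- nowhere near peak
# A cache-friendly workload (high intensity) can hit the ALU peak:
print(attainable_tflops(100.0))  # capped at 29.8 TFLOPS
```

The point of the model: doubling FP32 units raises the flat part of the roof, but workloads sitting on the bandwidth slope don't move at all.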
In games, only some parts of a frame are FP32-heavy; other parts are entirely bandwidth- or fillrate-limited. You will never see perfect scaling from FP32 alone. You need to scale up everything else too.
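That's just Amdahl's law applied to a frame: only the FP32-bound portion benefits from doubled FP32 throughput. A quick sketch, where the 40% FP32-bound fraction is an illustrative assumption, not a measurement:

```python
# Amdahl-style sketch: if only a fraction of frame time is FP32-bound,
# doubling FP32 throughput speeds up only that fraction.

def frame_speedup(fp32_fraction: float, fp32_speedup: float) -> float:
    """Overall frame speedup when only fp32_fraction of the time scales."""
    return 1.0 / ((1.0 - fp32_fraction) + fp32_fraction / fp32_speedup)

# Doubling FP32 when an assumed 40% of the frame is FP32-limited:
print(frame_speedup(0.4, 2.0))  # 1.25 -- far from the naive 2x
```

So even a frame that spends a sizeable chunk of its time in FP32-heavy shading sees only a modest overall gain, which is why the doubled core count doesn't translate into doubled frame rates.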