Long article:
https://chipsandcheese.com/2023/09/...erformance-on-nvidias-4090-and-amds-7900-xtx/
Let’s examine a particular scene.
The profiled scene
We analyzed this scene using Nvidia’s Nsight Graphics and AMD’s Radeon GPU Profiler to get some insight into why Starfield performs the way it does. On the Nvidia side, we covered the last three generations of cards by testing the RTX 4090, RTX 3090, and Titan RTX. On AMD, we tested the RX 7900 XTX. The i9-13900K was used to collect data for all of these GPUs.
We’ll be analyzing the three longest-duration calls, because digging through each of the ~6800 events would be impractical.
1. Longest Duration Compute Shader
Cache access latency is very high on GPUs, so higher occupancy often correlates with better utilization. Register allocation doesn’t differ much between AMD and Nvidia, but Nvidia’s much smaller register file means its architectures can’t keep as much work in flight per SIMD lane.
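The latency-hiding argument here is essentially Little’s law: sustaining a given request rate at a given latency requires latency × rate requests in flight. A minimal sketch with assumed numbers (the ~300-cycle cache latency is a placeholder, not a measured figure):

```python
def inflight_needed(latency_cycles: int, requests_per_cycle: float) -> float:
    """Little's law: concurrency required to sustain a request rate at a given latency."""
    return latency_cycles * requests_per_cycle

# Hypothetical ~300-cycle cache latency at one access per cycle per SIMD:
# roughly 300 accesses must be in flight, which only high occupancy can supply.
print(inflight_needed(300, 1.0))  # 300.0
```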
The takeaway from this shader is that AMD’s RDNA 3 architecture is better set up to feed its execution units. Each SIMD has three times as much vector register file capacity as Nvidia’s Ampere, Ada, or Turing SMSPs, allowing higher occupancy.
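As a back-of-the-envelope illustration, register-limited occupancy falls out of simple division. The 192 KB (RDNA 3 SIMD) and 64 KB (Ampere/Ada SM partition) register-file sizes match the 3x ratio above; the 64-VGPR shader and the per-scheduler wave caps are assumptions for the sketch, not figures from the profile:

```python
def max_waves(regfile_bytes, regs_per_thread, wave_width, hw_wave_limit):
    """Waves that fit in the register file, capped by the scheduler's wave-slot limit."""
    bytes_per_wave = regs_per_thread * 4 * wave_width  # 32-bit VGPRs
    return min(regfile_bytes // bytes_per_wave, hw_wave_limit)

# Hypothetical shader using 64 VGPRs per thread:
rdna3  = max_waves(192 * 1024, 64, 32, 16)  # RDNA 3 SIMD, wave32
ampere = max_waves(64 * 1024, 64, 32, 12)   # Ampere SMSP, 32-wide warps
print(rdna3, ampere)  # 16 8
```

With the same per-thread allocation, the SIMD with three times the register capacity simply keeps more waves resident.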
2. Longest Duration Pixel Shader
The takeaway from this shader is that AMD achieves very high utilization thanks to very high occupancy. In fact, utilization is so high that AMD is compute bound. Nvidia hardware does well in this shader too, but not quite as well, because it again lacks the register file capacity to keep as much work in flight.
3. Second Longest Compute Shader
AMD opted to run this shader in wave64 mode, in contrast to the wave32 mode used before. Nvidia has no wave64 mode, and the green team’s compiler likely allocated fewer registers per thread as well.
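A sketch of why wave64 can help when the per-SIMD wave-slot limit, rather than the register file, is the binding constraint. The 48-VGPR shader is hypothetical; 16 waves per SIMD and a 192 KB vector register file are RDNA 3’s published limits:

```python
def threads_in_flight(regfile_bytes, regs_per_thread, wave_width, wave_limit):
    """Threads a SIMD can keep resident, given its register file and wave-slot cap."""
    bytes_per_wave = regs_per_thread * 4 * wave_width  # 32-bit VGPRs
    waves = min(regfile_bytes // bytes_per_wave, wave_limit)
    return waves * wave_width

# Hypothetical 48-VGPR shader on an RDNA 3 SIMD:
w32 = threads_in_flight(192 * 1024, 48, 32, 16)  # wave32: the 16-wave cap binds
w64 = threads_in_flight(192 * 1024, 48, 64, 16)  # wave64: twice the lanes per slot
print(w32, w64)  # 512 1024
```

When the wave-slot cap is the limiter, each wave64 slot carries twice the lanes, doubling the threads available to hide latency.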
This compute shader has a larger hot working set than the prior ones, and L1 hitrate is lower on every GPU we tested. Nvidia’s GPUs have a hard time keeping accesses within their L1 caches. AMD nevertheless enjoys a higher hitrate from its small 32 KB L0 cache, though the larger 256 KB L1 barely enters the picture with a measly 13.5% hitrate.
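The per-level numbers compound: each cache only sees the accesses the level above missed. A small sketch using the 13.5% L1 hitrate quoted above (the 60% L0 hitrate is an invented placeholder, not a profiled number):

```python
def traffic_per_level(hitrates):
    """Fraction of all accesses served at each cache level, in order;
    the final entry is the fraction that falls through to the next level."""
    served, reaching = [], 1.0
    for h in hitrates:
        served.append(reaching * h)
        reaching *= 1.0 - h
    served.append(reaching)
    return served

# Hypothetical 60% L0 hitrate; the 13.5% L1 hitrate is from the profile.
l0, l1, to_l2 = traffic_per_level([0.60, 0.135])
print(l0, l1, to_l2)  # 0.6, 0.054, ~0.346
```

Even though 13.5% sounds small, the L1 serves only ~5% of total accesses here, and roughly a third of all traffic still falls through to L2.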
L2 caches are large enough to catch the vast majority of L1 misses across all tested GPUs. Earlier, I assumed Nvidia’s RTX 4090 had enough L2 bandwidth to handle most workloads, so Nvidia’s simpler two-level cache hierarchy was justified. This shader is an exception, and L2 bandwidth limits prevent Nvidia’s much larger RTX 4090 from beating the RX 7900 XTX.
In AMD’s favor, they have a very high bandwidth L2 cache. As the first multi-megabyte cache level, the L2 plays a very significant role, typically catching the vast majority of L0/L1 miss traffic. Nvidia’s GPUs become L2 bandwidth bound in the third longest shader, which helps explain why AMD’s 7900 XTX gets as close as it does to Nvidia’s much larger flagship. AMD’s win there is a small one, but seeing the much smaller 7900 XTX pull ahead of the RTX 4090 at all defies expectations. AMD’s cache design pays off there.
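The "smaller GPU wins" outcome is just roofline arithmetic: when a shader demands more L2 traffic per operation than the cache can feed, the sustainable rate is set by L2 bandwidth rather than peak compute. All numbers below are invented to show the shape of the argument, not measured figures for either card:

```python
def sustainable_rate(flops_per_l2_byte, peak_flops, l2_bytes_per_sec):
    """Roofline: the lesser of peak compute and what L2 bandwidth can feed."""
    return min(flops_per_l2_byte * l2_bytes_per_sec, peak_flops)

# Invented kernel needing 1 byte of L2 traffic per 2 FLOPs:
big   = sustainable_rate(2.0, 80e12, 5e12)  # big GPU, slower L2 -> 10 TFLOPS
small = sustainable_rate(2.0, 40e12, 7e12)  # small GPU, faster L2 -> 14 TFLOPS
print(big, small)
```

Under these made-up numbers, the GPU with half the peak compute but more L2 bandwidth sustains the higher rate on this kernel.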
Final Words
There’s really nothing wrong with Nvidia’s performance in this game, whatever some comments around the internet might suggest. Lower utilization is by design in Nvidia’s architecture: Nvidia SMs have smaller register files and can keep less work in flight, so they naturally have a harder time keeping their execution units fed. Cutting register file capacity and scheduler sizes lets Nvidia shrink each SM and implement more of them. Nvidia’s design comes out on top with kernels that don’t need many vector registers and enjoy high L1 cache hitrates.
There’s no single explanation for RDNA 3’s relative overperformance in Starfield. Higher occupancy and higher L2 bandwidth both play a role, as does RDNA 3’s higher frontend clock.
Going forward, AMD will need more compute throughput if they want to contend for the top spot.