• Some users have recently had their accounts hijacked. It seems that the now defunct EVGA forums might have compromised your password there and seems many are using the same PW here. We would suggest you UPDATE YOUR PASSWORD and TURN ON 2FA for your account here to further secure it. None of the compromised accounts had 2FA turned on.
    Once you have enabled 2FA, your account will be updated soon to show a badge, letting other members know that you use 2FA to protect your account. This should be beneficial for everyone that uses FSFT.

HipKittens unleashes a tile programming based abstraction layer for AMD GPUs

Marees

Supreme [H]ardness
Joined
Sep 28, 2018
Messages
4,587
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly.

To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specificlanguages like ThunderKittens(TK) to simplify high performance AI kernel development on NVIDIA hardware.


We explore the extentto which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers —are NVIDIA-specific or general.

We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD.

We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines.

Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by 1.2 − 2.4× (e.g., d = 64 attention, GQA backwards, memory-bound kernels).

These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors.

HipKittens is released at: https://github.com/HazyResearch/HipKittens.

https://x.com/AnushElangovan/status/1988393252555493739?s=20

https://x.com/simran_s_arora/status/1988320532585214231?s=20
 
In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks this year like CuTe DSL in Sept 2025, Tiny Grad “tinykittens”, TileLang in April 2025 and Gluon in June 2025. It’s been fun to see TK used in companies like Together AI, Jump Trading, and Cursor and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no tma, no mbarriers, no register reallocation, smaller shared memory, chiplet instead of monolithic, etc.) and we weren't sure where performance would end up nor how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels.


The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.



We find:
  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.
Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable.


https://hazyresearch.stanford.edu/blog/2025-11-09-hk
 
Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs. Producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MOE models, and high-performance GEMMs —as well as kernel DSLs such as ThunderKittens LSCF and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per thread block output tile size lowers arithmetic intensity. 2

# P / # CMFMA ShapeOutputTFLOPS
HK 4 / 816×16×32128×256893
HK 4 / 1216×16×32192×2561278
HK 0 / 816×16×32192×2561281
HK 0 / 816×16×32256×2561605
TK256×256×16256×2561538
CUTLASS256×256×16256×2561570
Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPs (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tma, producer consumer, TMA, mbarriers, large shared memory for deep multi-stage pipelining etc. But… AMD has a 2x larger register file and AMD’s smaller tensor core shapes (e.g., 16×16×32 ) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization - a strategy that works well on NVIDIA GPUs - did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy AMD GPUs, while using tile programming primitives (no raw assembly)!

  1. 8-wave ping-pong: We assign two waves per SIMD and at any given time, one is executing a cluster of memory instructions while the other wave executes a cluster of compute instructions. The waves swap at the end of cluster execution. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once!
  2. 4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.
These two patterns tradeoff programmability and performance, where 8-wave and its large tile primitives lead to compact code and 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave further outperforms by 2.3x


https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr
 
Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs. Producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MOE models, and high-performance GEMMs —as well as kernel DSLs such as ThunderKittens LSCF and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per thread block output tile size lowers arithmetic intensity. 2

# P / # CMFMA ShapeOutputTFLOPS
HK 4 / 816×16×32128×256893
HK 4 / 1216×16×32192×2561278
HK 0 / 816×16×32192×2561281
HK 0 / 816×16×32256×2561605
TK256×256×16256×2561538
CUTLASS256×256×16256×2561570
Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPs (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tma, producer consumer, TMA, mbarriers, large shared memory for deep multi-stage pipelining etc. But… AMD has a 2x larger register file and AMD’s smaller tensor core shapes (e.g., 16×16×32 ) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization - a strategy that works well on NVIDIA GPUs - did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy AMD GPUs, while using tile programming primitives (no raw assembly)!

  1. 8-wave ping-pong: We assign two waves per SIMD and at any given time, one is executing a cluster of memory instructions while the other wave executes a cluster of compute instructions. The waves swap at the end of cluster execution. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once!
  2. 4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.
These two patterns tradeoff programmability and performance, where 8-wave and its large tile primitives lead to compact code and 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave further outperforms by 2.3x


https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr


Links: Arxiv | Code


  1. We believe that the HIPCC register scheduling is one of the most important areas for improvement in AMD's kernel software stack.
  2. We hope these findings lead to hardware changes that support wave specialization or guide AMD kernel development; for instance, Mojo currently provides a warp-specialized matmul kernel as of 11/06/2025 even though AMD CDNA doesn’t have register reallocation.
 
Comments by a blogger on ThunderKittens for nvidia

https://alanwu.xyz/posts/thunderkittens/


here is my 90% confident guess of how TK works on a high level.
  • It does tiling and picks best memory layouts to utilize tensor cores, the fastest parts on an GPU.
  • It gives program templates (C++ term for generics) for async computations. Devs just need to write a few boilerplate functions to coordinate async execution of computations.
  • It handles scheduling. It tries to reuse L2 cache and reduce pipeline idle time.

Why you should use TK​

TK aims to put common optimizations into a CUDA library to make writing near-optimal CUDA a lot easier. It has a PyTorch-like API, so ML folks without experience in CUDA can pick it up fast. Triton has a similar goal, but the TK authors claim that optimal performance isn’t possible with Triton.

Moving from Triton to TK provides big performance gains. TK is significantly simpler to program than CUTLASS.


https://hazyresearch.stanford.edu/blog/2024-05-12-quick-tk
 
To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens(TK) to simplify high performance AI kernel development on NVIDIA hardware.


We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers —are NVIDIA-specific or general.
For GPU coding nerds who are interested

Tile based GPU programming is now mainstream

https://x.com/nvidianewsroom/status/1996976316453679381?s=20



G7atv3fWoAEYV1p.jpeg
 
So newbie question. Is this a win for AMD itself? Third parties?

Does this improve the competitiveness of AMD vs. NVidia GPUs?
 
So newbie question. Is this a win for AMD itself? Third parties?

Does this improve the competitiveness of AMD vs. NVidia GPUs?

Nvidia is always a step ahead of AMD in terms of hardware

In terms of software I think ThunderKittens/HipKittens brings parity across AMD & Nvidia. You don't need CUDA or ROCM

You would still need a few tweaks because the hardware changes by each gen. What works in one gen may not be optimal in next gen etc.
 
Back
Top