HipKittens unleashes a tile programming based abstraction layer for AMD GPUs

Marees · Nov 11, 2025

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly.

To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specificlanguages like ThunderKittens(TK) to simplify high performance AI kernel development on NVIDIA hardware.

We explore the extentto which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers —are NVIDIA-specific or general.

We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD.

We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines.

Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by 1.2 − 2.4× (e.g., d = 64 attention, GQA backwards, memory-bound kernels).

These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors.

HipKittens is released at: https://github.com/HazyResearch/HipKittens.

https://x.com/AnushElangovan/status/1988393252555493739?s=20

https://x.com/simran_s_arora/status/1988320532585214231?s=20

Marees · Nov 11, 2025

In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks this year like CuTe DSL in Sept 2025, Tiny Grad “tinykittens”, TileLang in April 2025 and Gluon in June 2025. It’s been fun to see TK used in companies like Together AI, Jump Trading, and Cursor and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no tma, no mbarriers, no register reallocation, smaller shared memory, chiplet instead of monolithic, etc.) and we weren't sure where performance would end up nor how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels.

The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.

We find:

The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.

Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable.

https://hazyresearch.stanford.edu/blog/2025-11-09-hk

Marees · Nov 11, 2025

Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs. Producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MOE models, and high-performance GEMMs —as well as kernel DSLs such as ThunderKittens LSCF and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per thread block output tile size lowers arithmetic intensity. 2

# P / # C	MFMA Shape	Output	TFLOPS
HK 4 / 8	16×16×32	128×256	893
HK 4 / 12	16×16×32	192×256	1278
HK 0 / 8	16×16×32	192×256	1281
HK 0 / 8	16×16×32	256×256	1605
TK	256×256×16	256×256	1538
CUTLASS	256×256×16	256×256	1570

Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPs (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tma, producer consumer, TMA, mbarriers, large shared memory for deep multi-stage pipelining etc. But… AMD has a 2x larger register file and AMD’s smaller tensor core shapes (e.g., 16×16×32 ) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization - a strategy that works well on NVIDIA GPUs - did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy AMD GPUs, while using tile programming primitives (no raw assembly)!

8-wave ping-pong: We assign two waves per SIMD and at any given time, one is executing a cluster of memory instructions while the other wave executes a cluster of compute instructions. The waves swap at the end of cluster execution. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once!
4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.

These two patterns tradeoff programmability and performance, where 8-wave and its large tile primitives lead to compact code and 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave further outperforms by 2.3x

https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr

Marees · Nov 11, 2025

Marees said:
Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs. Producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MOE models, and high-performance GEMMs —as well as kernel DSLs such as ThunderKittens LSCF and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per thread block output tile size lowers arithmetic intensity. 2

# P / # C MFMA Shape Output TFLOPS
HK 4 / 8 16×16×32 128×256 893
HK 4 / 12 16×16×32 192×256 1278
HK 0 / 8 16×16×32 192×256 1281
HK 0 / 8 16×16×32 256×256 1605
TK 256×256×16 256×256 1538
CUTLASS 256×256×16 256×256 1570

Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPs (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tma, producer consumer, TMA, mbarriers, large shared memory for deep multi-stage pipelining etc. But… AMD has a 2x larger register file and AMD’s smaller tensor core shapes (e.g., 16×16×32 ) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization - a strategy that works well on NVIDIA GPUs - did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy AMD GPUs, while using tile programming primitives (no raw assembly)!

8-wave ping-pong: We assign two waves per SIMD and at any given time, one is executing a cluster of memory instructions while the other wave executes a cluster of compute instructions. The waves swap at the end of cluster execution. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once!

4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.

These two patterns tradeoff programmability and performance, where 8-wave and its large tile primitives lead to compact code and 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave further outperforms by 2.3x

https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr

Links: Arxiv | Code

We believe that the HIPCC register scheduling is one of the most important areas for improvement in AMD's kernel software stack.↩
We hope these findings lead to hardware changes that support wave specialization or guide AMD kernel development; for instance, Mojo currently provides a warp-specialized matmul kernel as of 11/06/2025 even though AMD CDNA doesn’t have register reallocation.

Marees · Nov 11, 2025

Comments by a blogger on ThunderKittens for nvidia

https://alanwu.xyz/posts/thunderkittens/

here is my 90% confident guess of how TK works on a high level.

It does tiling and picks best memory layouts to utilize tensor cores, the fastest parts on an GPU.
It gives program templates (C++ term for generics) for async computations. Devs just need to write a few boilerplate functions to coordinate async execution of computations.
It handles scheduling. It tries to reuse L2 cache and reduce pipeline idle time.

Why you should use TK

TK aims to put common optimizations into a CUDA library to make writing near-optimal CUDA a lot easier. It has a PyTorch-like API, so ML folks without experience in CUDA can pick it up fast. Triton has a similar goal, but the TK authors claim that optimal performance isn’t possible with Triton.

Moving from Triton to TK provides big performance gains. TK is significantly simpler to program than CUTLASS.

https://hazyresearch.stanford.edu/blog/2024-05-12-quick-tk

Marees · Dec 5, 2025

Marees said:
To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens(TK) to simplify high performance AI kernel development on NVIDIA hardware.

We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers —are NVIDIA-specific or general.

For GPU coding nerds who are interested

Tile based GPU programming is now mainstream

https://x.com/nvidianewsroom/status/1996976316453679381?s=20

Marees · Dec 6, 2025

Marees said:
For GPU coding nerds who are interested

Tile based GPU programming is now mainstream

https://x.com/nvidianewsroom/status/1996976316453679381?s=20

View attachment 770928

Apparently the latest versions of ThunderKittens is known as parallel kittens

https://x.com/stuart_sul/status/1990482396232466615?s=20

philb2 · Dec 6, 2025

So newbie question. Is this a win for AMD itself? Third parties?

Does this improve the competitiveness of AMD vs. NVidia GPUs?

Marees · Dec 6, 2025

philb2 said:
So newbie question. Is this a win for AMD itself? Third parties?

Does this improve the competitiveness of AMD vs. NVidia GPUs?

Nvidia is always a step ahead of AMD in terms of hardware

In terms of software I think ThunderKittens/HipKittens brings parity across AMD & Nvidia. You don't need CUDA or ROCM

You would still need a few tweaks because the hardware changes by each gen. What works in one gen may not be optimal in next gen etc.

HipKittens unleashes a tile programming based abstraction layer for AMD GPUs

Marees

Supreme [H]ardness

Marees

Supreme [H]ardness

Marees

Supreme [H]ardness

Marees

Supreme [H]ardness

Marees

Supreme [H]ardness

Why you should use TK

Marees

Supreme [H]ardness

Marees

Supreme [H]ardness

philb2

2[H]4U

Marees

Supreme [H]ardness

HipKittens unleashes a tile programming based abstraction layer for AMD GPUs

Supreme [H]ardness

Supreme [H]ardness

Supreme [H]ardness

Supreme [H]ardness

Supreme [H]ardness

Why you should use TK​

Supreme [H]ardness

Supreme [H]ardness

2[H]4U

Supreme [H]ardness

Why you should use TK