As 2024 continues on, because time never stops, AMD has been working on their upcoming RDNA 4 architecture. Part of that work involves supporting open source projects like LLVM. Done right, merging these changes early ensures RDNA 4 is well supported at launch. And because components like LLVM are open source, we can look through those changes for a preview of what AMD is changing over RDNA 3.
LLVM is particularly interesting because it’s a compiler project, used to compile source code into machine code for various architectures. A compiler has to be told how an ISA works in order to generate working code, so AMD has to add RDNA 4’s ISA changes to LLVM.
Terminology
AMD refers to their GPUs in several different ways. RDNA 3 and 4 refer to the architecture, but individual implementations of the architecture get a “gfx{number}” moniker. For example, the RX 7900 XTX target in LLVM is gfx1100, and the RX 6900 XT is gfx1030. For that reason, RDNA 3 and RDNA 2 are referred to as GFX11 and GFX10 respectively. In LLVM, RDNA 4 is referred to as GFX12.
LLVM code so far references gfx1200 and gfx1201 implementations of RDNA 4, which likely correspond to two different dies. AMD has a habit of giving different implementations of an architecture distinct compiler targets. For example, the RX 6900 XT is targeted with gfx1030, while the RDNA 2 integrated GPU in Zen 4 desktop CPUs is gfx1036.
More Explicit Barriers
AMD GPUs resolve memory dependencies with explicit waits. For example, s_waitcnt vmcnt(0) waits for all outstanding vector memory loads to finish, ensuring their results are available to subsequent instructions. Up to RDNA 3, AMD GPUs used several special registers, each tracking the number of outstanding accesses in a specific category.
| Register | Description | Size on RDNA 3 | Comments |
|---|---|---|---|
| vmcnt | Counts pending vector memory loads, including texture sampling and raytracing accesses | 6 bits, so up to 64 outstanding vector memory loads can be tracked | You’ll see this waited on a lot |
| lgkmcnt | Counts pending LDS and scalar loads | 6 bits, so up to 64 outstanding loads in this category | You’ll also see this waited on a lot |
| vscnt | Counts pending vector stores | 6 bits, 64 outstanding stores | Way less common than the above, but it occasionally pops up |
| expcnt | Counts pending exports, where a shader sends outputs to a fixed function unit. For example, a pixel shader may export pixel colors to the ROPs | 3 bits, 8 outstanding exports | Rarely waited on, but it does happen |
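To illustrate the scheme above, here’s a minimal hand-written RDNA 3-style sequence. This is my sketch rather than compiler output, and the register numbers are arbitrary:

```asm
global_load_b32 v0, v[2:3], off    ; vector load, vmcnt: 1 pending
global_load_b32 v1, v[4:5], off    ; second load, vmcnt: 2 pending
s_waitcnt vmcnt(1)                 ; wait until at most 1 load is pending;
                                   ; vector memory results return in order,
                                   ; so v0 is now safe to read
v_mul_f32 v0, v0, v0               ; do work with the first result
s_waitcnt vmcnt(0)                 ; now wait for the second load too
v_add_f32 v0, v0, v1
```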
RDNA 4 takes the categories above and breaks them down into more specific ones. With this move, AMD overhauls a memory dependency handling scheme that had been in place since the original GCN architecture.
| Register | Description | Size on RDNA 4 | Previously covered by |
|---|---|---|---|
| loadcnt | Pending vector memory loads | 6 bits, 64 pending loads | vmcnt |
| samplecnt | Image sampling (texture) loads | 6 bits, 64 pending loads | vmcnt |
| bvhcnt | Bounding volume hierarchy (raytracing data structure) loads | 3 bits, 8 pending loads | vmcnt |
| kmcnt | Scalar loads | 5 bits, 32 pending loads | lgkmcnt |
| dscnt | Scratchpad (local data share) loads | 6 bits, 64 pending loads | lgkmcnt |
| storecnt | Vector stores | 6 bits, 64 pending stores | vscnt |
Shader programs on RDNA 4 will be able to wait on memory accesses with finer granularity. I feel like this change was made to clean up the ISA rather than to significantly improve performance. GPUs use in-order execution and have high cache (and memory) latencies compared to CPUs, so individual threads will stall quite often on memory latency. Perhaps this change lets RDNA 4 threads extract a bit more instruction level parallelism by waiting on false dependencies less often.
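As a sketch of how that could look, here’s a hypothetical RDNA 4-style sequence. The s_wait_loadcnt / s_wait_samplecnt mnemonics and the image_sample operands reflect my reading of the LLVM patches, so treat the exact spellings as assumptions:

```asm
image_sample v[0:3], v[10:11], s[0:7], s[16:19] dmask:0xf dim:SQ_RSRC_IMG_2D
                                   ; texture sample, tracked by samplecnt
global_load_b32 v4, v[12:13], off  ; plain vector load, tracked by loadcnt
s_wait_loadcnt 0x0                 ; on RDNA 3, waiting on vmcnt here would
                                   ; have stalled on the texture sample too
v_mul_f32 v4, v4, v4               ; use the load while the sample is in flight
s_wait_samplecnt 0x0               ; wait for the sample only when it's needed
v_add_f32 v0, v0, v4
```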
Nvidia GPUs since Maxwell use a different scheme, where six barriers can be flexibly assigned to wait on any access. It’s even more flexible than RDNA 4’s scheme because multiple barriers can be assigned to different accesses of the same category. However, having only six barriers could itself be a constraint for Nvidia.
More Flexible Coherency Handling
Like CPUs, GPUs have to solve the problem where different cores have private caches but may need to make writes visible to each other. Up to RDNA 3, memory access instructions had a GLC bit that could be set to make them globally coherent. If the GLC bit is set for a load, it will intentionally miss in RDNA’s L0 and L1 caches, and go straight to L2. That’s because each L0 or L1 cache instance only services a Compute Unit or Shader Engine, respectively. By reading from L2, the instruction can be sure that it sees a write done by a thread running in another Shader Engine.
Similarly, the SLC (System Level Coherent) and DLC (Device Level Coherent) bits control L2 and Infinity Cache behavior respectively. Besides skipping cache levels, the GLC/SLC/DLC bits also provide non-temporal hints to RDNA’s caches. Non-temporal means the program is unlikely to reuse the data, so hardware should prefer to not keep it in cache. Such hints don’t necessarily evict the accessed data from cache. For example, RDNA’s L2 cache handles hits from streaming (non-temporal) accesses by leaving the line in cache, but not updating its LRU bits. Thus the line is more likely to be evicted in the future to make room for new data, but isn’t kicked out immediately.
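For example, an RDNA 3-style shader might use these bits as follows (a minimal sketch, not compiler output; register numbers are arbitrary):

```asm
; Load a value another workgroup may have written: GLC skips L0/L1,
; so the read comes from the globally visible L2
global_load_b32 v0, v[2:3], off glc
s_waitcnt vmcnt(0)

; Streaming store we don't expect to reuse: SLC marks it non-temporal
; at L2, so it won't displace hotter lines via LRU updates
global_store_b32 v[4:5], v1, off slc
```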
RDNA 4 rearranges the cache policy bits to control the Shader Engine’s L1 cache separately, and splits out non-temporal hints as well. Instead of using three bits (GLC/SLC/DLC), RDNA 4 controls cache behavior with five bits: three provide temporal hints, and two specify the scope.
| RDNA 4 Scope (2 bits) | RDNA 3 Equivalent Control Bit | Comments |
|---|---|---|
| CU (Compute Unit): 0 | GLC | Similar to RDNA 3. Controls L0 caches, which are private to each CU |
| SE (Shader Engine): 1 | GLC | RDNA 3 shader programs couldn’t control the SE’s 256 KB L1 separately. Setting the GLC bit would bypass both L0 and L1 |
| DEV (Device): 2 | DLC? | Controls the Infinity Cache (MALL) |
| SYS (System): 3 | SLC? | Controls the L2 cache |
RDNA 4 could use its finer grained cache control to pass data between threads or kernel invocations through the L1 rather than doing so at L2. The Shader Engine’s L1 is much faster than the GPU-wide L2 on all RDNA generations, so taking advantage of that L1 cache whenever possible is a worthy goal. If setting the scope to SE lets RDNA 4 avoid L1 invalidations, AMD could gain some performance from higher L1 hitrates. But it could be hard to pull off as well, because any dependent thread will have to be launched on the same SE.
Temporal hints are given on RDNA 4 using three separate bits, rather than being implied via the SLC/DLC/GLC bits as before.
| Temporal Hint (3 bits) | Comment |
|---|---|
| 0: TH_RT | Regular |
| 1: TH_NT | Non-temporal |
| 2: TH_HT | High-temporal: probably prefer to keep it in cache? |
| 3: TH_LT | Last use. Could suggest evicting the line if present in cache? |
| 3: TH_RT_WB | “Regular (CU, SE), high-temporal with write-back (MALL)”: not sure why this uses the same bit pattern as TH_LT. Could apply to different access types |
| 4: TH_NT_RT | “Non-temporal (CU, SE), regular (MALL)” |
| 5: TH_RT_NT | “Regular (CU, SE), non-temporal (MALL)” |
| 6: TH_NT_HT | “Non-temporal (CU, SE), high-temporal (MALL)” |
| 7: TH_NT_WB | “Non-temporal (CU, SE), high-temporal with write-back (MALL)” |
RDNA 4 introduces high-temporal options too. A high-temporal hint likely signifies that code expects to reuse the accessed data soon, so the cache should prefer to keep it stored. Just as with non-temporal hints, the hardware doesn’t have to act in a well-defined way when it sees a high-temporal hint. For example, it could artificially keep a high-temporal line in the most recently used LRU position for a number of accesses regardless of whether it’s hit. Or, it could do nothing at all.
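Putting the scope and temporal hint fields together, an RDNA 4 access might look something like the lines below. The th:/scope: modifier spellings are taken from my reading of the LLVM patches, so consider the exact syntax an assumption:

```asm
; Read data that any CU on the device may have written, and mark it
; non-temporal since we don't expect to touch it again
global_load_b32 v0, v[2:3], off scope:SCOPE_DEV th:TH_NT

; Hand a result to a consumer expected to run on the same Shader Engine,
; letting it stay in the SE's L1 instead of being forced out to L2
global_store_b32 v[4:5], v1, off scope:SCOPE_SE
```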
Better Tensors
AI hype is real these days. Machine learning involves a lot of matrix multiplication, and people have found that inference can be done with lower precision data types while maintaining acceptable accuracy. GPUs have jumped on the hype train with specialized matrix multiplication instructions. RDNA 3’s WMMA (Wave Matrix Multiply Accumulate) instructions operate on matrices stored in registers across a wave, much like Nvidia’s equivalent instructions.
| Instruction | Multiplied Matrices (A and B) Format | Result/Accumulate Matrix Format |
|---|---|---|
| V_WMMA_F32_16X16X16_F16 | FP16 | FP32 |
| V_WMMA_F32_16X16X16_BF16 | BF16 | FP32 |
| V_WMMA_F16_16X16X16_F16 | FP16 | FP16 |
| V_WMMA_BF16_16X16X16_BF16 | BF16 | BF16 |
| V_WMMA_I32_16X16X16_IU8 | INT8 | INT32 |
| V_WMMA_I32_16X16X16_IU4 | INT4 | INT32 |
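In assembly, a wave32 FP16 WMMA from the table above looks something like the line below. The register layout matches my read of AMD’s GPUOpen WMMA example, so treat the exact allocation as illustrative:

```asm
; D (v[0:7], FP32 accumulator) = A (v[8:15], FP16) * B (v[16:23], FP16) + C
; Each 16x16 tile lives in registers spread across the wave's lanes
v_wmma_f32_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]
```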
RDNA 4 carries these instructions forward with efficiency improvements, and adds instructions to support 8-bit floating point formats. Because matrix multiplication is not commutative, separate instructions cover the FP8×BF8 and BF8×FP8 combinations. AMD has also added an instruction where B is a 16×32 INT4 matrix, instead of 16×16 as in the other instructions.
| Instruction | Multiplied Matrices (A and B) Format | Result/Accumulate Matrix Format |
|---|---|---|
| V_WMMA_F32_16x16x16_FP8_BF8 | A: FP8, B: BF8 | FP32 |
| V_WMMA_F32_16x16x16_BF8_FP8 | A: BF8, B: FP8 | FP32 |
| V_WMMA_F32_16x16x16_BF8_BF8 | BF8 | FP32 |
| V_WMMA_F32_16x16x16_FP8_FP8 | FP8 | FP32 |
| V_WMMA_I32_16X16X32_IU4 | A: 16×16 INT4, B: 16×32 INT4 | 16×32 INT32 |
Machine learning has been trending towards lower precision data types to make more efficient use of memory capacity and bandwidth. RDNA 4’s support for FP8 and BF8 shows AMD doesn’t want to be left out as new data formats are introduced.
Sparsity
Moving to lower precision data formats is one way to scale matrix multiplication performance beyond what process node and memory bandwidth improvements alone would allow. Specialized handling for sparse matrices is another way to dramatically improve performance. Matrices with a lot of zero elements are known as sparse matrices. Multiplying sparse matrices can involve a lot less math because any multiplication involving zero can be skipped. Storage and bandwidth consumption can be reduced too because the matrix can be stored in a compressed format.
RDNA 4 introduces new SWMMAC (Sparse Wave Matrix Multiply Accumulate) instructions to take advantage of sparsity. SWMMAC does the same C += A * B operation, but A is a sparse matrix stored at half of B’s size. A sparsity index, passed as a fourth parameter, helps the hardware interpret A as a full-size matrix. My interpretation is that the dimensions in the instruction mnemonic refer to stored matrix sizes. Thus a 16x16x32 SWMMAC instruction actually multiplies a 32×16 sparse matrix with a 16×32 dense one, producing a 32×32 result.
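In assembly, that fourth operand would show up something like this. The register counts are my guess for wave32, with A stored at half of B’s size, so treat the whole operand layout as an assumption:

```asm
; D (v[0:7]) += A (v[8:11], sparse FP16, stored at half size)
;             * B (v[12:19], dense FP16)
; v20 holds the sparsity index that maps stored A elements to their
; logical positions in the full-size matrix
v_swmmac_f32_16x16x32_f16 v[0:7], v[8:11], v[12:19], v20
```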
| Instruction | Multiplied Matrices (A and B) Format | Result/Accumulate Format |
|---|---|---|
| V_SWMMAC_F32_16X16X32_F16 | FP16. A: 16×16 stored / 32×16 actual, B: 16×32 | 32×32 FP32 |
| V_SWMMAC_F32_16X16X32_BF16 | BF16 | FP32 |
| V_SWMMAC_F16_16X16X32_F16 | FP16 | FP16 |
| V_SWMMAC_BF16_16X16X32_BF16 | BF16 | BF16 |
| V_SWMMAC_I32_16X16X32_IU8 | INT8 | INT32 |
| V_SWMMAC_I32_16X16X32_IU4 | INT4 | INT32 |
| V_SWMMAC_I32_16X16X64_IU4 | INT4. A: 16×16 stored / 32×16 actual, B: 16×64 | 32×64 INT32 |
| V_SWMMAC_F32_16X16X32_FP8_FP8 | FP8 | FP32 |
| V_SWMMAC_F32_16X16X32_FP8_BF8 | A: FP8, B: BF8 | FP32 |
| V_SWMMAC_F32_16X16X32_BF8_FP8 | A: BF8, B: FP8 | FP32 |
| V_SWMMAC_F32_16X16X32_BF8_BF8 | BF8 | FP32 |
If I guessed right, SWMMAC instructions behave like their WMMA siblings, but produce a result matrix twice as long in each dimension.
Of course there’s no way to infer performance changes from looking at LLVM code, but I wonder if AMD will invest in higher per-SIMD matrix multiplication performance in RDNA 4. RDNA 3’s WMMA instructions provide the same theoretical throughput as using dot product instructions.
> “[WMMA] instructions work over multiple cycles to compute the result matrix and internally use the DOT instructions”
>
> “RDNA 3” Instruction Set Architecture Reference Guide
Since SWMMAC takes a sparse matrix where only half the elements are stored, perhaps RDNA 4 can get a 2x performance increase from sparsity.
Software Prefetch
GPU programs typically enjoy high instruction cache hitrates because they tend to have smaller code footprints than CPU programs. However, GPU programs suffer more from instruction cache warmup time because they tend to execute for very short durations. RDNA 3 mitigates this by optionally prefetching up to 64 × 128B cachelines starting from a kernel’s entry point. RDNA 4 increases the possible initial prefetch distance to 256 × 128B cachelines, so code size covered by the initial prefetch goes from 8 KB to 32 KB.
Once a kernel starts, the Compute Unit or Workgroup Processor frontend continues to prefetch ahead in the instruction stream. Prefetch distance is controlled by an instruction in the shader program. Up to RDNA 3, AMD GPUs could be told to prefetch up to three 64B cachelines ahead of the currently executing instruction.
As far as I know, prefetching only applies to the instruction side. There’s no data-side prefetcher, so RDNA 3 SIMDs rely purely on thread and instruction level parallelism to hide memory latency.
RDNA 4 adds new instructions that let software more flexibly direct prefetches, rather than just going in a straight line. For example, s_prefetch_inst could point instruction prefetch to the target of a probably taken branch. If my interpretation is correct, RDNA 4 could be better at handling large shader programs, with instruction prefetch used to reduce the impact of instruction cache misses.
| Instruction | My Guess on What it Does | Comments |
|---|---|---|
| s_prefetch_inst | Prefetch instruction(s) starting at the specified absolute address | Example given: s_prefetch_inst s[14:15], 0x7fffff, m0, 7. Such a large immediate is probably a PC (program counter) value |
| s_prefetch_inst_pc_rel | Prefetch instruction(s) starting at an offset from the current instruction | Add the immediate to the program counter and start prefetching at that location |
| s_prefetch_data | Prefetch data from the specified address? | |
| s_prefetch_data_pc_rel | Prefetch data from the specified offset from the program counter? | CPU code often uses PC-relative addressing for constants. Maybe AMD GPU binaries do the same? |
| s_buffer_prefetch_data | Prefetch data from a specified buffer | Shader programs can have buffers bound as a resource. This instruction allows prefetching data from those buffers |
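As a sketch of the branch-target idea above, a shader could warm the instruction cache along a likely-taken path. The operand format mirrors the s_prefetch_inst example in the table, and the exact encoding is my assumption:

```asm
s_prefetch_inst_pc_rel 0x100, m0, 3  ; warm the I-cache at PC + 0x100,
                                     ; where the likely-taken path lives
s_cmp_eq_u32 s0, 0
s_cbranch_scc1 likely_taken_path     ; branch lands in prefetched code
```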
On the data side, RDNA 4 appears to introduce software prefetch instructions as well. GPUs typically don’t do any data-side prefetching, and instead use a combination of wide accesses and thread level parallelism to achieve high bandwidth. In contrast, CPUs often run programs with little explicit parallelism, and can benefit greatly if prefetching reduces program-visible memory latency.
But maximizing performance isn’t a black and white thing, and GPU code can be latency limited even with a lot of threads in flight. In those cases, prefetch could be a promising strategy as long as there’s bandwidth to spare. One hypothetical example might be prefetching all the children of nodes several levels ahead in a raytracing kernel.
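A hypothetical version of that raytracing example might look like this. This is pure speculation on usage, with the operand format again mirroring the table above:

```asm
; s[8:9] holds a pointer to child nodes a few levels down the BVH
s_prefetch_data s[8:9], 0x0, m0, 2   ; start pulling two cachelines of
                                     ; child node data toward the core
; ...ray/box intersection math for the current node proceeds here; later
; loads of those child nodes should hit cache if the prefetch paid off
```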
Sub 32-bit Scalar Loads
Graphics workloads typically use 32-bit data types, but compute workloads use all sorts of data widths. AMD’s GPUs have had flexible load widths on the vector side for a while, but the scalar path was restricted to loads of 32 bits or larger. RDNA 4 changes this by adding 8-bit and 16-bit scalar load instructions.
| Instruction | Comment |
|---|---|
| s_load_u16 | Loads a 16-bit unsigned integer from memory into a scalar register |
| s_load_u8 | Loads an 8-bit unsigned integer from memory into a scalar register |
| s_load_i16 | Loads a 16-bit signed integer from memory into a scalar register |
| s_load_i8 | Loads an 8-bit signed integer from memory into a scalar register |
This change is mostly about making the ISA nicer for compute programs that use 8 or 16-bit data types. On prior AMD GPU generations, you could achieve similar results by loading a 32-bit value and masking off the high bits (or sign-extending, for signed types).
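For example, here’s how an unsigned byte load might compare across generations. This is a sketch assuming the byte sits at the start of an aligned word; unaligned cases would also need a shift:

```asm
; RDNA 3-style workaround: load the containing 32-bit word, then mask
s_load_b32 s0, s[4:5], 0x0
s_waitcnt lgkmcnt(0)
s_and_b32 s0, s0, 0xff        ; keep only the low byte

; RDNA 4 (per the LLVM patches) can express the same thing directly
s_load_u8 s0, s[4:5], 0x0
```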
Final Words
GPU instruction sets are more fluid than CPU ones because developers don’t ship GPU binaries. Rather, GPU drivers compile source (or intermediate) code into GPU binaries on the user’s machine. Thus GPU makers can change the GPU ISA between generations without breaking compatibility, so long as they update their drivers to handle the new ISA.
That freedom also means a GPU’s ISA is often more closely tied to the underlying hardware than a CPU’s ISA. Looking at GPU ISA changes shows how far GPUs have come from the days when they were pure graphics processors. Over the past couple of decades, GPUs gained flexible branching, support for standard IEEE floating point formats, scalar datapaths, and specialized matrix multiplication instructions.
RDNA 4 continues AMD’s GPU ISA evolution. Software prefetch and more flexible scalar loads continue a trend of GPUs becoming more CPU-like as they take on more compute applications. AI gets a nod as well, with FP8 and sparsity support. Better cache controls are great to see too, and match the ISA more closely to RDNA’s multi-level cache hierarchy.
Finally, remember that nothing is final until an RDNA 4 product is released. All the information here is preliminary. Reading code in an unfamiliar project can be hard too, so there’s a chance I made a mistake somewhere. I highly encourage you to go through the LLVM source code yourself. To make that easier, I’ve sprinkled links to the appropriate files on GitHub throughout the article. Have fun!
If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.