HIP compilers#
The HIP compilers, amdclang++ and hipcc, compile HIP source code into programs that run on AMD GPUs. The compilers set up the default libraries and include paths for the HIP and ROCm libraries, along with the required environment variables. For more information, see the ROCm compiler reference.
Compilation workflow#
HIP provides a flexible compilation workflow that supports both offline compilation and runtime or just-in-time (JIT) compilation. Each approach has advantages depending on the use case, target architecture, and performance needs.
Offline compilation is ideal for production environments, where performance is critical and the target GPU architecture is known in advance.
Runtime compilation is useful in development environments or when distributing software that must run on a wide range of hardware without prior knowledge of the GPU. It provides flexibility at the cost of some performance overhead.
Offline compilation#
Offline compilation is performed in two steps: host and device code compilation.
Host-code compilation: On the host side, amdclang++ or hipcc can compile the host code in one step, without other C++ compilers.
Device-code compilation: The compiled device code is embedded into the host object file. Depending on the platform, the device code can be compiled into assembly or binary.
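As a sketch of this two-step flow, the commands below compile a HIP source file into an object with embedded device code and then link it. The file name saxpy.cpp and the gfx90a target are placeholder assumptions, and a ROCm installation with hipcc on the PATH is assumed.

```shell
# Step 1: compile the host code and embed the device code for the
# chosen target into the object file (a fat binary).
hipcc -c --offload-arch=gfx90a saxpy.cpp -o saxpy.o

# Step 2: link the object into a host executable; hipcc adds the HIP
# runtime libraries automatically.
hipcc saxpy.o -o saxpy
```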
For an example of how to compile HIP from the command line, see the SAXPY tutorial.
Runtime compilation#
HIP allows you to compile kernels at runtime using the hiprtc* API. Kernels
are stored as a text string, which is passed to HIPRTC alongside options to
guide the compilation.
For more information, see HIP runtime compiler.
Target GPU architectures (GFX IP)#
Instructions in the AMDGPU instruction set architecture (ISA) are compatible only with certain physical GPU architectures. The versioning system that abstracts hardware details from compilers and software is called the GFX IP version. Each AMD GPU family, such as RDNA or CDNA, defines its own GFX IP identifiers. These identifiers specify which instruction formats, memory models, and compute features are supported by that generation of hardware.
A GFX version is expressed as a short string, for example:
gfx90a — CDNA2 (MI250 series)
gfx942 — CDNA3 (MI300 series)
gfx1100 — RDNA3 (RX 7900 series)
Most GFX IP versions are composed of three numerical fields, which act roughly like a major-minor-subminor versioning system:
The major version corresponds to the architectural family (for example, CDNA3 versus RDNA3).
The minor version encodes core feature updates within that family (for example, new MFMA or LDS capabilities).
The subminor version may specify packaging or stepping differences (for example, MI300A versus MI300X).
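Under that three-field reading, the fields can be peeled off the version string mechanically. The snippet below is a small illustrative sketch (not an official ROCm tool), assuming the common encoding where the final character is the stepping, the character before it is the minor version, and the rest is the major version:

```shell
# Split a GFX IP version string into major/minor/stepping fields.
parse_gfx() {
  v="${1#gfx}"          # drop the "gfx" prefix, e.g. "90a"
  major="${v%??}"       # everything except the last two characters
  rest="${v#"$major"}"  # the last two characters
  minor="${rest%?}"     # first of the two
  stepping="${rest#?}"  # last character (may be a hex digit, e.g. "a")
  echo "$1 -> major=$major minor=$minor stepping=$stepping"
}

parse_gfx gfx90a    # prints: gfx90a -> major=9 minor=0 stepping=a
parse_gfx gfx942    # prints: gfx942 -> major=9 minor=4 stepping=2
parse_gfx gfx1100   # prints: gfx1100 -> major=11 minor=0 stepping=0
```

Note that the stepping field may be a hexadecimal digit, as in gfx90a.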
Each new generation introduces additional hardware features, for example,
mixed-precision MFMA instructions, asynchronous data movement engines, and
updated cache hierarchies, while maintaining broad forward compatibility for
code compiled at the intermediate representation (IR) level. When compiling GPU
programs with amdclang++, the target GFX architecture determines the ISA
and hardware features available to the kernel.
For example, compiling for the MI300X would use:
amdclang++ --offload-arch=gfx942 kernel.cpp -o kernel.out
The compiler uses this flag to emit optimized machine code for that GPU’s compute units, matrix cores, and memory subsystem. The GFX IP acts as a virtual hardware target, decoupling high-level programming models (HIP, OpenMP, OpenCL) from the underlying physical GPU design.
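When a deployment must cover several GPU families, --offload-arch can be passed more than once so that a single binary carries one code object per architecture. The file names below are placeholders:

```shell
# Build a fat binary containing code objects for CDNA2, CDNA3, and
# RDNA3 targets; the HIP runtime selects the matching code object
# when the kernel is launched.
amdclang++ --offload-arch=gfx90a --offload-arch=gfx942 \
           --offload-arch=gfx1100 kernel.cpp -o kernel.out
```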
While AMD does not explicitly name its compatibility model as “onion-layered,” the GFX IP system follows a similar principle: newer architectures generally extend, rather than break, existing instruction sets.
Thus, code compiled for an older architecture (for example, gfx90a) can
often be re-optimized or recompiled for a newer one (gfx942) with minimal
modification. However, compatibility is not guaranteed across major GFX
families, since those may introduce fundamentally new pipelines or execution
semantics.
AMDGPU assembly#
AMDGPU assembly, sometimes called GFX ISA (Instruction Set Architecture), is the low-level assembly format for programs running directly on AMD GPUs. It is the lowest-level human-readable representation of GPU instructions, typically generated by the ROCm compiler toolchain and converted into binary microcode for execution on the device.
Each AMD GPU architecture family, such as RDNA or CDNA, defines its own GFX ISA version. For example:
gfx90a corresponds to CDNA2 (MI250 series)
gfx942 corresponds to CDNA3 (MI300 series)
gfx1100 corresponds to RDNA3 consumer GPUs
Each version defines its unique instruction encodings, supported data types, and pipeline behavior. Compiled GPU kernels must target a specific GFX architecture to ensure instruction compatibility and optimal scheduling.
AMDGPU assembly is rich and expressive, exposing every level of GPU behavior:
scalar operations, vector math, memory access, synchronization, and
matrix-multiply instructions. A few examples of instructions from the
gfx942 architecture include:
v_add_f32 v0, v1, v2 — add 32-bit floats in vector registers
s_mov_b32 s4, 0x3f800000 — move the immediate value (1.0f) into a scalar register
v_mfma_f32_16x16x4f16 v[0:15], v[16:31], v[32:47], v[0:15] — perform a 16×16×4 matrix fused multiply-add
s_barrier — synchronize all wavefronts in the work-group
In this syntax:
v_ instructions operate on vector registers (VGPRs) per thread.
s_ instructions operate on scalar registers (SGPRs) shared by the wavefront.
Specialized matrix instructions (v_mfma_*) invoke the Matrix Core (MFMA) hardware units.
Although AMDGPU assembly can be written by hand, this is uncommon. Developers typically inspect compiler-generated assembly when optimizing kernels, diagnosing register pressure, or tuning memory-access patterns.
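One way to obtain the generated assembly for inspection is to keep the compiler's intermediate files. This is a sketch assuming a ROCm toolchain and a placeholder kernel.cpp:

```shell
# -save-temps keeps intermediate outputs, including the device
# assembly (.s) file for the requested gfx target, which can then
# be read or diffed directly.
hipcc --offload-arch=gfx942 -c -save-temps kernel.cpp
```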
The ROCm disassembler (llvm-objdump) and ROCm profiler (rocprofv3)
allow inspection of generated GFX ISA alongside high-level HIP or OpenMP source
code. Because the instruction semantics and binary encodings are publicly
documented by AMD, the GFX ISA is a fully open, compiler-targetable standard.
This openness allows tool developers, performance engineers, and researchers to reason about GPU behavior at the instruction level, bridging the gap between hardware and high-level kernel code.
AMDGPU intermediate representation#
AMDGPU intermediate representation (AMDGPU IR) is an intermediate
representation for code that runs on AMD GPUs and other parallel processors. It
is one of the key outputs of the ROCm compiler toolchain, produced by
amdclang++ before being translated into architecture-specific GPU assembly
(GFX ISA).
AMD documentation refers to this layer as both a virtual instruction set and a target-specific dialect of LLVM IR. From a programmer’s perspective, AMDGPU IR defines a virtual machine model for parallel thread execution: compilers and optimizers emit this IR with the expectation that it will execute with consistent semantics across multiple generations of AMD hardware, including future architectures not yet released.
This makes AMDGPU IR a “narrow waist” between software and hardware, abstracting the details of the physical compute units, warp schedulers, and memory hierarchies, while still providing explicit control over threads, registers, and memory spaces.
Unlike traditional CPU ISAs, AMDGPU IR is not executed directly. Instead, it is
further compiled (either just‑in‑time or ahead‑of‑time) into GFX ISA, the
device-specific binary that runs on a given GPU architecture (for example,
gfx942 for CDNA3 MI300X or gfx1200 for RDNA4). This multi-stage
compilation allows forward compatibility: kernels compiled for one generation
can often be re-optimized and executed efficiently on newer hardware through
updated ROCm runtimes and drivers. Some example fragments of AMDGPU IR:
; declare usage of 64 vector registers
!amdgpu.num_vgpr = !{i32 64}
; perform fused multiply-add
%f5 = call float @llvm.fma.f32(float %f3, float %f4, float 1.5)
; read thread and workgroup indices
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%wgid = call i32 @llvm.amdgcn.workgroup.id.x()
These instructions represent operations in the virtual machine model:
Registers are virtual VGPRs/SGPRs dynamically mapped to physical hardware registers by the compiler.
Arithmetic and memory intrinsics (llvm.fma, llvm.amdgcn.buffer.load) map one-to-one to GPU instructions.
Built-in functions like llvm.amdgcn.workitem.id.x() access special per-thread state, such as the current thread or block index.
The AMDGPU IR machine model reflects the hardware reality of AMD GPUs: a single instruction stream drives a wavefront of 64 threads that execute in lockstep on the SIMD pipelines, each maintaining its own register state while sharing program flow and local memory. Wavefronts cooperate through the Local Data Share (LDS) and synchronize via barriers, the same abstractions exposed in the high-level ROCm programming model.
Since AMDGPU IR is integrated with the open-source LLVM compiler infrastructure, its syntax and semantics are well-documented and publicly accessible. Developers can inspect, modify, or emit AMDGPU IR directly for performance analysis, research, or custom toolchains.
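For example, the device-side LLVM IR for a kernel can be emitted directly instead of compiling all the way down to GFX ISA. The flags below are a sketch assuming a ROCm clang-based toolchain and a placeholder kernel.cpp:

```shell
# Emit human-readable LLVM IR (.ll) for the device compilation only;
# --cuda-device-only restricts the output to the offload (device) side.
hipcc --offload-arch=gfx90a -S -emit-llvm --cuda-device-only kernel.cpp
```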
ROCm binary utilities#
The ROCm binary utilities are a collection of command-line tools for examining
and manipulating GPU binaries produced by amdclang++ or other ROCm build
tools. These utilities allow developers to inspect, disassemble, and analyze
AMDGPU code objects, the compiled GPU kernels embedded in host executables or
distributed as standalone .hsaco files.
The llvm-objdump utility provides multiple capabilities for analyzing GPU
binaries. With the --offloading flag, it can list and extract information
from the contents of ROCm binaries, including code object metadata, kernel
symbols, target architectures (for example, gfx90a, gfx1100), and
linkage details. It supports both standalone .hsaco files and “fat
binaries” embedded within host executables. With the --triple=amdgcn flag,
it can disassemble GPU kernels into human-readable AMDGPU ISA, allowing
inspection of instruction sequences, register allocation, and control flow.
These capabilities are essential for performance debugging, code verification,
and low-level kernel analysis, for example, when tuning MFMA instructions or checking compiler optimizations.
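As an illustration, the following commands show both modes of use; the file names are placeholders and a ROCm LLVM toolchain is assumed:

```shell
# List the offload code objects embedded in a host executable.
llvm-objdump --offloading ./app

# Disassemble a standalone code object into GFX ISA for gfx90a.
llvm-objdump --disassemble --triple=amdgcn-amd-amdhsa --mcpu=gfx90a kernel.hsaco
```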
Together, these utilities provide developers with insight into how HIP C++ code
is compiled, optimized, and mapped to GPU hardware. They complement profiling
tools like rocprofv3 by exposing the static structure of compiled GPU
binaries.
Static libraries#
amdclang++ supports generating two types of static libraries.
The first type of static library only exports and launches host functions within the same library, not the device functions. This library type offers the ability to link with another compiler, such as gcc. Additionally, it contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib:

amdclang++ hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out

The second type of static library exports device functions to be linked by other code objects, using amdclang++ as the linker. This library type contains relocatable device objects and is generated using ar:

amdclang++ hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
ar rcsD libHipDevice.a hipDevice.o
amdclang++ libHipDevice.a test.cpp -fgpu-rdc -o test.out
Examples of this can be found in rocm-examples under static host libraries or static device libraries.