Third-party tools

Many third-party tools can support your HIP and ROCm development. This section provides a brief overview of some of these tools and how you can use them with AMD GPUs. For detailed, up-to-date information, refer to the official documentation of each tool, as features and compatibility change frequently.

Performance monitoring and profiling

You can use several tools to monitor performance counters and profile your GPU-accelerated applications.

PAPI

The Performance Application Programming Interface (PAPI) provides access to hardware performance counters on CPUs, GPUs, and other system components. PAPI 6.0.0 and later includes components for ROCm that enable monitoring of AMD GPU events through the ROCprofiler library. You can track events such as L1 and L2 cache activity, vector and scalar arithmetic logic unit operations, and memory transactions. The AMD SMI component adds power management support, enabling you to monitor and cap power usage on AMD GPUs.

For more information, see the official PAPI documentation.

HPCToolkit

HPCToolkit from Rice University measures and analyzes GPU-accelerated applications by recording call-path profiles and traces of CPU and GPU activity. It helps you understand how your application uses the GPU by attributing the cost of GPU operations to the calling contexts in which they are invoked. HPCToolkit supports multiple programming models, including HIP, and uses hardware performance counters to measure GPU operations in detail.
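
To make the idea of calling-context attribution concrete, the following is a minimal sketch in plain Python. It does not use HPCToolkit or read its data formats; the sample records, frame names, and the `attribute_inclusive` helper are all hypothetical and exist only to show how the same GPU operation can carry different costs depending on where it was invoked.

```python
from collections import defaultdict

# Hypothetical samples: (calling context, outermost frame first; GPU time in ms).
# The frames and timings are invented for illustration, not measured data.
SAMPLES = [
    (("main", "solve", "hipLaunchKernelGGL"), 4.0),
    (("main", "solve", "hipMemcpy"), 1.5),
    (("main", "postprocess", "hipLaunchKernelGGL"), 2.5),
]


def attribute_inclusive(samples):
    """Add each sample's GPU time to every prefix of its calling context."""
    inclusive = defaultdict(float)
    for path, gpu_ms in samples:
        for depth in range(1, len(path) + 1):
            inclusive[path[:depth]] += gpu_ms
    return dict(inclusive)


if __name__ == "__main__":
    costs = attribute_inclusive(SAMPLES)
    # Print one line per calling context with its inclusive GPU time.
    for path, gpu_ms in sorted(costs.items()):
        print(f"{' > '.join(path):45s} {gpu_ms:6.1f} ms")
```

In this sketch, the kernel launch under `solve` and the one under `postprocess` are reported separately, which is the kind of distinction a flat per-function profile would lose.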

For more information, see the HPCToolkit project website.

TAU Performance System

The TAU Performance System is a parallel performance evaluation toolkit that supports both profiling and tracing. TAU can profile HIP programs using the ROCprofiler and ROCTracer APIs to gather timestamps for kernels executing on the GPU, along with data-transfer information. It supports various parallel programming models, including MPI, OpenMP, and Kokkos, and can generate traces in multiple formats for visualization.

For more information, see the TAU project website.

Tracing and visualization

Tools in this category help you trace application execution and visualize performance data to identify bottlenecks and optimize your code.

Score-P and trace visualization

Score-P is a highly scalable measurement infrastructure for profiling and event-tracing HPC applications. It uses the ROCTracer library to record HIP API functions, memory transfers between host and device, kernel launches, and other runtime behaviors. Vampir is a commercial tool that visualizes Score-P event logs as timelines and statistical charts, helping you detect performance problems that change over your application’s runtime.

For an open-source approach to visualizing Score-P traces, you can use Scalasca and Cube. Scalasca is a performance analysis toolset for HPC applications, and Cube provides a graphical display for presenting performance metrics and timeline views from Score-P-generated OTF2 trace files. These tools are freely available and integrate with Score-P for analyzing MPI, OpenMP, and GPU-accelerated applications.
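
As a rough illustration of what an event trace contains and how a viewer turns it into statistics, the sketch below pairs enter and leave events into per-region call counts and total times. It is plain Python under stated assumptions: the flat `EVENTS` list, the region names, and the `summarize` helper are hypothetical stand-ins, and real Score-P traces are stored in OTF2 and read with OTF2 tooling rather than like this.

```python
from collections import defaultdict

# Hypothetical event records for a single thread or stream:
# (timestamp in microseconds, event kind, region name).
EVENTS = [
    (0, "enter", "hipMemcpy"),
    (40, "leave", "hipMemcpy"),
    (45, "enter", "vector_add"),
    (145, "leave", "vector_add"),
    (150, "enter", "hipMemcpy"),
    (180, "leave", "hipMemcpy"),
]


def summarize(events):
    """Pair enter/leave events and total the calls and time per region."""
    stack = []  # regions that have been entered but not yet left
    totals = defaultdict(lambda: {"calls": 0, "time_us": 0})
    for timestamp, kind, region in sorted(events, key=lambda e: e[0]):
        if kind == "enter":
            stack.append((region, timestamp))
        elif kind == "leave":
            if not stack or stack[-1][0] != region:
                raise ValueError(f"unmatched leave for {region!r}")
            _, start = stack.pop()
            totals[region]["calls"] += 1
            totals[region]["time_us"] += timestamp - start
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if stack:
        raise ValueError("trace ended with regions still open")
    return dict(totals)


if __name__ == "__main__":
    for region, stats in summarize(EVENTS).items():
        print(region, stats["calls"], "calls,", stats["time_us"], "us")
```

A single stack is enough here because the sketch assumes one thread or stream; a real trace carries one event sequence per location.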

For more information, see the Score-P documentation, the Vampir website, the Cube download page, and the Scalasca download page.

Trace Compass and Theia

Trace Compass provides visualization and analysis for several trace formats, including those generated by ROCm applications. It models the system’s state over time, enabling analysis of process threads, GPU compute kernels, and system events. Theia is an open, extensible integrated development environment platform that integrates with Trace Compass via a trace extension, allowing you to analyze traces directly from within your development environment.
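
The idea of modeling state over time can be sketched as follows: state-change events are turned into a per-resource history that can be queried at any timestamp. This is a simplified, hypothetical model in plain Python, not Trace Compass's implementation or API; the `CHANGES` records, the resource name `gpu0`, and both helper functions are assumptions made for illustration.

```python
import bisect

# Hypothetical state-change events: (timestamp, resource, new state).
CHANGES = [
    (0, "gpu0", "idle"),
    (10, "gpu0", "kernel:vector_add"),
    (60, "gpu0", "idle"),
    (75, "gpu0", "copy:host_to_device"),
    (90, "gpu0", "idle"),
]


def build_history(changes):
    """Group state changes by resource, ordered by timestamp."""
    history = {}
    for timestamp, resource, state in sorted(changes):
        times, states = history.setdefault(resource, ([], []))
        times.append(timestamp)
        states.append(state)
    return history


def state_at(history, resource, timestamp):
    """Return the resource's state at `timestamp`, or None before its first event."""
    times, states = history[resource]
    index = bisect.bisect_right(times, timestamp) - 1
    return states[index] if index >= 0 else None


if __name__ == "__main__":
    history = build_history(CHANGES)
    for t in (5, 10, 59, 80, 100):
        print(t, state_at(history, "gpu0", t))
```

Keeping the change timestamps sorted lets each query use a binary search, so looking up the state at an arbitrary time stays cheap even for long traces.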

For more information, see the Trace Compass project and the Theia IDE project.

Debugging tools

Debuggers help you identify and fix errors in your HIP applications running on AMD GPUs.

TotalView

TotalView is a feature-rich debugger that supports HIP applications running on AMD GPUs. It enables you to debug heterogeneous applications that mix processor architectures, with separate address spaces for CPU processes and GPU agents. You can launch, attach to, and detach from processes, display GPU registers and disassembled machine instructions, create breakpoints, and trace code at both source and instruction levels. TotalView provides a unified view of source code and breakpoints across all image files, including dynamically loaded AMD GPU ELF images.

For more information, see the TotalView documentation.

Linaro Forge (DDT and MAP)

Linaro Forge combines Linaro DDT for parallel debugging and Linaro MAP for performance profiling. DDT is a graphical debugger suitable for heterogeneous software, including GPU programs, while MAP helps you identify the most time-consuming lines of code in your application. Both tools support AMD ROCm programs and can provide insights into GPU kernel execution, memory usage, and performance metrics.

For more information, see the Linaro Forge documentation.

Development environments

Container-based environments provide consistent development and deployment platforms for ROCm applications.

E4S (Extreme Scale Scientific Software Stack)

E4S is a curated collection of software products based on the Spack package manager and available as container images for the Docker and Singularity runtimes. The E4S images include ROCm and its compilers, along with Python-based artificial intelligence (AI) and machine learning (ML) tools such as PyTorch for ROCm and TensorFlow for ROCm. These containers give you ready-to-use environments for HPC and AI/ML development on AMD GPUs that stay consistent from workstations to large-scale datacenter deployments.

For more information, see the E4S project website.