Optimizing CUDA Program Performance with Nsight Compute

Nsight Compute

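Nsight Compute is the key tool for in-depth profiling of an individual kernel. It relies on the performance-counter infrastructure that CUPTI exposes (historically the Event API and Metric API, and on recent GPUs the Profiling API) to record and collect events such as the instructions a kernel executes, its memory transactions, and its warp occupancy.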

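The key factors affecting kernel performance include register utilization, compute-core utilization, memory access efficiency, and communication efficiency.
The commonly analyzed metrics are therefore:

SM utilization: whether the GPU's SMs are used effectively or sit idle part of the time.
Memory bandwidth: how much of the GPU memory bandwidth is used, to determine whether memory bandwidth is the bottleneck.
Thread divergence: whether if-else branches within a warp degrade performance (warp divergence).
Register pressure: whether the kernel uses so many registers that it limits parallelism.
L1/L2 cache hit rate: a low cache hit rate can be the cause of a performance bottleneck.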
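
To make the register-pressure item concrete, here is a minimal Python sketch that estimates how many warps can be resident on one SM for a given per-thread register count. The limits are assumptions chosen for a compute capability 8.9 GPU (65,536 registers per SM, 1536 threads or 48 warps per SM, registers allocated per warp in units of 256); the function name is hypothetical, and other occupancy limiters such as shared memory and block granularity are ignored.

```python
import math


def resident_warps_by_registers(regs_per_thread, regs_per_sm=65536,
                                max_warps_per_sm=48, warp_size=32,
                                alloc_unit=256):
    """Upper bound on resident warps per SM imposed by register usage alone.

    All default limits are assumptions for a compute capability 8.9 GPU.
    """
    if regs_per_thread <= 0:
        return max_warps_per_sm
    # Registers are allocated per warp, rounded up to the allocation unit.
    regs_per_warp = math.ceil(regs_per_thread * warp_size / alloc_unit) * alloc_unit
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


if __name__ == "__main__":
    for regs in (32, 48, 64, 128):
        warps = resident_warps_by_registers(regs)
        print(f"{regs:3d} registers/thread -> {warps:2d} warps/SM "
              f"({warps / 48:.0%} of the assumed maximum)")
```

Under these assumptions, doubling the register count from 64 to 128 per thread halves the number of warps an SM can keep resident, which is the effect the Occupancy section of a report makes visible.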

ncu CLI

Installation

The ncu CLI is available once the CUDA Toolkit is installed.

Usage

Typical usage:

ncu --set full --target-processes all -f -o my_report ./your_program

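--set full: collect the full set of sections for each kernel, including memory usage, execution efficiency, and other detailed metrics. Run ncu --list-sets to see the available sets and the sections each one contains.
--list-metrics / --metrics: list the metrics that the selected sections collect, or specify the metrics to collect.
--target-processes application-only: profile only the root application process.
--target-processes all: profile the application process and all of its child processes.
-f: overwrite an existing report file with the same name.
-o: set the output report file name.
./your_program: the CUDA program to profile.
-k kernel_name: restrict profiling to the named kernel. The name is matched exactly by default; prefix it with regex: to match a pattern, e.g. -k regex:^attention matches every kernel whose name starts with attention.
--launch-count N: profile only the first N matching kernel launches, which avoids re-profiling kernels with identical logic and reduces the data volume.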
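
When the same options are reused across many runs, it can help to assemble the command in a script. The sketch below builds an ncu argument list from the flags described above; build_ncu_command and its parameter names are hypothetical helpers, not part of Nsight Compute, and the function only constructs the command without running it.

```python
import shlex


def build_ncu_command(program, report=None, kernel=None, launch_skip=None,
                      launch_count=None, section_set="full", overwrite=True):
    """Assemble an ncu argument list (hypothetical helper, does not run ncu)."""
    cmd = ["ncu", "--set", section_set, "--target-processes", "all"]
    if kernel is not None:
        cmd += ["-k", kernel]                      # kernel name filter
    if launch_skip is not None:
        cmd += ["--launch-skip", str(launch_skip)]
    if launch_count is not None:
        cmd += ["--launch-count", str(launch_count)]
    if report is not None:
        if overwrite:
            cmd.append("-f")                       # overwrite an existing report
        cmd += ["-o", report]
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    cmd = build_ncu_command("./your_program", report="my_report",
                            kernel="my_kernel_name", launch_count=5)
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell-quoting problems with kernel names.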

Profiling a specific kernel:

ncu --set full --kernel-name "my_kernel_name" ./your_program

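Collecting specific metrics: pass a comma-separated list to --metrics to collect only those metrics, for example the number of L2 sector lookups that hit in the cache. If no kernel filter such as -k is given, the metrics are collected for every kernel.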

ncu --metrics lts__t_sectors_lookup_hit.sum ./your_program

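Several metrics can be combined into a custom collection by listing them together in --metrics (--set itself accepts only the set names reported by ncu --list-sets):

ncu --metrics sm__inst_executed.sum,sm__warps_launched ./your_program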

Example:

$ sudo ncu --set full --target-processes all --query-metrics -o cuda_stress_testreport ./cuda_stress_test
Starting CUDA Stress Test for 2 minutes...
==PROF== Connected to process 796 (/home/mikey/workspace/cuda_test/cuda_stress_test)
Number of CUDA devices: 1
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU
Compute capability: 8.9
Total global memory: 8.00 GB
Multiprocessors: 24
Max threads per block: 1024
Max threads per multiprocessor: 1536
Running stress test for 2 minutes...
Iteration | Time (ms) | GFLOP/s
----------|-----------|---------
==PROF== Profiling "matrixMultiply" - 0: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 1: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 2: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 3: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 4: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 5: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 6: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 7: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 8: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 9: 0%....50%....100% - 43 passes
10 | 1460.127 | 1.47
==PROF== Profiling "matrixMultiply" - 10: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 11: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 12: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 13: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 14: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 15: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 16: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 17: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 18: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 19: 0%....50%....100% - 43 passes
20 | 1501.078 | 1.43
==PROF== Profiling "matrixMultiply" - 20: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 21: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 22: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 23: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 24: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 25: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 26: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 27: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 28: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 29: 0%....50%....100% - 43 passes
30 | 1451.803 | 1.48
==PROF== Profiling "matrixMultiply" - 30: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 31: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 32: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 33: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 34: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 35: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 36: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 37: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 38: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 39: 0%....50%....100% - 43 passes
40 | 1458.267 | 1.47
==PROF== Profiling "matrixMultiply" - 40: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 41: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 42: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 43: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 44: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 45: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 46: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 47: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 48: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 49: 0%....50%....100% - 43 passes
50 | 1541.696 | 1.39
==PROF== Profiling "matrixMultiply" - 50: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 51: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 52: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 53: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 54: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 55: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 56: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 57: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 58: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 59: 0%....50%....100% - 43 passes
60 | 1512.155 | 1.42
==PROF== Profiling "matrixMultiply" - 60: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 61: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 62: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 63: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 64: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 65: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 66: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 67: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 68: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 69: 0%....50%....100% - 43 passes
70 | 1530.311 | 1.40
==PROF== Profiling "matrixMultiply" - 70: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 71: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 72: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 73: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 74: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 75: 0%....50%....100% - 44 passes
==PROF== Profiling "matrixMultiply" - 76: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 77: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 78: 0%....50%....100% - 43 passes
==PROF== Profiling "matrixMultiply" - 79: 0%....50%....100% - 43 passes
80 | 1477.990 | 1.45
==PROF== Profiling "matrixMultiply" - 80: 0%....50%....100% - 43 passes
Test completed successfully!
Total iterations: 81
Total time: 120.99 seconds
Average performance: 1.43 GFLOP/s
CUDA stress test completed successfully!
==PROF== Disconnected from process 796
==PROF== Report: /home/mikey/workspace/cuda_test/cuda_stress_testreport.ncu-rep

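The log shows the substantial slowdown that ncu profiling introduces: every matrixMultiply launch is replayed for 43 or 44 passes.

Because the number of hardware counters a GPU can collect at once is limited, ncu usually has to replay each kernel multiple times to gather all requested metrics. For example, a kernel may run dozens of times to collect different counter sets, which are then aggregated. Nsight Compute therefore incurs significant performance overhead.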
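
To quantify the replay cost from a captured console log, the pass counts can be summed per kernel. The following sketch assumes the usual ==PROF== Profiling "name" - index: ... - N passes line format with normal spacing; summarize_passes is a hypothetical helper, and the three sample lines are launches 74 to 76 from the run above.

```python
import re

# Matches lines such as:
# ==PROF== Profiling "matrixMultiply" - 75: 0%....50%....100% - 44 passes
PROF_LINE = re.compile(
    r'==PROF== Profiling "(?P<kernel>[^"]+)" - (?P<launch>\d+): .* - (?P<passes>\d+) passes'
)


def summarize_passes(log_text):
    """Return {kernel_name: (profiled_launches, total_replay_passes)}."""
    summary = {}
    for match in PROF_LINE.finditer(log_text):
        launches, passes = summary.get(match["kernel"], (0, 0))
        summary[match["kernel"]] = (launches + 1, passes + int(match["passes"]))
    return summary


if __name__ == "__main__":
    sample = (
        '==PROF== Profiling "matrixMultiply" - 74: 0%....50%....100% - 43 passes\n'
        '==PROF== Profiling "matrixMultiply" - 75: 0%....50%....100% - 44 passes\n'
        '==PROF== Profiling "matrixMultiply" - 76: 0%....50%....100% - 43 passes\n'
    )
    for kernel, (launches, passes) in summarize_passes(sample).items():
        print(f"{kernel}: {launches} launches, {passes} replay passes")
```

Each pass re-executes the kernel, so the total pass count is a rough lower bound on how many times more GPU work the profiled run performs compared with an unprofiled one.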

There are several ways to replay a kernel multiple times, and ncu's operating modes can be distinguished by the replay mode in use.

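Regular mode: --replay-mode is set to kernel or application. ncu serializes all kernels in the process while profiling, so a kernel that must synchronize with other processes (for example NCCL or MPI communication) can cause the profiling run to hang. This mode is therefore suitable only for serial programs and for parallel programs whose profiled kernels do not synchronize across processes.
- kernel: the program runs once, and each profiled kernel is replayed N times.
- application: the program is executed N times, and in each run every kernel is profiled once.
Range Profiling mode: --replay-mode is set to app-range. It addresses the regular mode's inability to profile programs that contain synchronizing kernels: insert cudaProfilerStart and cudaProfilerStop into the code to mark a profiling range, and only that range is profiled.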

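Other commonly used options:

--launch-skip 1000: skip the first 1000 kernel launches before profiling starts.
--launch-count 1: profile only 1 kernel launch.

For the full list of options, see: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison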

GUI

Installation

On macOS, download the installer and install it directly.

Remote mode

Analysis

Open the Nsight Compute GUI and load the .ncu-rep report file via File -> Open.

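The Details page of the Nsight Compute GUI contains the following sections:

GPU Speed Of Light Throughput: overall utilization of the GPU's compute and memory resources.
Compute Workload Analysis: utilization of the SM compute resources, plus IPC. Use it to assess the kernel's compute efficiency, such as instruction throughput and the number of active threads.
Memory Workload Analysis: utilization of each level of the memory hierarchy, including global and shared memory accesses and cache hit rates.
Scheduler Statistics: how the warp schedulers issue instructions.
Warp State Statistics: the states warps are in during kernel execution.
Instruction Statistics: the mix of SASS instructions and how they execute, with per-type instruction counts.
Launch Statistics: the kernel's launch configuration, including grid, block, thread, warp, register, and shared memory usage.
Occupancy: warp occupancy.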

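Common steps for optimizing a CUDA kernel:

Analyze kernel execution time: look at the execution time of each kernel and find the one that takes the longest.
Check register and shared memory usage: see whether the kernel uses too many registers or too much shared memory. Excessive register usage limits the number of concurrent threads and therefore lowers SM utilization.
Optimize the memory access pattern: make sure global memory accesses are coalesced and reduce misaligned accesses. Shared memory can also be used to relieve pressure on global memory.
Reduce thread divergence: make the threads of a warp execute the same instructions wherever possible to avoid warp divergence.
Increase kernel parallelism: adjust the thread block size or restructure the algorithm to increase the amount of work executed in parallel on the GPU.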
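
The effect of coalescing can be estimated without a GPU by counting how many memory sectors one warp-wide load touches. The sketch below is a simplified model under stated assumptions (32 threads per warp, 4-byte elements, 32-byte sectors, each thread loading one element); sectors_per_warp_load is a hypothetical helper and not an Nsight Compute metric.

```python
def sectors_per_warp_load(stride_elems, elem_bytes=4, warp_size=32,
                          sector_bytes=32, base_addr=0):
    """Count distinct sectors touched when each lane of a warp loads one element.

    Lane i reads elem_bytes bytes at base_addr + i * stride_elems * elem_bytes.
    """
    sectors = set()
    for lane in range(warp_size):
        first = base_addr + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element may straddle a sector boundary if it is misaligned.
        sectors.update(range(first // sector_bytes, last // sector_bytes + 1))
    return len(sectors)


if __name__ == "__main__":
    useful_bytes = 32 * 4  # bytes the warp actually needs
    for stride in (1, 2, 8, 32):
        n = sectors_per_warp_load(stride)
        print(f"stride {stride:2d}: {n:2d} sectors, "
              f"{useful_bytes / (n * 32):.1%} of fetched bytes used")
    # A unit-stride access that starts 4 bytes off a sector boundary
    # touches one extra sector.
    print("misaligned:", sectors_per_warp_load(1, base_addr=4), "sectors")
```

In this model a unit-stride, aligned access needs 4 sectors per warp, while a stride of 8 elements or more needs 32, so the same useful data costs eight times the memory traffic.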

NVIDIA DCGM

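NVIDIA GPUs have built-in hardware counters that collect device-level performance metrics such as GPU utilization and memory usage. These are exposed through NVML (NVIDIA Management Library) so that tools such as nvidia-smi and DCGM (Data Center GPU Manager) can query them.

DCGM is a cluster-level GPU telemetry and monitoring technology with a distributed architecture: one aggregation server plus an nv-hostengine service on each host. At second-level intervals, nv-hostengine talks to the GPU driver interface to collect GPU metrics, including compute utilization, memory utilization, temperature, power draw, clock speeds, ECC memory errors, PCIe throughput, and NVLink throughput, with very low overhead. For Kubernetes environments, DCGM-Exporter integrates with Prometheus and Grafana.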
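
DCGM-Exporter publishes its metrics in the Prometheus text format, so a scrape can be inspected with a few lines of Python. The sketch below parses such text and extracts per-GPU utilization; the sample payload, its label values, and the helper names are hypothetical and only illustrate the format, while DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are DCGM field names.

```python
import re

# name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"] or ""))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


def gpu_utilization(text):
    """Map GPU index label to DCGM_FI_DEV_GPU_UTIL value."""
    return {labels.get("gpu"): value
            for name, labels, value in parse_metrics(text)
            if name == "DCGM_FI_DEV_GPU_UTIL"}


if __name__ == "__main__":
    # Hypothetical scrape payload, for illustration only.
    payload = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 6144
"""
    print(gpu_utilization(payload))
```

In a real deployment Prometheus performs the scraping and parsing; a script like this is mainly useful for checking what an exporter endpoint returns.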

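Official site: https://developer.nvidia.com/dcgm

Unlike the tools above, DCGM targets metrics monitoring, whereas Nsight targets tracing and profiling for performance optimization. DCGM therefore has the following features and limitations.

Features:

Fully transparent data collection: data is collected directly at the hardware level with almost no impact on application performance. Applications need no code or configuration changes to enable collection.
Continuous, real-time monitoring: data is collected whether or not an application is running.

Limitations:

Not CUDA-program aware: when several applications share one GPU, DCGM cannot precisely attribute GPU resource usage to each application.
Not CPU-scheduling aware: hardware-level analysis reflects only GPU-side resource usage and performance. It cannot show how the CPU and GPU cooperate, for example how CPU scheduling efficiency or task submission rate affects the overall performance of a CUDA application.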
