
Ascend CANN / PTO-ISA Custom Operator Example

Author: 张小明

Custom PyTorch Operator (KERNEL_LAUNCH) Example

[Free download link] pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa

This example shows how to implement a custom PTO-based kernel in auto mode and expose it as a PyTorch operator via torch_npu.

Directory Layout

demos/baseline/add/
├── op_extension/      # Python package entry (module loader)
├── csrc/
│   ├── kernel/        # PTO kernel implementation in auto mode
│   └── host/          # Host-side PyTorch operator registration
├── test/              # Minimal Python test
├── CMakeLists.txt     # Build configuration
├── setup.py           # Wheel build script
└── README.md          # This document

1. Implement the kernel

Add a kernel source file under auto_mode/demos/baseline/add/csrc/kernel/ and include it in the build. For example, to build add_custom.cpp, add it to auto_mode/demos/baseline/add/CMakeLists.txt:

ascendc_library(no_workspace_kernel STATIC
    csrc/kernel/add_custom.cpp
)
ascendc_compile_options(no_workspace_kernel PRIVATE --cce-enable-pto-passes -O2)

Unlike manual mode, you don't need to manually call TASSIGN and synchronization instructions in your kernel; the compiler will take care of them for you.

NOTE:

  1. Add --cce-enable-pto-passes to enable the compiler's auto mode.
  2. Kernels must be compiled with -O2.
  3. This auto mode example does not use double buffering. Double buffering and multi-buffering are not yet fully supported in auto mode, so it is strongly recommended NOT to use them.

For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c

2. Integrate with PyTorch (torch_npu)

The host-side implementation lives under auto_mode/demos/baseline/add/csrc/host/.

2.1 Define the operator schema (ATen IR)

PyTorch uses TORCH_LIBRARY / TORCH_LIBRARY_FRAGMENT to declare operator schemas that can be called from Python via torch.ops.<namespace>.<op_name>.

Example: register a custom my_add operator in the npu namespace:

TORCH_LIBRARY_FRAGMENT(npu, m) {
    m.def("my_add(Tensor x, Tensor y) -> Tensor");
}

After this, Python can call torch.ops.npu.my_add.
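The define/call round trip can be sketched with plain PyTorch on CPU, with no torch_npu or Ascend device required. The demo namespace and the CPU implementation below are stand-ins of my own for illustration; the real example registers in the npu namespace and launches an NPU kernel:

```python
import torch

# Sketch of the schema define/call flow using plain PyTorch on CPU.
# "demo" namespace and my_add_cpu are hypothetical stand-ins for the
# "npu" namespace and the PTO kernel launch in the real example.
lib = torch.library.Library("demo", "DEF")
lib.define("my_add(Tensor x, Tensor y) -> Tensor")

def my_add_cpu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # CPU reference standing in for the add_custom kernel.
    return x + y

lib.impl("my_add", my_add_cpu, "CPU")

# The schema is now reachable as torch.ops.<namespace>.<op_name>.
z = torch.ops.demo.my_add(torch.ones(3), torch.full((3,), 2.0))
```

On a machine with torch_npu installed, the same call shape applies to torch.ops.npu.my_add, dispatched to the NPU implementation instead of CPU.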

2.2 Implement the operator

  1. Include the generated kernel launch header aclrtlaunch_<kernel_name>.h (generated by the build system).
  2. Allocate output tensors/workspace as needed.
  3. Enqueue the kernel via ACLRT_LAUNCH_KERNEL (wrapped by EXEC_KERNEL_CMD in this example).
#include "utils.h"
#include "aclrtlaunch_add_custom.h"

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y) {
    at::Tensor z = at::empty_like(x);
    uint32_t blockDim = 20;
    uint32_t totalLength = 1;
    for (uint32_t size : x.sizes()) {
        totalLength *= size;
    }
    EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
    return z;
}
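The host wrapper flattens the tensor: totalLength is the product of all dimensions, and the kernel adds the two flat buffers element-wise. That behavior can be mirrored in a small, hardware-free reference; total_length and add_reference are hypothetical helper names of my own:

```python
from functools import reduce
from operator import mul

def total_length(sizes):
    # Product of all dimensions, as run_add_custom computes before the launch.
    # An empty shape yields 1 (a scalar tensor has one element).
    return reduce(mul, sizes, 1)

def add_reference(x, y):
    # Elementwise add over flat buffers: the result add_custom should produce.
    return [a + b for a, b in zip(x, y)]
```

Such a reference is handy for checking NPU results against a golden output in tests.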

2.3 Register the implementation

Register the implementation with TORCH_LIBRARY_IMPL. For NPU execution, torch_npu uses the PrivateUse1 dispatch key; a detailed introduction to PrivateUse1 can be found on the PyTorch website: https://docs.pytorch.org/tutorials/advanced/privateuseone.html

TORCH_LIBRARY_IMPL(npu, PrivateUse1, m) {
    m.impl("my_add", TORCH_FN(run_add_custom));
}

3. Build and run

This example requires PTO Tile Lib, PyTorch, torch_npu, and CANN. Follow the official torch_npu installation guide:

https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85

or

python3 -m pip install -r requirements.txt

3.1 Set the target SoC

Edit auto_mode/demos/baseline/add/CMakeLists.txt and set SOC_VERSION to your target (for example, A2/A3 uses Ascend910B1):

set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")

You can query the chip name on the target machine via npu_smi info and use Ascend<Chip Name> as the value.

3.2 Build the wheel

Set the PTO Tile Lib path and build a wheel:

export ASCEND_HOME_PATH=/usr/local/Ascend/
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
rm -rf build op_extension.egg-info
python3 setup.py bdist_wheel

3.3 Install the wheel

cd dist
pip uninstall -y op_extension
pip install *.whl

3.4 Run the test

cd test
python3 test.py
