# Custom PyTorch Operator (KERNEL_LAUNCH) Example
> **pto-isa**: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. The repository offers high-performance, cross-platform tile operations across Ascend platforms. Project page: https://gitcode.com/cann/pto-isa
This example shows how to implement a custom PTO-based kernel in auto mode and expose it as a PyTorch operator via `torch_npu`.
## Directory Layout
```
demos/baseline/add/
├── op_extension/    # Python package entry (module loader)
├── csrc/
│   ├── kernel/      # PTO kernel implementation in auto mode
│   └── host/        # Host-side PyTorch operator registration
├── test/            # Minimal Python test
├── CMakeLists.txt   # Build configuration
├── setup.py         # Wheel build script
└── README.md        # This document
```

## 1. Implement the kernel
Add a kernel source file under `auto_mode/demos/baseline/add/csrc/kernel/` and include it in the build. For example, to build `add_custom.cpp`, add it to `auto_mode/demos/baseline/add/CMakeLists.txt`:
```cmake
ascendc_library(no_workspace_kernel STATIC
    csrc/kernel/add_custom.cpp
)
ascendc_compile_options(no_workspace_kernel PRIVATE --cce-enable-pto-passes -O2)
```

Unlike manual mode, you do not need to call `TASSIGN` and synchronization instructions manually in your kernel; the compiler takes care of them for you.
NOTE:

- Add `--cce-enable-pto-passes` to enable the compiler's auto mode.
- Kernels must be compiled with `-O2`.
- This auto mode example does not use double buffering. Using double or multi-buffering in auto mode is strongly discouraged, because it is not yet fully supported.
For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c
## 2. Integrate with PyTorch (torch_npu)
The host-side implementation lives under `auto_mode/demos/baseline/add/csrc/host/`.
### 2.1 Define the operator schema (ATen IR)
PyTorch uses `TORCH_LIBRARY`/`TORCH_LIBRARY_FRAGMENT` to declare operator schemas that can be called from Python via `torch.ops.<namespace>.<op_name>`.
Example: register a custom `my_add` operator in the `npu` namespace:
```cpp
TORCH_LIBRARY_FRAGMENT(npu, m) {
    m.def("my_add(Tensor x, Tensor y) -> Tensor");
}
```

After this, Python can call `torch.ops.npu.my_add`.
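The schema string follows PyTorch's TorchScript-style function-schema notation: an operator name, a typed argument list, and a return type. Purely as an illustration of that structure (PyTorch has its own full schema parser; `parse_schema` here is a hypothetical helper, not a PyTorch API):

```python
import re

def parse_schema(schema):
    """Split a TorchScript-style schema string into (name, args, return).

    Illustration only -- this is not how PyTorch parses schemas internally.
    """
    m = re.fullmatch(r"(\w+)\((.*)\)\s*->\s*(.+)", schema)
    name, args, ret = m.group(1), m.group(2), m.group(3)
    # Each argument is a "<type> <name>" pair, comma-separated.
    arglist = [a.strip() for a in args.split(",")] if args else []
    return name, arglist, ret

print(parse_schema("my_add(Tensor x, Tensor y) -> Tensor"))
# ('my_add', ['Tensor x', 'Tensor y'], 'Tensor')
```

The operator name (`my_add`) combined with the library namespace (`npu`) is what determines the Python call path `torch.ops.npu.my_add`.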
### 2.2 Implement the operator
- Include the generated kernel launch header `aclrtlaunch_<kernel_name>.h` (generated by the build system).
- Allocate output tensors/workspace as needed.
- Enqueue the kernel via `ACLRT_LAUNCH_KERNEL` (wrapped by `EXEC_KERNEL_CMD` in this example).
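In the host wrapper below, `totalLength` is simply the product of all tensor dimensions, i.e. the flat element count handed to the kernel alongside the block count. A pure-Python mirror of that loop (the function name is hypothetical):

```python
from functools import reduce
from operator import mul

def total_length(sizes):
    # Product of all dimensions, matching the host-side loop
    # `for (uint32_t size : x.sizes()) { totalLength *= size; }`.
    # An empty shape (a scalar tensor) yields 1.
    return reduce(mul, sizes, 1)

print(total_length((2, 3, 4)))  # 24: a 2x3x4 tensor holds 24 elements
```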
```cpp
#include "utils.h"
#include "aclrtlaunch_add_custom.h"

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y) {
    at::Tensor z = at::empty_like(x);
    uint32_t blockDim = 20;
    uint32_t totalLength = 1;
    for (uint32_t size : x.sizes()) {
        totalLength *= size;
    }
    EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
    return z;
}
```

### 2.3 Register the implementation
Register the implementation with `TORCH_LIBRARY_IMPL`. For NPU execution, `torch_npu` uses the `PrivateUse1` dispatch key; a detailed introduction to `PrivateUse1` is available in the official PyTorch documentation: https://docs.pytorch.org/tutorials/advanced/privateuseone.html
```cpp
TORCH_LIBRARY_IMPL(npu, PrivateUse1, m) {
    m.impl("my_add", TORCH_FN(run_add_custom));
}
```

## 3. Build and run
This example requires the PTO Tile Lib, PyTorch, `torch_npu`, and CANN. Follow the official `torch_npu` installation guide:
https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85
or
```shell
python3 -m pip install -r requirements.txt
```

### 3.1 Set the target SoC
Edit `auto_mode/demos/baseline/add/CMakeLists.txt` and set `SOC_VERSION` to your target (example: A2A3 uses `Ascend910B1`):
```cmake
set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")
```

You can query the chip name on the target machine via `npu-smi info` and use `Ascend<Chip Name>` as the value.
### 3.2 Build the wheel
Set the PTO Tile Lib path and build a wheel:
```shell
export ASCEND_HOME_PATH=/usr/local/Ascend/
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
rm -rf build op_extension.egg-info
python3 setup.py bdist_wheel
```

### 3.3 Install the wheel
```shell
cd dist
pip uninstall -y op_extension   # remove any previously installed version
pip install *.whl
```

### 3.4 Run the test
```shell
cd test
python3 test.py
```