# Custom PyTorch Operator (KERNEL_LAUNCH) Example
> **pto-isa**: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. The repository offers high-performance, cross-platform tile operations across Ascend platforms. Project page: https://gitcode.com/cann/pto-isa
This example shows how to implement a custom PTO-based kernel in auto mode and expose it as a PyTorch operator via `torch_npu`.
## Directory Layout
```
demos/baseline/add/
├── op_extension/    # Python package entry (module loader)
├── csrc/
│   ├── kernel/      # PTO kernel implementation in auto mode
│   └── host/        # Host-side PyTorch operator registration
├── test/            # Minimal Python test
├── CMakeLists.txt   # Build configuration
├── setup.py         # Wheel build script
└── README.md        # This document
```

## 1. Implement the kernel
Add a kernel source file under `auto_mode/demos/baseline/add/csrc/kernel/` and include it in the build. For example, to build `add_custom.cpp`, add it to `auto_mode/demos/baseline/add/CMakeLists.txt`:
```cmake
ascendc_library(no_workspace_kernel STATIC
    csrc/kernel/add_custom.cpp
)
ascendc_compile_options(no_workspace_kernel PRIVATE --cce-enable-pto-passes -O2)
```

Unlike manual mode, you do not need to call `TASSIGN` and synchronization instructions manually in your kernel; the compiler takes care of them for you.
NOTE:

- Add `--cce-enable-pto-passes` to enable the compiler's auto mode.
- Kernels must be compiled with `-O2`.
- This auto mode example does not use double buffering. Using double or multi-buffering in auto mode is strongly discouraged, because it is not yet fully supported.
For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c
## 2. Integrate with PyTorch (torch_npu)
The host-side implementation lives under `auto_mode/demos/baseline/add/csrc/host/`.
### 2.1 Define the operator schema (ATen IR)
PyTorch uses `TORCH_LIBRARY`/`TORCH_LIBRARY_FRAGMENT` to declare operator schemas that can be called from Python via `torch.ops.<namespace>.<op_name>`.
Example: register a custom `my_add` operator in the `npu` namespace:
```cpp
TORCH_LIBRARY_FRAGMENT(npu, m) {
    m.def("my_add(Tensor x, Tensor y) -> Tensor");
}
```

After this, Python can call `torch.ops.npu.my_add`.
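The schema string follows PyTorch's TorchScript-style function-schema notation: an operator name, a typed argument list, and a return type. Purely as an illustration of that structure (PyTorch has its own full schema parser; `parse_schema` here is a hypothetical helper, not a PyTorch API):

```python
import re

def parse_schema(schema):
    """Split a TorchScript-style schema string into (name, args, return).

    Illustration only -- this is not how PyTorch parses schemas internally.
    """
    m = re.fullmatch(r"(\w+)\((.*)\)\s*->\s*(.+)", schema)
    name, args, ret = m.group(1), m.group(2), m.group(3)
    # Each argument is a "<type> <name>" pair, comma-separated.
    arglist = [a.strip() for a in args.split(",")] if args else []
    return name, arglist, ret

print(parse_schema("my_add(Tensor x, Tensor y) -> Tensor"))
# ('my_add', ['Tensor x', 'Tensor y'], 'Tensor')
```

The operator name (`my_add`) combined with the library namespace (`npu`) is what determines the Python call path `torch.ops.npu.my_add`.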
### 2.2 Implement the operator
- Include the generated kernel launch header `aclrtlaunch_<kernel_name>.h` (generated by the build system).
- Allocate output tensors/workspace as needed.
- Enqueue the kernel via `ACLRT_LAUNCH_KERNEL` (wrapped by `EXEC_KERNEL_CMD` in this example).
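In the host wrapper below, `totalLength` is simply the product of all tensor dimensions, i.e. the flat element count handed to the kernel alongside the block count. A pure-Python mirror of that loop (the function name is hypothetical):

```python
from functools import reduce
from operator import mul

def total_length(sizes):
    # Product of all dimensions, matching the host-side loop
    # `for (uint32_t size : x.sizes()) { totalLength *= size; }`.
    # An empty shape (a scalar tensor) yields 1.
    return reduce(mul, sizes, 1)

print(total_length((2, 3, 4)))  # 24: a 2x3x4 tensor holds 24 elements
```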
```cpp
#include "utils.h"
#include "aclrtlaunch_add_custom.h"

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y) {
    at::Tensor z = at::empty_like(x);
    uint32_t blockDim = 20;
    uint32_t totalLength = 1;
    for (uint32_t size : x.sizes()) {
        totalLength *= size;
    }
    EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
    return z;
}
```

### 2.3 Register the implementation
Register the implementation with `TORCH_LIBRARY_IMPL`. For NPU execution, `torch_npu` uses the `PrivateUse1` dispatch key; a detailed introduction to `PrivateUse1` is available in the official PyTorch documentation: https://docs.pytorch.org/tutorials/advanced/privateuseone.html
```cpp
TORCH_LIBRARY_IMPL(npu, PrivateUse1, m) {
    m.impl("my_add", TORCH_FN(run_add_custom));
}
```

## 3. Build and run
This example requires the PTO Tile Lib, PyTorch, `torch_npu`, and CANN. Follow the official `torch_npu` installation guide:
https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85
or
```shell
python3 -m pip install -r requirements.txt
```

### 3.1 Set the target SoC
Edit `auto_mode/demos/baseline/add/CMakeLists.txt` and set `SOC_VERSION` to your target (example: A2A3 uses `Ascend910B1`):
```cmake
set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")
```

You can query the chip name on the target machine via `npu-smi info` and use `Ascend<Chip Name>` as the value.
### 3.2 Build the wheel
Set the PTO Tile Lib path and build a wheel:
```shell
export ASCEND_HOME_PATH=/usr/local/Ascend/
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
rm -rf build op_extension.egg-info
python3 setup.py bdist_wheel
```

### 3.3 Install the wheel
```shell
cd dist
pip uninstall -y op_extension   # remove any previously installed version
pip install *.whl
```

### 3.4 Run the test
```shell
cd test
python3 test.py
```