CANN asc-devkit基础API指南-平芜编程栈

Basic API Contribution Guide

【免费下载链接】asc-devkit本项目是CANN 推出的昇腾AI处理器专用的算子程序开发语言，原生支持C和C++标准规范，主要由类库和语言扩展层构成，提供多层级API，满足多维场景算子开发诉求。项目地址: https://gitcode.com/cann/asc-devkit

Overview

Basic API is the instruction-level API layer in the Ascend C programming framework. It directly wraps hardware instructions of Ascend AI processors and uses C++ style function interfaces. Basic API serves as the foundation for building high-level APIs. Developers can implement complex algorithm logic by combining basic APIs.

Core Features of Basic API:

Instruction-level encapsulation: Each API maps to one or more hardware instructions.
LocalTensor abstraction: UsesLocalTensor<T>type to operate memory.
Template design: Supports multiple data types (half, float, int16_t, int32_t, and so on).
Dual interfaces: High-dimensional tiling computation (fine control) and first-n elements computation (simplified invocation).
Architecture adaptation: Supports different NPU architectures through architecture macro definitions.

Development Process

Requirement Analysis

Define API functionality (for example, Add, Mul, Relu).
Determine supported data types.
Analyze hardware instruction support.

API Design

Define function prototypes (using LocalTensor).
Design high-dimensional tiling computation and first-n elements computation interfaces.
Define parameter specifications (mask, repeat, stride, and so on).

Implementation Development

Write interface declarations (include/basic_api/).
Implement core logic (impl/basic_api/).
Handle architecture differences.

Test and Verification

Write unit tests.
Verify functional correctness.
Check boundary conditions.

Documentation

Complete API documentation.
Provide usage examples.
Explain constraints.

API Introduction

High-dimensional Tiling Computation vs First-n Elements Computation Interface

High-dimensional Tiling Computation (Fine Control)

// Requires manual setting of mask and repeat parameters template <typename T, bool isSetMask = true> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, uint64_t mask[], // mask array const uint8_t repeatTime, // repeat count const BinaryRepeatParams& repeatParams); // stride parameters

Applicable Scenarios:

Require fine control over computation process.
Non-contiguous memory access.
Performance optimization.

First-n Elements Computation (Simplified Invocation)

// Automatically handles mask and repeat template <typename T> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, const int32_t& count); // only element count needed

Applicable Scenarios:

Contiguous memory block computation.
Simplified code.
Rapid development.

Directory Planning

Directory Structure

asc-devkit/ ├── include/ │ └── basic_api/ # Basic API header files │ ├── kernel_operator_common_intf.h # Common interface │ ├── kernel_operator_vec_binary_intf.h # Vector binary operations │ ├── kernel_operator_vec_unary_intf.h # Vector unary operations │ ├── kernel_operator_data_copy_intf.h # Data movement │ ├── kernel_operator_fixpipe_intf.h # Fixpipe │ ├── kernel_operator_mm_intf.h # Matrix multiplication │ ├── kernel_operator_scalar_intf.h # Scalar operations │ ├── kernel_operator_sys_var_intf.h # System variables │ ├── kernel_operator_atomic_intf.h # Atomic operations │ ├── kernel_tensor.h # Tensor definition │ └── kernel_struct_*.h # Parameter structures │ ├── impl/ │ └── basic_api/ # Basic API implementation │ ├── dav_m200/ # NPU ARCH 200x architecture │ │ ├── kernel_operator_vec_binary_impl.h │ │ └── ... │ ├── dav_c220/ # NPU ARCH 220x architecture │ │ ├── kernel_operator_vec_binary_impl.h │ │ └── ... │ └── CMakeLists.txt │ ├── tests/ │ └── api/ │ └── basic_api/ # Basic API tests │ ├── tikcpp_case_common/ │ │ └── test_operator_axpy.cpp │ ├── tikcpp_case_ascend910/ │ │ └── ... │ └── tikcpp_case_ascend910b1/ │ └── ... │ └── docs/ └── api/ └── context/ └── ... # Basic API documentation

File Naming Conventions

File Type	Naming Convention	Example
Interface header	`kernel_operator_<category>_intf.h`	`kernel_operator_vec_binary_intf.h`
Implementation file	`kernel_operator_<category>_impl.h`	`kernel_operator_vec_binary_impl.h`
Test file	`test_operator_<category>.cpp`	`test_operator_vec_binary.cpp`
Documentation file	`<api>.md`	`Add.md`

API Categories

Category	Description	Example APIs
vec_binary	Vector binary operations	Add, Sub, Mul, Div, Max, Min
vec_unary	Vector unary operations	Relu, Exp, Cast, Abs
vec_reduce	Vector reduction	Sum, Max, Mean
data_copy	Data movement	DataCopy, LoadData
fixpipe	Pipeline control	Fixpipe
mm	Matrix multiplication	Mmad, Conv2D
scalar	Scalar operations	ToFloat
atomic	Atomic operations	AtomicAdd, AtomicCAS

Architecture Design

Implementation Layers

Layer 1: Interface Declaration Layer (include/basic_api/)

// include/basic_api/kernel_operator_vec_binary_intf.h #ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H #define ASCENDC_MODULE_OPERATOR_VEC_BINARY_INTERFACE_H #include "kernel_tensor.h" #include "kernel_struct_binary.h" namespace AscendC { // Add - High-dimensional tiling computation template <typename T, bool isSetMask = true> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, uint64_t mask[], const uint8_t repeatTime, const BinaryRepeatParams& repeatParams); // Add - First-n elements computation template <typename T> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, const int32_t& count); } // namespace AscendC #include "impl/basic_api/kernel_operator_vec_binary_intf_impl.h" #endif

Layer 2: Instruction Implementation Layer (impl/basic_api/)

// impl/basic_api/dav_c220/kernel_operator_vec_binary_impl.h #ifndef ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H #define ASCENDC_MODULE_OPERATOR_VEC_BINARY_IMPL_H namespace AscendC { // Add implementation - First-n elements computation template <typename T> __aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0, __ubuf__ T* src1, const int32_t& count) { if ASCEND_IS_AIV { // 1. Set mask set_mask_count(); set_vector_mask(0, count); // 2. Call underlying instruction vadd(dst, src0, src1, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE, DEFAULT_REPEAT_STRIDE); // 3. Restore mask set_mask_norm(); set_vector_mask(static_cast<uint64_t>(-1), static_cast<uint64_t>(-1)); } } // Add implementation - High-dimensional tiling computation template <typename T, bool isSetMask = true> __aicore__ inline void AddImpl(__ubuf__ T* dst, __ubuf__ T* src0, __ubuf__ T* src1, const uint64_t mask[], const uint8_t repeatTime, const BinaryRepeatParams& repeatParams) { if ASCEND_IS_AIV { // Set mask (if needed) if (isSetMask) { AscendCUtils::SetMask<T, isSetMask>(mask[1], mask[0]); } // Call underlying instruction vadd(dst, src0, src1, repeatTime, repeatParams.dstBlkStride, repeatParams.src0BlkStride, repeatParams.src1BlkStride, repeatParams.dstRepStride, repeatParams.src0RepStride, repeatParams.src1RepStride); } } } // namespace AscendC #endif

Layer 3: Interface Wrapper Layer

// impl/basic_api/kernel_operator_vec_binary_intf_impl.h namespace AscendC { // First-n elements computation interface wrapper template <typename T> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, const int32_t& count) { AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count); } // High-dimensional tiling computation interface wrapper template <typename T, bool isSetMask = true> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, uint64_t mask[], const uint8_t repeatTime, const BinaryRepeatParams& repeatParams) { AddImpl<T, isSetMask>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), mask, repeatTime, repeatParams); } } // namespace AscendC

Architecture Adaptation

Hardware may differ across NPU architectures and requires reimplementation.

Development Example: Implementing Axpy Basic API

API Requirement Analysis

Implement vector multiply-add:dst = src * scalar + dst

Supported data types: half, float
Interface type: First-n elements computation (simplified invocation)
Hardware support: Confirm hardware support

Review Existing API Structure

Basic API usesLocalTensor<T>as parameters. The first-n elements computation interface only requires the count parameter:

// Reference existing Add interface template <typename T> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, const int32_t& count);

Interface Design

Add ininclude/basic_api/kernel_operator_vec_binary_intf.h:

/* ************************************************************************************************** * Axpy * * ************************************************************************************************* */ /* * @ingroup Axpy * @brief dst = dst + src * scalar * @param [out] dst output LocalTensor * @param [in] src input LocalTensor * @param [in] scalar scalar value * @param [in] count number Number of data involved in calculation */ template <typename T, typename U> __aicore__ inline void Axpy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const U scalar, const int32_t& count);

Implementation Code

Reference other interface implementations.

Interface Wrapper

Add inimpl/basic_api/kernel_operator_vec_binary_intf_impl.h:

template <typename T, typename U> __aicore__ inline void Axpy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const U scalar, const int32_t& count) { AxpyImpl<T, U>(dst.GetPtr(), src.GetPtr(), scalar, count); }

Test Code

Add test code for the corresponding interface.

Test and Verification Requirements

Functional Testing

Verify API computation correctness.

Boundary Testing

TEST_F(TestAxpy, BoundaryTest) { // Test boundary values: count=0, 1, 256, 257 // Test different data type combinations // Test special values (NaN, Inf) }

Data Type Testing

INSTANTIATE_TEST_CASE_P(TEST_AXPY_TYPES, AxpyTestsuite, ::testing::Values( BinaryTestParams { 256, 2, 2, main_axpy<half, half> }, BinaryTestParams { 256, 4, 2, main_axpy<float, half> }, BinaryTestParams { 256, 4, 4, main_axpy<float, float> } ) );

Code Standards

Naming Conventions

// Function name: PascalCase, first letter uppercase void Add(...); void Relu(...); void Axpy(...); // Parameter name: camelCase LocalTensor<T> dstTensor; int32_t elementCount; // Macro definition: UPPERCASE_WITH_UNDERSCORES #define ASCENDC_ASSERT(cond, msg) ... // Type name: PascalCase struct BinaryRepeatParams; class LocalTensor;

Code Style

// 1. Indentation: 4 spaces // 2. Braces: K&R style // 3. Spaces: Spaces around operators // 4. Comments: Doxygen style /** * @brief Vector addition operation * @param dst Destination LocalTensor * @param src0 Source LocalTensor 0 * @param src1 Source LocalTensor 1 * @param count Element count */ template <typename T> __aicore__ inline void Add(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, const int32_t& count) { // Parameter validation ASCENDC_ASSERT(count > 0, "count must be positive"); // Call implementation AddImpl<T>(dst.GetPtr(), src0.GetPtr(), src1.GetPtr(), count); }

Error Handling

// 1. Parameter validation (Debug mode) ASCENDC_ASSERT(count > 0, "count must be greater than 0"); ASCENDC_ASSERT(dst != nullptr, "dst cannot be nullptr"); ASCENDC_ASSERT(src != nullptr, "src cannot be nullptr"); // 2. Type checking static_assert(SupportType<T, half, float, int16_t, int32_t>(), "Unsupported data type"); // 3. Architecture checking #if !defined(__NPU_ARCH__) || (__NPU_ARCH__ != 2201 && __NPU_ARCH__ != 3510) #error "Unsupported NPU architecture" #endif

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考