CANNBot Skills: A2 Triple Bridge Pattern

a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency, and this repository provides its reusable Skills modules. Project URL: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

one cube stage that produces a score tile
vec logic that updates running row state and emits a delayed cube input
a later cube stage that consumes that delayed tile
a final vec stage that accumulates the delayed cube output

Typical target formula:

score_j = q.float() @ k_j.float().t() * scale
curr_m = maximum(prev_m, rowmax(score_j))
expdiff_j = exp(prev_m - curr_m)
p_j = exp(score_j - curr_m).half()
pv_j = p_j.float() @ v_j.float()
out = out * expdiff_j + pv_j

This is not normalized online softmax. It keeps a running max and a rescaled numerator only. There is no running sum or final divide. If you need a running `row_sum` and a final `out / row_sum`, switch to `agent/references/patterns/a2-cube-vec-cube-vec-softmax.md`.
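As a host-side cross-check, the target formula can be sketched in plain Python. This is a minimal reference under stated assumptions, not the kernel: the helper names (`matmul`, `transpose`, `unnormalized_attention`, `one_shot`) and the tiny sample data are made up for illustration, and the `.half()` cast on `p_j` is omitted for brevity. It shows that the streamed update produces the same numerator as a single pass that uses the final row max.

```python
import math

def matmul(a, b):
    """Plain-Python product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def unnormalized_attention(q, k_tiles, v_tiles, scale):
    """Streamed form: running row max plus rescaled numerator, no row sum."""
    rows, d = len(q), len(v_tiles[0][0])
    prev_m = [-math.inf] * rows
    out = [[0.0] * d for _ in range(rows)]
    for k_j, v_j in zip(k_tiles, v_tiles):
        score = [[s * scale for s in row] for row in matmul(q, transpose(k_j))]
        curr_m = [max(m, max(row)) for m, row in zip(prev_m, score)]
        expdiff = [math.exp(pm - cm) for pm, cm in zip(prev_m, curr_m)]
        p = [[math.exp(s - cm) for s in row] for row, cm in zip(score, curr_m)]
        pv = matmul(p, v_j)
        out = [[o * e + x for o, x in zip(o_row, pv_row)]
               for o_row, e, pv_row in zip(out, expdiff, pv)]
        prev_m = curr_m
    return out

def one_shot(q, k_tiles, v_tiles, scale):
    """Same numerator computed in one pass with the global row max."""
    k = [row for tile in k_tiles for row in tile]
    v = [row for tile in v_tiles for row in tile]
    score = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    p = [[math.exp(s - max(row)) for s in row] for row in score]
    return matmul(p, v)

# Tiny illustrative inputs: 2 query rows, 2 tiles of 2 keys each, D = 2.
q = [[1.0, 0.0], [0.5, -1.0]]
k_tiles = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, -1.0], [-2.0, 0.5]]]
v_tiles = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [-1.0, 3.0]]]
streamed = unnormalized_attention(q, k_tiles, v_tiles, scale=0.5)
reference = one_shot(q, k_tiles, v_tiles, scale=0.5)
```

Comparing `streamed` against `reference` element by element is a quick way to confirm that the rescale-then-add update leaves no row sum or divide to apply.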

Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

cube -> vec cannot use `l0c_to_ub`
vec -> cube cannot use `ub_to_l1_*`
the delayed cube output must return to vec for the final accumulation

So the stable data path is:

GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score) -> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

Workspaces and ownership edges

Use three GM workspaces:

`score_ws`
- dtype: `float`
- shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
- purpose: `L0C(score)` -> `UB(score)`
`p_ws`
- dtype: `half`
- shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
- purpose: `UB(p_j)` -> `L1(p_j)`
`pv_ws`
- dtype: `float`
- shape: `[GetCubeNum(), 2, TILE_M, D]`
- purpose: `L0C(pv_j)` -> `UB(pv_j)`
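To get a feel for the GM footprint of the three workspaces listed above, the sizes can be computed on the host. This is a rough sketch: `workspace_bytes` is a hypothetical helper, and the cube count of 20 and the 128-wide tiles are assumed values chosen only for illustration.

```python
DTYPE_BYTES = {"float": 4, "half": 2}

def workspace_bytes(cube_num, tile_m, tile_n, d):
    """Byte size of each GM workspace, using the shapes listed above."""
    shapes = {
        "score_ws": ("float", (cube_num, 2, tile_m, tile_n)),
        "p_ws": ("half", (cube_num, 2, tile_m, tile_n)),
        "pv_ws": ("float", (cube_num, 2, tile_m, d)),
    }
    sizes = {}
    for name, (dtype, shape) in shapes.items():
        elems = 1
        for dim in shape:
            elems *= dim
        sizes[name] = elems * DTYPE_BYTES[dtype]
    return sizes

# Assumed values: 20 cube cores, TILE_M = TILE_N = D = 128.
sizes = workspace_bytes(cube_num=20, tile_m=128, tile_n=128, d=128)
```

The leading `GetCubeNum()` dimension gives each cube core its own region, and the `2` gives each region two slots for the one-tile lookahead.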

Ownership edges:

stage 1 cube -> vec: `CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`
stage 1 vec -> stage 2 cube: `VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`
stage 2 cube -> stage 3 vec: `CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`

Stable schedule

Use one-tile lookahead:

for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        # stage 1: produce tile j = ni
    if ni > 0:
        # stage 2 + stage 3: consume tile j = ni - 1

This gives:

warmup: first iteration only produces
steady state: produce `j` while consuming `j - 1`
drain: final iteration only consumes the last delayed tile
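The warmup, steady-state, and drain phases can be traced with a small host-side simulation. This is only an illustrative sketch: `lookahead_schedule` is a hypothetical helper that records which tile each iteration produces and consumes, with no device work involved.

```python
def lookahead_schedule(tiles_n):
    """Return, per loop iteration, the (action, tile) pairs it performs."""
    events = []
    for ni in range(0, tiles_n + 1):
        step = []
        if ni < tiles_n:
            step.append(("produce", ni))      # stage 1 for tile j = ni
        if ni > 0:
            step.append(("consume", ni - 1))  # stage 2 + 3 for tile j = ni - 1
        events.append(step)
    return events

schedule = lookahead_schedule(3)
for ni, step in enumerate(schedule):
    print(ni, step)
```

With three tiles the loop runs four times: the first iteration only produces, the last only consumes, and every tile is produced exactly one iteration before it is consumed.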

Shared `L0C` rule

Reuse one physical `L0C` family across the two cube stages.

Why this is the stable a2 choice here:

stage 1 writes a full float `[TILE_M, TILE_N]` score tile
stage 2 writes a full float `[TILE_M, D]` `pv_j` tile with the same validated `D == 128`
a2 only has 128 KB of `L0C`, so a second full float family would be a misleading design target

Stable ownership story:

keep one `l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)`
let stage 1 publish `score_ws` before stage 2 reuses that slot
let stage 2 publish `pv_ws` before the next stage-1 reuse
advance one shared `l0c_cnt`

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical `L0C` family is shared. Other stage-owned lifetimes stay separate.
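The capacity argument can be checked with a few lines of arithmetic. This sketch rests on assumptions that the text does not state outright: `TILE_M == TILE_N == 128` (inferred from the `[64, 1]` half-tile views and the validated `D == 128`), and a `DBuff` that holds two slots.

```python
L0C_BYTES = 128 * 1024   # a2 L0C capacity stated above
FLOAT_BYTES = 4

def dbuff_bytes(rows, cols, elem_bytes=FLOAT_BYTES, slots=2):
    """Bytes held by a double-buffered tile family (two slots assumed)."""
    return slots * rows * cols * elem_bytes

TILE_M = TILE_N = D = 128  # assumed tile sizes, see lead-in

shared_family = dbuff_bytes(TILE_M, TILE_N)
two_families = shared_family + dbuff_bytes(TILE_M, D)

print(shared_family, shared_family <= L0C_BYTES)  # one shared family
print(two_families, two_families <= L0C_BYTES)    # separate score and pv families
```

Under these assumptions a single double-buffered float family already fills the whole `L0C`, which is why the score and `pv_j` tiles take turns in the same slots.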

Counter layout

Keep these lifetimes separate:

`l1qk_cnt`: stage-1 `q`/`k` loads
`l1pv_cnt`: stage-2 `p`/`v` loads
`l0c_cnt`: shared physical `L0C` family across the two cube stages
`stage1_cnt`: delayed slot rhythm for `score_ws`, `p_ws`, and `expdiff`
`stage2_cnt`: delayed slot rhythm for `p_ws` consumption and `pv_ws`

Do not hide the delayed accumulator lifetime behind `stage1_cnt`.

Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

running row max: `[HALF_M, 1]`
delayed `expdiff` slots: `DBuff(DT.float, [HALF_M, 1], Position.UB)`
final numerator accumulation: `[HALF_M, D]`

Use `GetSubBlockIdx()` so each vec lane owns only its own `HALF_M` rows.

Critical scalar-state rule on a2

Do not copy `[HALF_M, 1]` scalar-format state with `ub_to_ub`.

Reason:

`ub_to_ub` infers burst length in units of `C0` blocks
for `[64, 1]` float views, that means copying 8 elements per row
this silently miscopies row-scalar state such as `prev_m`

Stable fix:

keep scalar state in `[HALF_M, 1]`
copy it with a vec binary op that respects the `[M, 1]` stride model, for example:

dup(ub_zero_s, 0.0)
add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)

Then update or transform that copied buffer with more vec ops.
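The size of the mismatch can be illustrated with a simple count. This is a deliberately simplified model, not a description of the actual `ub_to_ub` implementation: it assumes a 32-byte block (which matches the "8 elements per row" figure for float), and `copied_vs_wanted` is a hypothetical helper.

```python
BLOCK_BYTES = 32  # assumed block size; 32 / 4 bytes gives 8 floats per block

def elems_per_block(elem_bytes):
    return BLOCK_BYTES // elem_bytes

def copied_vs_wanted(rows, cols, elem_bytes):
    """Elements moved by a block-granular copy vs. elements in the logical view."""
    per_block = elems_per_block(elem_bytes)
    blocks_per_row = -(-cols // per_block)  # ceiling division
    moved = rows * blocks_per_row * per_block
    wanted = rows * cols
    return moved, wanted

# [64, 1] float row-scalar view, as in the rule above.
moved, wanted = copied_vs_wanted(rows=64, cols=1, elem_bytes=4)
print(moved, wanted)
```

In this model the block-granular copy touches eight times as many elements as the logical `[64, 1]` view contains, which is the kind of gap that a stride-aware vec op avoids.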

Delayed `expdiff` handling

`expdiff_j` belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

stage 1 copies `prev_m` into the delayed `expdiff` slot
stage 1 updates running max
stage 1 overwrites the delayed slot with `exp(prev_m - curr_m)`
stage 3 later reads that same slot and broadcasts it before scaling `accum`

Use `stage1_cnt` parity for the write slot and `stage2_cnt` parity for the read slot.
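The write-slot and read-slot parity rule can be traced for a single row on the host. This is an illustrative sketch only: `run_delayed_expdiff` is a hypothetical helper, a two-element Python list stands in for the `DBuff`, and the per-tile row maxima are made-up inputs.

```python
import math

def run_delayed_expdiff(rowmaxes):
    """Return the expdiff value stage 3 reads for each tile, in consume order."""
    expdiff_buf = [None, None]  # two-slot stand-in for the delayed DBuff
    running_max = -math.inf
    stage1_cnt = stage2_cnt = 0
    seen = []
    tiles_n = len(rowmaxes)
    for ni in range(tiles_n + 1):
        if ni < tiles_n:
            slot = stage1_cnt % 2                 # write slot from stage1_cnt parity
            expdiff_buf[slot] = running_max       # copy prev_m into the slot
            running_max = max(running_max, rowmaxes[ni])
            expdiff_buf[slot] = math.exp(expdiff_buf[slot] - running_max)
            stage1_cnt += 1
        if ni > 0:
            slot = stage2_cnt % 2                 # read slot from stage2_cnt parity
            seen.append(expdiff_buf[slot])
            stage2_cnt += 1
    return seen

seen = run_delayed_expdiff([1.0, 3.0, 2.0])
print(seen)
```

Because stage 1 runs one tile ahead, the slot it writes for tile `j + 1` is always the other parity from the slot stage 3 is about to read for tile `j`, so two slots are enough.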

Final vec accumulation

After loading `pv_j` back into UB:

`brcb` the delayed `expdiff` slot to `[HALF_M, 8]`
scale `accum[:, 0:64]`
scale `accum[:, 64:128]`
`add(accum, accum, pv_j)`

Why sliced scaling is required:

`accum` is wide (`[HALF_M, 128]`)
the `expdiff` broadcast is narrow (`[HALF_M, 8]`)
follow the same sliced-row rule used for row-max subtraction
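The broadcast-then-slice sequence above can be modeled on nested lists. This is a loose sketch rather than the vec instruction semantics: `brcb_model`, `scale_slice`, and `accumulate` are hypothetical stand-ins, and reusing the 8-wide broadcast row across each 64-column slice is an assumed model of how the narrow operand is applied.

```python
def brcb_model(col, width=8):
    """Stand-in for brcb: expand a [rows] scalar column to [rows, width]."""
    return [[v] * width for v in col]

def scale_slice(accum, bcast, start, stop):
    """Scale accum[:, start:stop] in place, reusing the narrow broadcast row."""
    for row, factors in zip(accum, bcast):
        for c in range(start, stop):
            row[c] *= factors[(c - start) % len(factors)]

def accumulate(accum, expdiff, pv, slice_width=64):
    """accum = accum * expdiff (row-wise, in slices) + pv."""
    bcast = brcb_model(expdiff)
    width = len(accum[0])
    for start in range(0, width, slice_width):
        scale_slice(accum, bcast, start, min(start + slice_width, width))
    for row, pv_row in zip(accum, pv):
        for c in range(width):
            row[c] += pv_row[c]
    return accum

# Two rows, D = 128, with made-up values.
accum = [[1.0] * 128, [2.0] * 128]
result = accumulate(accum, expdiff=[0.5, 0.25], pv=[[1.0] * 128, [1.0] * 128])
```

Each row ends up as `accum * expdiff + pv_j`, matching the last line of the target formula; the slicing changes only how the scale is applied, not the result.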

Validation target

Keep the first validated contract narrow:

D == 128
S1 % 128 == 0
S2 % 128 == 0
input `q`/`k`/`v` are `float16`
output is `float32`

Suggested cases:

(1, 1, 256, 512, 128)
(1, 3, 256, 512, 128)
(1, 3, 2048, 4096, 128)
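A small pre-flight check can keep test cases inside this narrow contract. This is a hedged sketch: `check_contract` is a hypothetical helper, and the tuple layout `(B, N, S1, S2, D)` is an assumption, since the case list does not name its fields.

```python
def check_contract(case, q_dtype="float16", out_dtype="float32"):
    """Return a list of contract violations for one (B, N, S1, S2, D) case."""
    _b, _n, s1, s2, d = case  # assumed field order, see lead-in
    problems = []
    if d != 128:
        problems.append("D must be 128")
    if s1 % 128 != 0:
        problems.append("S1 must be a multiple of 128")
    if s2 % 128 != 0:
        problems.append("S2 must be a multiple of 128")
    if q_dtype != "float16":
        problems.append("q/k/v must be float16")
    if out_dtype != "float32":
        problems.append("output must be float32")
    return problems

suggested = [(1, 1, 256, 512, 128), (1, 3, 256, 512, 128), (1, 3, 2048, 4096, 128)]
report = {case: check_contract(case) for case in suggested}
```

An empty list means the case is inside the validated contract; anything else names the constraint that would need new validation first.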

Files to study

agent/example/kernels/a2/flash_attn_score_iter.py
agent/example/kernels/a2/flash_attn_score_pv.py
agent/example/kernels/a2/flash_attn_unnorm.py
agent/references/patterns/a2-cube-vec.md
agent/references/patterns/a2-cube-vec-cube.md
agent/references/constraints/a2-device.md
agent/references/constraints/vec-reduction-a2.md
agent/references/constraints/vec-stride.md

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
