news 2026/5/9 15:18:52

CANNBot Skills: A2 Triple Bridge Pattern

张小明

Front-end developer

a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

[Free download link] cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides reusable Skills modules for it. Project URL: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (easyasc.a2, device b3) kernel with:

  • one cube stage that produces a score tile
  • vec logic that updates running row state and emits a delayed cube input
  • a later cube stage that consumes that delayed tile
  • a final vec stage that accumulates the delayed cube output

Typical target formula:

  • score_j = q.float() @ k_j.float().t() * scale
  • curr_m = maximum(prev_m, rowmax(score_j))
  • expdiff_j = exp(prev_m - curr_m)
  • p_j = exp(score_j - curr_m).half()
  • pv_j = p_j.float() @ v_j.float()
  • out = out * expdiff_j + pv_j
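The recurrence above can be checked on the host before writing any kernel code. The following NumPy sketch is not taken from the skill files: the function name `delayed_numerator_reference`, the `tile_n` parameter, and the choice of -inf as the initial running max are assumptions made for illustration.

```python
import numpy as np

def delayed_numerator_reference(q, k, v, scale, tile_n=128):
    """Host-side reference: running row max plus rescaled numerator, no row sum."""
    q32 = q.astype(np.float32)
    run_m = np.full((q.shape[0], 1), -np.inf, dtype=np.float32)  # assumed initial state
    out = np.zeros((q.shape[0], v.shape[1]), dtype=np.float32)
    for start in range(0, k.shape[0], tile_n):
        k_j = k[start:start + tile_n].astype(np.float32)
        v_j = v[start:start + tile_n].astype(np.float32)
        score = (q32 @ k_j.T) * np.float32(scale)                 # score_j
        curr_m = np.maximum(run_m, score.max(axis=1, keepdims=True))
        expdiff = np.exp(run_m - curr_m)                          # exp(prev_m - curr_m)
        p = np.exp(score - curr_m).astype(np.float16)             # p_j is stored as half
        pv = p.astype(np.float32) @ v_j                           # pv_j
        out = out * expdiff + pv
        run_m = curr_m
    return out, run_m
```

Up to the half-precision rounding of p_j, the result equals exp(score - final row max) @ v. It is never divided by a row sum.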

This is not normalized online softmax. It keeps running max and a rescaled numerator only. There is no running sum or final divide. If you need running row_sum and a final out / row_sum, switch to agent/references/patterns/a2-cube-vec-cube-vec-softmax.md.

Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

  • cube -> vec cannot use l0c_to_ub
  • vec -> cube cannot use ub_to_l1_*
  • the delayed cube output must return to vec for the final accumulation

So the stable data path is:

GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score) -> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

Workspaces and ownership edges

Use three GM workspaces:

  1. score_ws

    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: L0C(score) -> UB(score)
  2. p_ws

    • dtype: half
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: UB(p_j) -> L1(p_j)
  3. pv_ws

    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, D]
    • purpose: L0C(pv_j) -> UB(pv_j)
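The workspace shapes translate directly into GM allocation sizes. A minimal sketch of that arithmetic follows. It assumes 4-byte float and 2-byte half, reads the second dimension as two slots per cube core, and uses a hypothetical core count of 24 with TILE_M = TILE_N = D = 128; none of these concrete values come from the skill files.

```python
DTYPE_BYTES = {"float": 4, "half": 2}

def workspace_bytes(cube_num, tile_m, tile_n, d):
    """Return the byte size of each GM workspace for the shapes listed above."""
    slots = 2  # second dimension of every workspace shape
    return {
        "score_ws": cube_num * slots * tile_m * tile_n * DTYPE_BYTES["float"],
        "p_ws": cube_num * slots * tile_m * tile_n * DTYPE_BYTES["half"],
        "pv_ws": cube_num * slots * tile_m * d * DTYPE_BYTES["float"],
    }

# cube_num=24 is a placeholder; the kernel would use GetCubeNum().
sizes = workspace_bytes(cube_num=24, tile_m=128, tile_n=128, d=128)
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 1024 / 1024:.2f} MiB")
```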

Ownership edges:

  • stage 1 cube -> vec: CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
  • stage 1 vec -> stage 2 cube: VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
  • stage 2 cube -> stage 3 vec: CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

Stable schedule

Use one-tile lookahead:

for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        # stage 1: produce tile j = ni
        pass
    if ni > 0:
        # stage 2 + stage 3: consume tile j = ni - 1
        pass

This gives:

  • warmup: first iteration only produces
  • steady state: produce j while consuming j - 1
  • drain: final iteration only consumes the last delayed tile
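The warmup, steady-state, and drain phases can be made concrete by enumerating what each iteration does. The helper below is a plain-Python illustration; the name `lookahead_schedule` is hypothetical and it does not touch any device API.

```python
def lookahead_schedule(tiles_n):
    """Return (ni, produced_tile, consumed_tile) for each loop iteration."""
    steps = []
    for ni in range(0, tiles_n + 1):
        produced = ni if ni < tiles_n else None   # stage 1
        consumed = ni - 1 if ni > 0 else None     # stage 2 + stage 3
        steps.append((ni, produced, consumed))
    return steps

for ni, produced, consumed in lookahead_schedule(3):
    print(f"ni={ni}: produce={produced}, consume={consumed}")
```

With three tiles the loop runs four times: the first iteration only produces tile 0, the last only consumes tile 2.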

Shared L0C rule

Reuse one physical L0C family across the two cube stages.

Why this is the stable a2 choice here:

  • stage 1 writes a full float [TILE_M, TILE_N] score tile
  • stage 2 writes a full float [TILE_M, D] pv_j tile with the same validated D == 128
  • a2 has only 128 KB of L0C, so a second full float family would be a misleading design target
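The capacity argument is simple arithmetic. The sketch below assumes TILE_M = TILE_N = 128 (consistent with HALF_M = 64 and D == 128 elsewhere in this note, but not stated explicitly) and assumes a DBuff holds two buffers.

```python
L0C_BYTES = 128 * 1024  # a2 L0C capacity quoted above

def l0c_family_bytes(tile_m, tile_n, buffers=2, elem_bytes=4):
    """Bytes used by one double-buffered float tile family in L0C."""
    return buffers * tile_m * tile_n * elem_bytes

one_family = l0c_family_bytes(128, 128)       # assumed tile sizes
fits_one = one_family <= L0C_BYTES
fits_two = 2 * one_family <= L0C_BYTES
print(one_family, fits_one, fits_two)
```

Under these assumptions one family already fills the whole 128 KB, which is why the two cube stages have to share it.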

Stable ownership story:

  • keep one l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)
  • let stage 1 publish score_ws before stage 2 reuses that slot
  • let stage 2 publish pv_ws before the next stage-1 reuse
  • advance one shared l0c_cnt

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical L0C family is shared. Other stage-owned lifetimes stay separate.

Counter layout

Keep these lifetimes separate:

  • l1qk_cnt: stage-1 q/k loads
  • l1pv_cnt: stage-2 p/v loads
  • l0c_cnt: shared physical L0C family across the two cube stages
  • stage1_cnt: delayed slot rhythm for score_ws, p_ws, and expdiff
  • stage2_cnt: delayed slot rhythm for p_ws consumption and pv_ws

Do not hide the delayed accumulator lifetime behind stage1_cnt.
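To see how these counters interact, the following host-side trace walks the one-tile lookahead loop and records the slot each stage would pick. It assumes every slot is chosen as counter % 2 and that each counter advances once per use; the real kernels may advance them differently, so treat this as a sketch. The L1 load counters are omitted.

```python
def trace_counters(tiles_n):
    """Log (stage, tile, l0c_slot, workspace_slot) under a counter % 2 slot rule."""
    l0c_cnt = stage1_cnt = stage2_cnt = 0
    log = []
    for ni in range(tiles_n + 1):
        if ni < tiles_n:   # stage 1 produces tile ni
            log.append(("stage1", ni, l0c_cnt % 2, stage1_cnt % 2))
            l0c_cnt += 1
            stage1_cnt += 1
        if ni > 0:         # stage 2 + stage 3 consume tile ni - 1
            log.append(("stage2", ni - 1, l0c_cnt % 2, stage2_cnt % 2))
            l0c_cnt += 1
            stage2_cnt += 1
    return log

for entry in trace_counters(3):
    print(entry)
```

In this trace the shared l0c_cnt alternates L0C slots across both cube stages, while stage2_cnt selects for tile j the same workspace slot that stage1_cnt selected when tile j was produced.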

Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

  • running row max: [HALF_M, 1]
  • delayed expdiff slots: DBuff(DT.float, [HALF_M, 1], Position.UB)
  • final numerator accumulation: [HALF_M, D]

Use GetSubBlockIdx() so each vec lane owns only its own HALF_M rows.
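As an illustration of the row split, the helper below maps a sub-block index to the rows it would own. It assumes two vec sub-blocks per cube core, HALF_M = TILE_M // 2, and a contiguous split; the name `owned_rows` is hypothetical, and the index stands in for the value GetSubBlockIdx() returns on device.

```python
def owned_rows(sub_block_idx, tile_m=128, sub_blocks=2):
    """Rows of a [TILE_M, ...] tile owned by one vec sub-block (assumed contiguous split)."""
    half_m = tile_m // sub_blocks
    start = sub_block_idx * half_m
    return range(start, start + half_m)

print(owned_rows(0))  # rows handled by sub-block 0
print(owned_rows(1))  # rows handled by sub-block 1
```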

Critical scalar-state rule on a2

Do not copy [HALF_M, 1] scalar-format state with ub_to_ub.

Reason:

  • ub_to_ub infers burst length in units of C0 blocks
  • for [64, 1] float views, that means copying 8 elements per row
  • this silently miscopies row-scalar state such as prev_m
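The 8-elements-per-row figure follows from block arithmetic. The sketch below assumes a 32-byte block, which matches 8 float32 elements but is not stated in this note, and compares the intended copy size with the block-granular one.

```python
def elements_per_block(elem_bytes, block_bytes=32):
    """Elements covered by one block; block_bytes=32 is an assumption."""
    return block_bytes // elem_bytes

def copy_footprint(rows, elem_bytes=4):
    """Compare the scalars you meant to copy with what a block-granular copy moves."""
    per_row = elements_per_block(elem_bytes)
    return {"intended": rows, "block_granular": rows * per_row}

print(copy_footprint(64))  # [64, 1] float view
```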

Stable fix:

  • keep scalar state in [HALF_M, 1]
  • copy it with a vec binary op that respects the [M,1] stride model, for example:

    dup(ub_zero_s, 0.0)
    add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)

Then update or transform that copied buffer with more vec ops.

Delayed expdiff handling

expdiff_j belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

  1. stage 1 copies prev_m into the delayed expdiff slot
  2. stage 1 updates running max
  3. stage 1 overwrites the delayed slot with exp(prev_m - curr_m)
  4. stage 3 later reads that same slot and broadcasts it before scaling accum

Use stage1_cnt parity for the write slot and stage2_cnt parity for the read slot.
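The order of steps 1 to 3 matters because the running max is updated in place. A host-side illustration using plain Python lists in place of UB buffers follows; the function name `stage1_update` and the list-based slots are assumptions made for this sketch.

```python
import math

def stage1_update(run_m, tile_rowmax, expdiff_slots, stage1_cnt):
    """Apply steps 1-3 for one tile; returns the slot that was written."""
    slot = stage1_cnt % 2
    # Step 1: save prev_m in the delayed slot before run_m is overwritten.
    expdiff_slots[slot] = list(run_m)
    # Step 2: update the running max in place.
    for r, m in enumerate(tile_rowmax):
        run_m[r] = max(run_m[r], m)
    # Step 3: overwrite the slot with exp(prev_m - curr_m).
    expdiff_slots[slot] = [math.exp(p - c) for p, c in zip(expdiff_slots[slot], run_m)]
    return slot

run_m = [0.0, 2.0]
slots = [None, None]
written = stage1_update(run_m, tile_rowmax=[1.0, 1.0], expdiff_slots=slots, stage1_cnt=3)
print(written, slots[written], run_m)
```

Stage 3 would later read `expdiff_slots[stage2_cnt % 2]` for the same tile, long after `run_m` has moved on.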

Final vec accumulation

After loading pv_j back into UB:

  1. brcb the delayed expdiff slot to [HALF_M, 8]
  2. scale accum[:, 0:64]
  3. scale accum[:, 64:128]
  4. add(accum, accum, pv_j)

Why sliced scaling is required:

  • accum is wide ([HALF_M, 128])
  • expdiff broadcast is narrow ([HALF_M, 8])
  • follow the same sliced-row rule used for row-max subtraction
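The four accumulation steps can be emulated in NumPy to confirm that sliced scaling matches the formula out = out * expdiff_j + pv_j. This is a host-side emulation only: `brcb` here is a NumPy stand-in for the vec instruction, and reusing the [HALF_M, 8] block across each 64-column slice is modelled with `np.tile`, which is an assumption about how the narrow operand is applied.

```python
import numpy as np

def brcb(col, width=8):
    """Emulate brcb: replicate each row scalar across one 8-element block."""
    return np.repeat(col, width, axis=1)          # [HALF_M, 1] -> [HALF_M, 8]

def accumulate(accum, expdiff, pv, slice_cols=64):
    """Scale accum slice by slice with the broadcast expdiff, then add pv."""
    bcast = brcb(expdiff)
    reps = slice_cols // bcast.shape[1]
    for c0 in range(0, accum.shape[1], slice_cols):
        accum[:, c0:c0 + slice_cols] *= np.tile(bcast, (1, reps))
    accum += pv
    return accum

half_m, d = 64, 128
accum = np.full((half_m, d), 2.0, dtype=np.float32)
expdiff = np.linspace(0.1, 1.0, half_m, dtype=np.float32).reshape(half_m, 1)
pv = np.ones((half_m, d), dtype=np.float32)
result = accumulate(accum.copy(), expdiff, pv)
```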

Validation target

Keep the first validated contract narrow:

  • D == 128
  • S1 % 128 == 0
  • S2 % 128 == 0
  • inputs q/k/v are float16
  • output is float32

Suggested cases:

  1. (1, 1, 256, 512, 128)
  2. (1, 3, 256, 512, 128)
  3. (1, 3, 2048, 4096, 128)
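A small host-side guard can reject shapes outside this contract before launching the kernel. The sketch assumes the case tuples are ordered (B, N, S1, S2, D), which the note does not spell out; `check_contract` is a hypothetical helper.

```python
SUGGESTED_CASES = [
    (1, 1, 256, 512, 128),
    (1, 3, 256, 512, 128),
    (1, 3, 2048, 4096, 128),
]

def check_contract(shape):
    """Return a list of contract violations; an empty list means the shape is accepted."""
    _b, _n, s1, s2, d = shape  # assumed order: (B, N, S1, S2, D)
    problems = []
    if d != 128:
        problems.append(f"D must be 128, got {d}")
    if s1 % 128 != 0:
        problems.append(f"S1 must be a multiple of 128, got {s1}")
    if s2 % 128 != 0:
        problems.append(f"S2 must be a multiple of 128, got {s2}")
    return problems

for case in SUGGESTED_CASES:
    print(case, check_contract(case) or "ok")
```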

Files to study

  • agent/example/kernels/a2/flash_attn_score_iter.py
  • agent/example/kernels/a2/flash_attn_score_pv.py
  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/references/patterns/a2-cube-vec.md
  • agent/references/patterns/a2-cube-vec-cube.md
  • agent/references/constraints/a2-device.md
  • agent/references/constraints/vec-reduction-a2.md
  • agent/references/constraints/vec-stride.md

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
