news 2026/5/26 23:04:56

AI Fundamentals, Part 13: The Transformer Attention Mechanism

张小明

Front-end Developer

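Attention Mechanism

The core of the Transformer is self-attention and multi-head attention. They let every position in a sequence attend dynamically to relevant information anywhere in the sequence, and they capture long-range dependencies in parallel.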

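Self-attention formula

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Multi-head attention formula

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$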

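Computation steps

Following the description in the paper:

1. Multiply Q by the transpose of K

2. Scale the result

3. Apply a mask (optional)

4. Softmax (normalized exponential function)

5. Multiply by V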
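The five steps above can be sketched in plain C++ without libtorch, which makes the arithmetic easy to trace by hand. This is a minimal sketch, not the article's implementation: the `Matrix` alias and the function name `scaled_dot_product_attention` are made up for illustration, inputs are assumed to be non-empty and rectangular, and the optional mask is assumed to be additive.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Scaled dot-product attention for a single sequence (hypothetical helper).
// q, k: [seq, d_k]; v: [seq, d_v]; mask (optional): [seq, seq], added to the scores.
Matrix scaled_dot_product_attention(const Matrix& q, const Matrix& k, const Matrix& v,
                                    const Matrix* mask = nullptr) {
    Matrix out;
    if (q.empty() || k.empty() || v.empty()) return out;

    const std::size_t dk = q[0].size();
    const std::size_t dv = v[0].size();
    const double scale = 1.0 / std::sqrt(static_cast<double>(dk));
    out.assign(q.size(), std::vector<double>(dv, 0.0));

    for (std::size_t i = 0; i < q.size(); ++i) {
        std::vector<double> score(k.size(), 0.0);
        for (std::size_t j = 0; j < k.size(); ++j) {
            for (std::size_t d = 0; d < dk; ++d) {
                score[j] += q[i][d] * k[j][d];       // step 1: row i of Q times column j of K^T
            }
            score[j] *= scale;                       // step 2: scale by 1 / sqrt(d_k)
            if (mask) score[j] += (*mask)[i][j];     // step 3: optional additive mask
        }
        // step 4: softmax over row i (the row maximum is subtracted for numerical stability)
        const double mx = *std::max_element(score.begin(), score.end());
        double sum = 0.0;
        for (double& s : score) { s = std::exp(s - mx); sum += s; }
        for (double& s : score) s /= sum;
        // step 5: weighted sum of the rows of V
        for (std::size_t j = 0; j < k.size(); ++j) {
            for (std::size_t d = 0; d < dv; ++d) {
                out[i][d] += score[j] * v[j][d];
            }
        }
    }
    return out;
}
```

A quick sanity check: when every query is zero, all scores are equal, so each output row is simply the average of the rows of V.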

To keep things simple, a concrete example is used to walk through the Q, K, V computation.

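Example

Self-attention

Implementation

The example sentence "Welcome to Machine Learning Pad Pad" passes through word embedding and positional encoding, which gives a 6x4 matrix. To make the arithmetic easy to follow, the values of this matrix are set by hand.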

    /*
      Vocabulary: {"Pad", 0}, {"Welcome", 1}, {"to", 2}, {"Machine", 3}, {"Learning", 4}
      1. Welcome to Machine Learning Pad Pad --> [1, 2, 3, 4, 0, 0]
      2. Embedding + PositionalEncoding -> x
      3. x: [1, 6, 4], i.e. a batch holding one [6, 4] matrix
    */
    auto x = torch::tensor({
        {
            {1.0, 0.0, 0.0, 0.0},  // Welcome
            {2.0, 0.0, 0.0, 0.0},  // to
            {3.0, 0.0, 0.0, 0.0},  // Machine
            {4.0, 0.0, 0.0, 0.0},  // Learning
            {0.0, 0.0, 0.0, 0.0},  // Pad
            {0.0, 0.0, 0.0, 0.0}   // Pad
        }
    }, torch::kFloat);

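Q, K, and V are weight matrices whose size matches the embedding dimension. To make the arithmetic easy, all three are set to the identity matrix.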

    class SelfAttention : public torch::nn::Module {
    public:
        SelfAttention() {}

        void InitQKV(int64_t dim) {
            auto linear = torch::nn::LinearOptions(dim, dim).bias(false);
            Q = register_module("q", torch::nn::Linear(linear));
            K = register_module("k", torch::nn::Linear(linear));
            V = register_module("v", torch::nn::Linear(linear));
            norm_fact = 1.0 / sqrt(dim);    // scaling factor 1 / sqrt(d_k)

            auto onesw = torch::eye(dim);   // identity matrix
            Q->weight.set_data(onesw);
            K->weight.set_data(onesw);
            V->weight.set_data(onesw);
        }

        torch::nn::Linear Q{ nullptr };
        torch::nn::Linear K{ nullptr };
        torch::nn::Linear V{ nullptr };
        double norm_fact = 0;
    };

torch::nn::Transformer expects an input tensor of shape [seq, batch, dim]; here this is simplified to [seq, dim].

The computation steps are implemented following the paper:

    auto forward(torch::Tensor x, torch::Tensor mask = {}) {
        torch::Tensor q;
        torch::Tensor k;
        torch::Tensor v;
        torch::Tensor kt;
        torch::Tensor out;

        // x: [seq, dim]
        InitQKV(x.size(1));

        // 1. Project x with Q, K, V. The weights are identity matrices, so q = k = v = x.
        q = Q->forward(x);
        k = K->forward(x);
        v = V->forward(x);
        cout << "q k v \n" << q << endl;

        kt = k.transpose(0, 1);  // kt is the transpose of k: [dim, seq]
        cout << "kt \n" << kt << endl;

        // 2. q: [seq, dim] x kt: [dim, seq] -> [seq, seq]
        auto attn_score = torch::matmul(q, kt);
        cout << "q X kt \n" << attn_score << endl;

        // 3. Scale
        attn_score = attn_score * norm_fact;
        cout << "scale q.X.kt \n" << attn_score << endl;

        // Optional additive mask
        if (mask.defined()) {
            attn_score += mask;
        }

        // 4. Softmax over the last dimension
        attn_score = torch::softmax(attn_score, -1);
        cout << "torch::softmax q.X.kt \n" << attn_score << endl;

        // 5. Multiply by V: [seq, seq] x v: [seq, dim] -> [seq, dim]
        out = torch::matmul(attn_score, v);
        cout << "torch::matmul V \n" << out << endl;

        return out;
    }

Key points

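1. QK^T: because the weights are identity matrices, Q = K = V = x, so QK^T = x x^T

What this means in practice: entry (i, j) of QK^T is the dot product of token i's query with token j's key, a score for how strongly token i relates to token j.

Result of the matrix product for the example input (rows and columns in the order Welcome, to, Machine, Learning, Pad, Pad):

    [[1,  2,  3,  4, 0, 0],
     [2,  4,  6,  8, 0, 0],
     [3,  6,  9, 12, 0, 0],
     [4,  8, 12, 16, 0, 0],
     [0,  0,  0,  0, 0, 0],
     [0,  0,  0,  0, 0, 0]]

Every token is combined with every other token, which is how each one can use its context to determine meaning. For a sequence of length N, the time complexity of the Transformer's self-attention is therefore O(N^2 · d), where d is the embedding dimension: quadratic in N.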
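The score matrix can be reproduced with a few lines of plain C++. This is an illustrative sketch: `attention_scores` and `example_input` are hypothetical names, and the code assumes the identity weights used in this article, so that Q = K = x.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Raw attention scores S = Q * K^T (hypothetical helper).
// S[i][j] is the dot product of query row i and key row j, so S has N * N entries.
Matrix attention_scores(const Matrix& q, const Matrix& k) {
    Matrix s(q.size(), std::vector<double>(k.size(), 0.0));
    for (std::size_t i = 0; i < q.size(); ++i) {
        for (std::size_t j = 0; j < k.size(); ++j) {
            for (std::size_t d = 0; d < q[i].size(); ++d) {
                s[i][j] += q[i][d] * k[j][d];
            }
        }
    }
    return s;
}

// The article's hand-set 6x4 input: Welcome, to, Machine, Learning, Pad, Pad.
Matrix example_input() {
    return {
        {1.0, 0.0, 0.0, 0.0},
        {2.0, 0.0, 0.0, 0.0},
        {3.0, 0.0, 0.0, 0.0},
        {4.0, 0.0, 0.0, 0.0},
        {0.0, 0.0, 0.0, 0.0},
        {0.0, 0.0, 0.0, 0.0},
    };
}
```

Calling `attention_scores(example_input(), example_input())` returns a 6x6 matrix. The three nested loops make the N * N * d cost visible.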

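4. Softmax (normalized exponential function)

Formula: for an input vector

$$z = (z_1, z_2, \ldots, z_n)$$

the Softmax output for the i-th element is

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

The values in each row sum to 1. Every score is mapped into the range from 0 to 1 and the ordering is preserved, but the mapping is exponential rather than a proportional rescaling of the original values.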
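A row-wise Softmax is short enough to write out directly. The sketch below is plain C++ with a hypothetical function name `softmax`. It subtracts the row maximum before exponentiating, a common numerical-stability measure that leaves the result unchanged and is not something the article itself discusses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax of one row of scores (hypothetical helper).
// Subtracting the maximum avoids overflow in exp() and does not change the result,
// because the common factor exp(-max) cancels between numerator and denominator.
std::vector<double> softmax(const std::vector<double>& z) {
    std::vector<double> out(z.size());
    if (z.empty()) return out;

    const double mx = *std::max_element(z.begin(), z.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        out[i] = std::exp(z[i] - mx);
        sum += out[i];
    }
    for (double& p : out) p /= sum;
    return out;
}
```

For the example input, the first row of scaled scores is [0.5, 1, 1.5, 2, 0, 0] (the raw scores [1, 2, 3, 4, 0, 0] times 1/sqrt(4)), and passing it to `softmax` gives the attention weights of "Welcome" over all six positions. Note that a row consisting only of negative infinity would produce NaN with this sketch.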

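5. Multiply by V

Each output row is a weighted sum of the rows of V, using the Softmax weights from step 4, so q, k, and v are now all tied together.

When the input tensor must have shape [seq, batch, dim], the procedure is the same; the tensor only needs to be rearranged first.

Matrix multiplication of higher-dimensional tensors

Formula: a[..., M, N] * b[..., N, K] = [..., M, K]. The last two dimensions behave exactly like an ordinary 2-D matrix product, and the leading dimensions act as batch dimensions.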
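The shape rule can be written down as a small checker. This is a sketch only: `batched_matmul_shape` is a hypothetical helper, and it assumes both operands have the same rank with identical leading dimensions. Broadcasting of the leading dimensions, which `torch::matmul` also supports, is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Result shape of a[..., M, N] * b[..., N, K] -> [..., M, K] (hypothetical helper).
// Assumes equal rank (>= 2) and identical leading (batch) dimensions; no broadcasting.
Shape batched_matmul_shape(const Shape& a, const Shape& b) {
    if (a.size() < 2 || a.size() != b.size()) {
        throw std::invalid_argument("operands must have the same rank, at least 2");
    }
    const std::size_t r = a.size();
    for (std::size_t i = 0; i + 2 < r; ++i) {
        if (a[i] != b[i]) throw std::invalid_argument("batch dimensions differ");
    }
    if (a[r - 1] != b[r - 2]) {
        throw std::invalid_argument("inner dimensions differ");
    }
    Shape out(a.begin(), a.end() - 1);  // [..., M]
    out.push_back(b[r - 1]);            // [..., M, K]
    return out;
}
```

For instance, a query of shape [B, H, S, Dk] multiplied by a transposed key of shape [B, H, Dk, S] gives scores of shape [B, H, S, S].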

The tidied-up code below supports both 2-D and 3-D input tensors:

    auto forward(torch::Tensor x, torch::Tensor mask = {}) {
        torch::Tensor q;
        torch::Tensor k;
        torch::Tensor v;
        torch::Tensor kt;
        torch::Tensor out;

        auto dim = x.dim();
        if (dim == 3) {
            // x: [batch, seq, dim] -> [seq, batch, dim]
            x = x.permute({ 1, 0, 2 });
            InitQKV(x.size(2));
        } else {
            // x: [seq, dim]
            InitQKV(x.size(1));
        }

        // 1. Project x with Q, K, V. The weights are identity matrices, so q = k = v = x.
        q = Q->forward(x);
        k = K->forward(x);
        v = V->forward(x);
        cout << "q k v \n" << q << endl;

        if (dim == 3) {
            // Move batch to the front so that matmul batches over it.
            // Without permuting q as well, the product only works when batch == 1.
            q  = q.permute({ 1, 0, 2 });  // [seq, batch, dim] -> [batch, seq, dim]
            kt = k.permute({ 1, 2, 0 });  // [seq, batch, dim] -> [batch, dim, seq]
            v  = v.permute({ 1, 0, 2 });  // [seq, batch, dim] -> [batch, seq, dim]
        } else {
            kt = k.transpose(0, 1);       // kt is the transpose of k: [dim, seq]
        }
        cout << "kt \n" << kt << endl;

        // 2. q x kt -> [..., seq, seq]
        auto attn_score = torch::matmul(q, kt);
        cout << "q X kt \n" << attn_score << endl;

        // 3. Scale
        attn_score = attn_score * norm_fact;
        cout << "scale q.X.kt \n" << attn_score << endl;

        // Optional additive mask
        if (mask.defined()) {
            attn_score += mask;
        }

        // 4. Softmax over the last dimension
        attn_score = torch::softmax(attn_score, -1);
        cout << "torch::softmax q.X.kt \n" << attn_score << endl;

        // 5. (q x kt) x V
        out = torch::matmul(attn_score, v);
        if (dim == 3) {
            out = out.permute({ 1, 0, 2 });  // [batch, seq, dim] -> [seq, batch, dim]
        }
        cout << "torch::matmul V \n" << out << endl;

        return out;
    }

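Multi-head attention

Multi-head attention builds on self-attention and adds three things:

1. The embedding dimension is split into several parts (heads), and each part gets its own Q, K, V computation

2. The parts are concatenated back together

3. An output projection is applied at the end

For a 2-D input tensor, a function forward is written as follows:

1. The input tensor x has shape [seq, dim], and q, k, v have shape [seq, dim]

2. Split q, k, v into [H, S, Dk], where S is short for seq, H is the number of heads, and Dk = dim / H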

    q = q.view({ seq, H, Dk });   // q: [seq, dim] -> [S, H, Dk]
    k = k.view({ seq, H, Dk });
    v = v.view({ seq, H, Dk });

    q = q.permute({ 1, 0, 2 });   // [S, H, Dk] -> [H, S, Dk]
    k = k.permute({ 1, 0, 2 });
    v = v.permute({ 1, 0, 2 });
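What the view followed by the permute does to individual elements is easier to see with explicit indices. The sketch below is plain C++ with hypothetical names (`split_heads`, `Tensor3`): head h receives columns h*Dk to (h+1)*Dk - 1 of every row.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Tensor3 = std::vector<Matrix>;

// [S, dim] -> [H, S, Dk] (hypothetical helper mirroring view + permute):
// out[h][s][d] = x[s][h * Dk + d]
Tensor3 split_heads(const Matrix& x, std::size_t heads) {
    const std::size_t seq = x.size();
    const std::size_t dim = x.empty() ? 0 : x[0].size();
    if (heads == 0 || dim % heads != 0) {
        throw std::invalid_argument("dim must be divisible by the number of heads");
    }
    const std::size_t dk = dim / heads;

    Tensor3 out(heads, Matrix(seq, std::vector<double>(dk, 0.0)));
    for (std::size_t h = 0; h < heads; ++h) {
        for (std::size_t s = 0; s < seq; ++s) {
            for (std::size_t d = 0; d < dk; ++d) {
                out[h][s][d] = x[s][h * dk + d];
            }
        }
    }
    return out;
}
```

With two heads and dim = 4, head 0 sees the first two columns of each token and head 1 sees the last two, so each head runs attention over its own Dk-wide slice.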

3. Transpose k to shape [H, Dk, S]

    auto kt = k.permute({ 0, 2, 1 });  // kt: [H, S, Dk] -> [H, Dk, S]

4. After multiplying by V the output has shape [H, S, Dk], which has to be converted to [S, H, Dk]

    auto out = torch::matmul(attn_score, v);  // [H, S, S] * [H, S, Dk] -> out: [H, S, Dk]

5. Concatenate [S, H, Dk] into [seq, dim], then apply the output projection

    out = out.transpose(1, 0).contiguous().view({ seq, dim });  // [H, S, Dk] -> [S, H, Dk] -> [seq, dim]
    cout << "torch::matmul QK * V \n" << out.squeeze() << endl;
    out = Wo->forward(out);
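The concatenation step is the exact inverse of the head split. A minimal index-level sketch in plain C++ follows; `merge_heads` and `Tensor3` are hypothetical names, all heads are assumed to have the same [S, Dk] shape, and the output projection is left out because the article sets it to the identity matrix.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Tensor3 = std::vector<Matrix>;

// [H, S, Dk] -> [S, H * Dk] (hypothetical helper mirroring transpose + view):
// out[s][h * Dk + d] = t[h][s][d]
Matrix merge_heads(const Tensor3& t) {
    Matrix out;
    if (t.empty() || t[0].empty()) return out;

    const std::size_t heads = t.size();
    const std::size_t seq = t[0].size();
    const std::size_t dk = t[0][0].size();

    out.assign(seq, std::vector<double>(heads * dk, 0.0));
    for (std::size_t h = 0; h < heads; ++h) {
        for (std::size_t s = 0; s < seq; ++s) {
            for (std::size_t d = 0; d < dk; ++d) {
                out[s][h * dk + d] = t[h][s][d];
            }
        }
    }
    return out;
}
```

Each token's row ends up as head 0's Dk values followed by head 1's, and so on. This is why the libtorch code needs `contiguous()` before `view`: the transposed tensor has to be laid out in that order in memory first.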

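For a 3-D input tensor, a second function, forward2, does the same job. Apart from the different shape adjustments, everything else is identical; the details are best read from the code.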

    auto forward(torch::Tensor x, int64_t head = 2, torch::Tensor mask = {}) {
        // x: [1, seq, dim] -> [seq, dim]; out of place, so the caller's tensor is untouched
        x = x.squeeze(0);
        assert(x.dim() == 2);  // x: [seq, dim]

        InitQKV(x.size(1), head);

        auto seq = x.size(0);
        auto dim = x.size(1);

        auto q = Q->forward(x);
        auto k = K->forward(x);
        auto v = V->forward(x);

        q = q.view({ seq, H, Dk });   // q: [seq, dim] -> [S, H, Dk]
        k = k.view({ seq, H, Dk });
        v = v.view({ seq, H, Dk });

        q = q.permute({ 1, 0, 2 });   // [S, H, Dk] -> [H, S, Dk]
        k = k.permute({ 1, 0, 2 });
        v = v.permute({ 1, 0, 2 });
        cout << "q k v \n" << q << endl;

        auto kt = k.permute({ 0, 2, 1 });  // kt: [H, S, Dk] -> [H, Dk, S]
        cout << "kt \n" << kt << endl;

        auto attn_score = torch::matmul(q, kt);  // [H, S, Dk] * [H, Dk, S]
        cout << "q X kt \n" << attn_score << endl;

        attn_score = attn_score * norm_fact;
        cout << "scale q.X.kt \n" << attn_score << endl;

        if (mask.defined()) {
            attn_score += mask;
        }

        attn_score = torch::softmax(attn_score, -1);  // attn_score: [H, S, S]
        cout << "torch::softmax q.X.kt \n" << attn_score.squeeze() << endl;

        auto out = torch::matmul(attn_score, v);  // [H, S, S] * [H, S, Dk] -> out: [H, S, Dk]
        out = out.transpose(1, 0).contiguous().view({ seq, dim });  // [H, S, Dk] -> [S, H, Dk] -> [seq, dim]
        cout << "torch::matmul QK * V \n" << out.squeeze() << endl;

        out = Wo->forward(out);
        return out;
    }

    auto forward2(torch::Tensor x, int64_t head = 2, torch::Tensor mask = {}) {
        assert(x.dim() == 3);
        x = x.permute({ 1, 0, 2 });  // x: [batch, seq, dim] -> [seq, batch, dim]

        InitQKV(x.size(2), head);

        auto seq = x.size(0);
        auto batch = x.size(1);
        auto dim = x.size(2);

        auto q = Q->forward(x);
        auto k = K->forward(x);
        auto v = V->forward(x);

        q = q.view({ seq, batch, H, Dk });  // q: [seq, batch, dim] -> [S, B, H, Dk]
        k = k.view({ seq, batch, H, Dk });
        v = v.view({ seq, batch, H, Dk });

        q = q.permute({ 1, 2, 0, 3 });      // [S, B, H, Dk] -> [B, H, S, Dk]
        k = k.permute({ 1, 2, 0, 3 });
        v = v.permute({ 1, 2, 0, 3 });
        cout << "q k v \n" << q << endl;

        auto kt = k.permute({ 0, 1, 3, 2 });  // kt: [B, H, S, Dk] -> [B, H, Dk, S]
        cout << "kt \n" << kt.squeeze() << endl;

        auto attn_score = torch::matmul(q, kt);
        cout << "q X kt \n" << attn_score << endl;

        attn_score = attn_score * norm_fact;
        cout << "scale q.X.kt \n" << attn_score << endl;

        if (mask.defined()) {
            attn_score += mask;
        }

        attn_score = torch::softmax(attn_score, -1);  // attn_score: [B, H, S, S]
        cout << "torch::softmax q.X.kt \n" << attn_score << endl;

        auto out = torch::matmul(attn_score, v);  // [B, H, S, S] * [B, H, S, Dk] -> out: [B, H, S, Dk]
        // [B, H, S, Dk] -> [S, B, H, Dk] -> [seq, batch, dim].
        // Viewing [B, S, H, Dk] as [seq, batch, dim] would only be correct for batch == 1.
        out = out.permute({ 2, 0, 1, 3 }).contiguous().view({ seq, batch, dim });
        cout << "torch::matmul QK * V \n" << out << endl;

        out = Wo->forward(out);
        return out;
    }

Complete code

    #include <torch/torch.h>
    #include <cassert>
    #include <cmath>
    #include <iostream>

    using namespace std;

    // Placeholder kept from the original listing; not used in this article.
    class FeedForwardNet : public torch::nn::Module {
    };

    class SelfAttention : public torch::nn::Module {
    public:
        SelfAttention() {}

        // Registers Q, K, V. forward() calls this on every invocation, so each object
        // is meant to be used for a single forward pass; registering the same names
        // a second time is expected to fail.
        void InitQKV(int64_t dim) {
            auto linear = torch::nn::LinearOptions(dim, dim).bias(false);
            Q = register_module("q", torch::nn::Linear(linear));
            K = register_module("k", torch::nn::Linear(linear));
            V = register_module("v", torch::nn::Linear(linear));
            norm_fact = 1.0 / sqrt(dim);    // scaling factor 1 / sqrt(d_k)

            auto onesw = torch::eye(dim);   // identity matrix
            Q->weight.set_data(onesw);
            K->weight.set_data(onesw);
            V->weight.set_data(onesw);
        }

        auto forward(torch::Tensor x, torch::Tensor mask = {}) {
            torch::Tensor q;
            torch::Tensor k;
            torch::Tensor v;
            torch::Tensor kt;
            torch::Tensor out;

            auto dim = x.dim();
            if (dim == 3) {
                // x: [batch, seq, dim] -> [seq, batch, dim]
                x = x.permute({ 1, 0, 2 });
                InitQKV(x.size(2));
            } else {
                // x: [seq, dim]
                InitQKV(x.size(1));
            }

            // 1. Project x with Q, K, V. The weights are identity matrices, so q = k = v = x.
            q = Q->forward(x);
            k = K->forward(x);
            v = V->forward(x);
            cout << "q k v \n" << q << endl;

            if (dim == 3) {
                // Move batch to the front so that matmul batches over it.
                // Without permuting q as well, the product only works when batch == 1.
                q  = q.permute({ 1, 0, 2 });  // [seq, batch, dim] -> [batch, seq, dim]
                kt = k.permute({ 1, 2, 0 });  // [seq, batch, dim] -> [batch, dim, seq]
                v  = v.permute({ 1, 0, 2 });  // [seq, batch, dim] -> [batch, seq, dim]
            } else {
                kt = k.transpose(0, 1);       // kt is the transpose of k: [dim, seq]
            }
            cout << "kt \n" << kt << endl;

            // 2. q x kt -> [..., seq, seq]
            auto attn_score = torch::matmul(q, kt);
            cout << "q X kt \n" << attn_score << endl;

            // 3. Scale
            attn_score = attn_score * norm_fact;
            cout << "scale q.X.kt \n" << attn_score << endl;

            // Optional additive mask
            if (mask.defined()) {
                attn_score += mask;
            }

            // 4. Softmax over the last dimension
            attn_score = torch::softmax(attn_score, -1);
            cout << "torch::softmax q.X.kt \n" << attn_score << endl;

            // 5. (q x kt) x V
            out = torch::matmul(attn_score, v);
            if (dim == 3) {
                out = out.permute({ 1, 0, 2 });  // [batch, seq, dim] -> [seq, batch, dim]
            }
            cout << "torch::matmul V \n" << out << endl;

            return out;
        }

        torch::nn::Linear Q{ nullptr };
        torch::nn::Linear K{ nullptr };
        torch::nn::Linear V{ nullptr };
        double norm_fact = 0;
    };

    class MultiHeadAttention : public torch::nn::Module {
    public:
        void InitQKV(int64_t dim, int64_t head = 2) {
            assert(dim % head == 0);

            auto linear = torch::nn::LinearOptions(dim, dim).bias(false);
            Q = register_module("q", torch::nn::Linear(linear));
            K = register_module("k", torch::nn::Linear(linear));
            V = register_module("v", torch::nn::Linear(linear));
            Wo = register_module("Wo", torch::nn::Linear(linear));  // output projection

            Dk = dim / head;
            H = head;
            // Scale by sqrt(d_k), the per-head dimension, as in the paper's formula
            // (the original listing used sqrt(dim), which is the same only when head == 1).
            norm_fact = 1.0 / sqrt(Dk);

            auto onesw = torch::eye(dim);  // identity matrix
            Q->weight.set_data(onesw);
            K->weight.set_data(onesw);
            V->weight.set_data(onesw);
            Wo->weight.set_data(onesw);
        }

        auto forward(torch::Tensor x, int64_t head = 2, torch::Tensor mask = {}) {
            // x: [1, seq, dim] -> [seq, dim]; out of place, so the caller's tensor is untouched
            x = x.squeeze(0);
            assert(x.dim() == 2);  // x: [seq, dim]

            InitQKV(x.size(1), head);

            auto seq = x.size(0);
            auto dim = x.size(1);

            auto q = Q->forward(x);
            auto k = K->forward(x);
            auto v = V->forward(x);

            q = q.view({ seq, H, Dk });   // q: [seq, dim] -> [S, H, Dk]
            k = k.view({ seq, H, Dk });
            v = v.view({ seq, H, Dk });

            q = q.permute({ 1, 0, 2 });   // [S, H, Dk] -> [H, S, Dk]
            k = k.permute({ 1, 0, 2 });
            v = v.permute({ 1, 0, 2 });
            cout << "q k v \n" << q << endl;

            auto kt = k.permute({ 0, 2, 1 });  // kt: [H, S, Dk] -> [H, Dk, S]
            cout << "kt \n" << kt << endl;

            auto attn_score = torch::matmul(q, kt);  // [H, S, Dk] * [H, Dk, S]
            cout << "q X kt \n" << attn_score << endl;

            attn_score = attn_score * norm_fact;
            cout << "scale q.X.kt \n" << attn_score << endl;

            if (mask.defined()) {
                attn_score += mask;
            }

            attn_score = torch::softmax(attn_score, -1);  // attn_score: [H, S, S]
            cout << "torch::softmax q.X.kt \n" << attn_score.squeeze() << endl;

            auto out = torch::matmul(attn_score, v);  // [H, S, S] * [H, S, Dk] -> out: [H, S, Dk]
            out = out.transpose(1, 0).contiguous().view({ seq, dim });  // [H, S, Dk] -> [S, H, Dk] -> [seq, dim]
            cout << "torch::matmul QK * V \n" << out.squeeze() << endl;

            out = Wo->forward(out);
            return out;
        }

        auto forward2(torch::Tensor x, int64_t head = 2, torch::Tensor mask = {}) {
            assert(x.dim() == 3);
            x = x.permute({ 1, 0, 2 });  // x: [batch, seq, dim] -> [seq, batch, dim]

            InitQKV(x.size(2), head);

            auto seq = x.size(0);
            auto batch = x.size(1);
            auto dim = x.size(2);

            auto q = Q->forward(x);
            auto k = K->forward(x);
            auto v = V->forward(x);

            q = q.view({ seq, batch, H, Dk });  // q: [seq, batch, dim] -> [S, B, H, Dk]
            k = k.view({ seq, batch, H, Dk });
            v = v.view({ seq, batch, H, Dk });

            q = q.permute({ 1, 2, 0, 3 });      // [S, B, H, Dk] -> [B, H, S, Dk]
            k = k.permute({ 1, 2, 0, 3 });
            v = v.permute({ 1, 2, 0, 3 });
            cout << "q k v \n" << q << endl;

            auto kt = k.permute({ 0, 1, 3, 2 });  // kt: [B, H, S, Dk] -> [B, H, Dk, S]
            cout << "kt \n" << kt.squeeze() << endl;

            auto attn_score = torch::matmul(q, kt);
            cout << "q X kt \n" << attn_score << endl;

            attn_score = attn_score * norm_fact;
            cout << "scale q.X.kt \n" << attn_score << endl;

            if (mask.defined()) {
                attn_score += mask;
            }

            attn_score = torch::softmax(attn_score, -1);  // attn_score: [B, H, S, S]
            cout << "torch::softmax q.X.kt \n" << attn_score << endl;

            auto out = torch::matmul(attn_score, v);  // [B, H, S, S] * [B, H, S, Dk] -> out: [B, H, S, Dk]
            // [B, H, S, Dk] -> [S, B, H, Dk] -> [seq, batch, dim].
            // Viewing [B, S, H, Dk] as [seq, batch, dim] would only be correct for batch == 1.
            out = out.permute({ 2, 0, 1, 3 }).contiguous().view({ seq, batch, dim });
            cout << "torch::matmul QK * V \n" << out << endl;

            out = Wo->forward(out);
            return out;
        }

        torch::nn::Linear Q{ nullptr };
        torch::nn::Linear K{ nullptr };
        torch::nn::Linear V{ nullptr };
        torch::nn::Linear Wo{ nullptr };
        double norm_fact = 0;
        int64_t Dk = 0;
        int64_t H = 0;
    };

    void TransformerAttentionMain() {
        // x: [1, 6, 4], a batch holding the single example sentence
        auto x = torch::tensor({
            {
                {1.0, 0.0, 0.0, 0.0},  // Welcome
                {2.0, 0.0, 0.0, 0.0},  // to
                {3.0, 0.0, 0.0, 0.0},  // Machine
                {4.0, 0.0, 0.0, 0.0},  // Learning
                {0.0, 0.0, 0.0, 0.0},  // Pad
                {0.0, 0.0, 0.0, 0.0}   // Pad
            }
        }, torch::kFloat);

        // Not used by the demo below; kept from the original listing.
        auto w = torch::tensor({
            {
                {0.0, 1.0, 0.0, 0.0},
                {0.0, 2.0, 0.0, 0.0},
                {0.0, 3.0, 0.0, 0.0},
                {0.0, 4.0, 0.0, 0.0}
            }
        }, torch::kFloat);

        cout << "input\n" << x << endl;

        cout << "-------------SelfAttention--------------------\n" << endl;
        auto x1 = x.squeeze();  // [1, 6, 4] -> [6, 4]
        auto atten = SelfAttention();
        auto y = atten.forward(x1);
        cout << "-------------SelfAttention--------------------\n" << endl;

        cout << "\n\n-------------MultiHeadAttention--------------------\n" << endl;
        auto multiAtten = MultiHeadAttention();
        multiAtten.forward2(x, 1);
        cout << "-------------MultiHeadAttention--------------------\n" << endl;
    }

    int main() {
        TransformerAttentionMain();
        return 0;
    }

Thank you for your support. If you spot any problems, questions and corrections are welcome.
