DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Authors: Justin Albrethsen, Yash Datta, Kunal Kumar, Sharath Rajasekar

Deep-Dive Summary:
The following summarizes the relevant parts of the paper:

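DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Abstract

As large language model (LLM) capabilities have grown, safety guardrails have remained largely stateless, treating a multi-turn dialogue as a series of isolated events. This lack of temporal awareness creates a "Safety Gap": adversarial tactics such as Crescendo and ActorAttack slowly bleed malicious intent across turn boundaries and thereby bypass stateless filters. The paper introduces DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models tend to overlook.

The evaluation shows that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, reaching an F1 score of 0.84, a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). DeepContext also keeps inference overhead below 20 ms on a T4 GPU, which makes real-time use feasible.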

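1. Introduction

The widespread deployment of LLMs as agents has created demand for robust, real-time safety mechanisms. Early defenses successfully mitigated "single-shot" jailbreaks, but attackers have since developed "context fragmentation" strategies. These attacks exploit the autoregressive nature of LLMs by spreading malicious intent across a sequence of seemingly benign queries.

1.1 The Computational Bottleneck of Existing Defenses

Current multi-turn defenses typically rely on "conversation history concatenation": the entire dialogue history is fed again into a guardrail model with 7B+ parameters (such as Llama Guard). Because self-attention scales quadratically with sequence length, this approach does not scale to real-time applications, and it also struggles to capture temporal drift in multi-turn elicitation.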
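To make the scaling argument concrete, the following rough sketch compares the two strategies. It assumes every turn has the same number of tokens and counts only pairwise self-attention interactions (sequence length squared); the function names and token counts are hypothetical and not taken from the paper.

```python
def concat_attention_cost(num_turns: int, tokens_per_turn: int) -> int:
    """Total pairwise attention interactions when the full history is
    re-encoded at every turn (history concatenation)."""
    total = 0
    for t in range(1, num_turns + 1):
        context_len = t * tokens_per_turn  # history grows with each turn
        total += context_len ** 2          # self-attention is quadratic in length
    return total


def per_turn_attention_cost(num_turns: int, tokens_per_turn: int) -> int:
    """Total interactions when each turn is encoded once, in isolation,
    and earlier context is carried by a fixed-size recurrent state."""
    return num_turns * tokens_per_turn ** 2


if __name__ == "__main__":
    for turns in (1, 5, 10):
        print(turns,
              concat_attention_cost(turns, 100),
              per_turn_attention_cost(turns, 100))
```

Under these assumptions the concatenation cost grows cubically in the number of turns over a whole conversation, while the per-turn cost grows linearly. Real systems differ in many details, such as caching and model size, so this is only an order-of-growth illustration.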

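1.2 DeepContext: Statefulness Through Recurrent Intent Tracking

DeepContext shifts the paradigm from repeated large-model inference to a lean recurrent intent-tracking architecture:

Turn-level embedding extraction: a lightweight encoder produces a semantic embedding for each turn.
RNN-driven state estimation: a persistent hidden state evolves with the conversation and acts as a "contextual memory".
Identifying intent evolution: "intent drift" is detected by monitoring transitions of the hidden state.
High accuracy, low latency: 19 ms latency on a T4 GPU and an F1 score of 0.84.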

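2. Related Work

The adversarial landscape has shifted from single-turn optimized injections to complex, stateful strategies that exploit a model's reasoning ability and context window (for example RACE, Crescendo, and ActorAttack). Existing guardrails (such as Llama Guard 4 and Granite Guardian 3.3) lack temporal reasoning and therefore have a "contextual blind spot". DeepContext builds on the JavelinGuard architecture and fills this gap by tracking state with task-attention-weighted embeddings and an RNN.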

3. Methodology: Stateful Intent Tracking with Recurrent Latent Embeddings

3.1 Problem Formulation: Adversarial Accumulation

The conversation history is defined as $\mathcal{H}_{t} = \{(u_{1}, r_{1}), \ldots, u_{t}\}$. DeepContext recasts safety detection as a state-space problem in which a hidden intent state $h_t$ is updated recursively:

$$h_{t} = \mathrm{RNN}(h_{t-1}, e_{t}) \quad (1)$$

The final risk vector $R_t$ is the concatenation of the projected hidden state and the current task embedding:

$$\begin{array}{l} R_{t} = [\phi(h_{t}); e_{t}] \\ y_{t} = \mathrm{MLP}(R_{t}) \end{array} \quad (2)$$
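Equations (1) and (2) can be sketched in a few lines of dependency-free Python. This is a toy illustration, not the authors' implementation: a plain tanh cell stands in for $\mathrm{RNN}(\cdot)$, a tanh projection stands in for $\phi$, a single linear layer with a sigmoid stands in for the MLP, and all weights are random and untrained. Every function name and dimension below is an assumption made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_step(h_prev, e_t, W_h, W_e):
    """Equation (1), with a plain tanh cell standing in for RNN(.)."""
    pre = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_e, e_t))]
    return [math.tanh(v) for v in pre]


def risk_score(h_t, e_t, W_phi, w_out, b_out=0.0):
    """Equation (2): R_t = [phi(h_t); e_t], y_t = MLP(R_t)."""
    phi_h = [math.tanh(v) for v in matvec(W_phi, h_t)]      # assumed form of phi
    R_t = phi_h + e_t                                       # concatenation
    logit = sum(w * r for w, r in zip(w_out, R_t)) + b_out  # one-layer stand-in for the MLP
    return 1.0 / (1.0 + math.exp(-logit))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def track_conversation(embeddings, hidden_dim=4, proj_dim=3, seed=0):
    """Score every turn while carrying the hidden state h across turns.
    Weights are random (untrained), so the scores carry no meaning."""
    rng = random.Random(seed)
    embed_dim = len(embeddings[0])
    W_h = rand_matrix(hidden_dim, hidden_dim, rng)
    W_e = rand_matrix(hidden_dim, embed_dim, rng)
    W_phi = rand_matrix(proj_dim, hidden_dim, rng)
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(proj_dim + embed_dim)]
    h = [0.0] * hidden_dim  # initial state before the first turn
    scores = []
    for e_t in embeddings:
        h = rnn_step(h, e_t, W_h, W_e)
        scores.append(risk_score(h, e_t, W_phi, w_out))
    return scores


if __name__ == "__main__":
    toy_embeddings = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    print(track_conversation(toy_embeddings))
```

The point of the sketch is the data flow: each turn contributes one embedding, the hidden state is the only thing carried forward, and the score at turn $t$ depends on both the accumulated state and the current embedding.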

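3.2 Model Architecture: The DeepContext Pipeline

DeepContext is organized as a three-stage pipeline (Figure 1), built from the following components:

Task-attention-weighted encoder: a fine-tuned BERT encoder uses a task-attention mechanism to prioritize high-signal adversarial tokens.
GRU-based recurrent intent tracking: a Gated Recurrent Unit (GRU) is used as a computationally efficient choice for sequence modeling, reducing FLOPs and mitigating the vanishing-gradient problem.

Figure 1: DeepContext architecture. The pipeline has three stages: (1) turn-level embedding extraction with a fine-tuned BERT; (2) recurrent intent tracking with a GRU; (3) final scoring by a trajectory classifier with a residual connection.

The GRU state update is:
$$z_{t} = \sigma(W_{z} \cdot e_{t} + U_{z} \cdot h_{t-1} + b_{z}) \quad (4)$$
$$r_{t} = \sigma(W_{r} \cdot e_{t} + U_{r} \cdot h_{t-1} + b_{r}) \quad (5)$$
$$h_{t} = (1 - z_{t}) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} \quad (7)$$

Projection layer and residual shortcut: a hybrid residual architecture concatenates the raw embedding $e_t$ with the projected hidden state. This lets the system detect "slow-burn" elicitation attacks as well as instantaneous single-turn attacks.
Trajectory classifier: a multilayer perceptron (MLP) followed by a sigmoid outputs the probability of harm.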
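The GRU update in Equations (4), (5), and (7) can be sketched as follows. The summary does not reproduce the equation for the candidate state $\tilde{h}_t$, so the sketch assumes the standard GRU form $\tilde{h}_t = \tanh(W_h \cdot e_t + U_h \cdot (r_t \odot h_{t-1}) + b_h)$; that line, the parameter dictionary layout, and the helper `zero_params` are assumptions for illustration rather than details confirmed by the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def affine(W, e, U, h, b):
    """Compute W.e + U.h + b element by element."""
    return [we + uh + bi for we, uh, bi in zip(matvec(W, e), matvec(U, h), b)]


def gru_step(h_prev, e_t, p):
    """One GRU update of the hidden intent state."""
    z = [sigmoid(v) for v in affine(p["W_z"], e_t, p["U_z"], h_prev, p["b_z"])]  # Eq. (4)
    r = [sigmoid(v) for v in affine(p["W_r"], e_t, p["U_r"], h_prev, p["b_r"])]  # Eq. (5)
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state: standard GRU form, ASSUMED here (not given in the summary).
    h_tilde = [math.tanh(v) for v in affine(p["W_h"], e_t, p["U_h"], gated, p["b_h"])]
    # Eq. (7): interpolate between the old state and the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


def zero_params(hidden_dim, embed_dim):
    """Hypothetical helper: all-zero weights, handy for checking the gating by hand."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W_" + gate] = [[0.0] * embed_dim for _ in range(hidden_dim)]
        p["U_" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["b_" + gate] = [0.0] * hidden_dim
    return p


if __name__ == "__main__":
    print(gru_step([1.0, -2.0], [0.3], zero_params(2, 1)))
```

With all-zero parameters both gates evaluate to 0.5 and the candidate state is 0, so Equation (7) simply halves the previous state. That makes the interpolation role of $z_t$ easy to verify by hand before plugging in trained weights.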

3.2.5 Training and Dataset

The training corpus contains roughly 437,000 conversation sequences, of which 20% are malicious. Training runs for a single epoch to prevent overfitting and uses Focal Loss to handle class imbalance:

$$\mathcal{L} = -(1 - p_{t})^{\gamma} \log(p_{t}) \quad (10)$$
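A minimal sketch of Equation (10) for a single binary example follows. It uses the usual focal-loss convention that $p_t$ is the predicted probability of the true class. The summary does not state the value of $\gamma$, so the default of 2.0 below is only a commonly used setting and an assumption here, as is the function name.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Focal loss for one example.

    p     -- predicted probability of the positive (harmful) class
    y     -- ground-truth label, 1 for harmful and 0 for benign
    gamma -- focusing parameter (ASSUMED default; not given in the summary)
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    # An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.6).
    print(focal_loss(0.9, 1), focal_loss(0.6, 1))
    # With gamma = 0 the loss reduces to ordinary cross-entropy.
    print(focal_loss(0.9, 1, gamma=0.0), -math.log(0.9))
```

The factor $(1 - p_t)^{\gamma}$ shrinks the loss of confidently correct examples, so the abundant benign conversations contribute less to the gradient than the rarer, harder malicious ones.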

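4. Results and Evaluation

4.1 Evaluation Datasets

The benchmark set covers benign conversations (LMSYS, Anthropic HH-RLHF) and a range of adversarial attacks (HarmBench, Red Queen, Crescendo, and others). It includes 210 multi-turn jailbreak samples with a median conversation length of 7 turns.

4.2 Evaluation Baselines

The baselines include a lightweight encoder (Llama-Prompt-Guard-2), generative guardrails (Granite-Guardian, Llama-Guard-4), and cloud-hosted offerings (Azure Prompt Shield, AWS, GCP).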

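4.3 Multi-Turn Performance Analysis

DeepContext leads with an F1 score of 0.84, roughly 25% ahead of the runner-up in relative terms (0.84 versus 0.67).

Detection delay (MTTD): DeepContext identifies a threat after 4.24 turns on average, whereas cloud offerings such as Azure lag severely (8.00 turns).
Contextual blindness of large models: models such as Llama-Guard-4-12B perform poorly on long conversations because the adversarial signal is easily diluted by a lengthy benign preamble.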
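One plausible way to compute a detection-delay figure like MTTD is sketched below. The summary does not define the metric precisely, so this reads it as the mean index of the first turn at which a conversation is flagged. The decision threshold of 0.5, the choice to skip conversations that are never flagged, and the function names are all assumptions.

```python
from typing import List, Optional


def first_detection_turn(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based index of the first turn whose risk score reaches
    the threshold, or None if the conversation is never flagged."""
    for turn, score in enumerate(scores, start=1):
        if score >= threshold:
            return turn
    return None


def mean_first_flag_turn(conversations: List[List[float]],
                         threshold: float = 0.5) -> Optional[float]:
    """Average first-flag turn over the conversations that were detected.
    Undetected conversations are skipped (an ASSUMED convention)."""
    turns = [first_detection_turn(c, threshold) for c in conversations]
    detected = [t for t in turns if t is not None]
    return sum(detected) / len(detected) if detected else None


if __name__ == "__main__":
    per_turn_scores = [
        [0.1, 0.6, 0.9],       # flagged at turn 2
        [0.2, 0.3, 0.4, 0.7],  # flagged at turn 4
        [0.1, 0.1],            # never flagged, skipped
    ]
    print(mean_first_flag_turn(per_turn_scores))
```

A lower value means the monitor intervenes earlier in an escalating conversation. How missed detections are counted changes the number materially, so any comparison should use one convention for all models.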

Table 4: Multi-turn jailbreak detection performance (N = 1,010)

Model	F1 Score ↑	Recall	Precision	MTTD (turns)
DeepContext (this work)	0.84	0.83	0.86	4.24
Llama-Prompt-Guard-2	0.67	0.60	0.76	5.83
Granite-Guardian-3.3	0.67	0.57	0.83	5.03
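The F1 column can be sanity-checked from the precision and recall columns, since F1 is their harmonic mean. The short sketch below does that and also computes the relative gap between 0.84 and 0.67. The function names are hypothetical, and the inputs are the rounded table values, so a recomputed F1 may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    rows = {
        "DeepContext": (0.86, 0.83),
        "Llama-Prompt-Guard-2": (0.76, 0.60),
        "Granite-Guardian-3.3": (0.83, 0.57),
    }
    for name, (precision, recall) in rows.items():
        print(name, round(f1_score(precision, recall), 3))
    print("relative gain:", round(relative_gain(0.84, 0.67), 3))
```

The first two rows reproduce the reported 0.84 and 0.67. The Granite-Guardian row comes out near 0.676 from the rounded inputs, a small gap that is consistent with the table's precision and recall having been rounded to two decimals.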

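4.4 Single-Turn Jailbreak Performance

On the single-turn benchmark JailBreakBench, DeepContext again ranks first with an F1 score of 0.98, indicating that its residual architecture also handles single-turn attacks effectively.

4.5 Inference Latency and Computational Efficiency

On a T4 GPU, DeepContext's mean per-turn inference latency is just 19 ms, far lower than Granite-Guardian (125 ms) or cloud-hosted guardrails (for example AWS at 235 ms). This efficiency stems from its compact recurrent-state architecture, which avoids the compute growth caused by reprocessing the full conversation history.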

Original Abstract: While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a “Safety Gap” where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models.

PDF Link: 2602.16935v1