EACL 2026 LLM Safety Papers: A Curated List
Conference: EACL 2026 (19th Conference of the European Chapter of the Association for Computational Linguistics)
Dates: March 24-29, 2026
Location: Rabat, Morocco
Proceedings: ACL Anthology - EACL 2026
Compiled: April 16, 2026
1. Jailbreak Attacks
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball, Frauke Kreuter, Nina Panickssery | Long Papers |
| 2 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao, Zhe Li, Yige Li, Jun Sun | Findings |
| 3 | When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models | Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg | SRW |
2. Adversarial Attacks & Vulnerabilities
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | Teams of LLM Agents can Exploit Zero-Day Vulnerabilities | Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, Daniel Kang | Long Papers |
| 2 | VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy | Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo | Findings |
| 3 | Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents | Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao | Industry |
| 4 | Hacking Neural Evaluation Metrics with Single Hub Text | Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai | Short Papers |
3. Safety Defense & Alignment
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | Safety of Large Language Models Beyond English: A Systematic Literature Review of Risks, Biases, and Safeguards | Aleksandra Krasnodębska, Katarzyna Dziewulska, Karolina Seweryn, Maciej Chrabaszcz, Wojciech Kusa | Long Papers |
| 2 | The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs | Omar Mahmoud, Ali Khalil, Thommen George Karimpanal, Buddhika Laknath Semage, Santu Rana | Findings |
| 3 | CodeGuard: Improving LLM Guardrails in CS Education | Nishat Raihan, Noah Erdachew, Jayoti Devi, Joanna C. S. Santos, Marcos Zampieri | Findings |
| 4 | Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety | Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Paweł Walkowiak, Bartosz Żuk, Arkadiusz Janz | SRW |
| 5 | The Clinical Fingerprint: Comparing the Rhetorical Integrity and Epistemic Safety of Human Physicians and Large Language Models | Bayram Ayadi | SRW |
| 6 | Enhancing User Safety: Context-Aware Detection of Offensive Query-Ad Pairs in Multimodal Search Advertising | Gaurav Kumar, Qiangjian Xi, Tanmaya Shekhar Dabral, Hooshang Ghasemi, Abishek Krishnamoorthy, Danqing Fu, Rui Min, Emilio R. Antunez, Zhongli Ding, Pradyumna Narayana | Industry |
4. Harmful Content Detection & Moderation
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models’ Detection of Human Risky Health Behavior Content | Yunze Xiao, Tingyu He, Lionel Z. Wang, Yiming Ma, Xingyu Song, Xiaohang Xu, Mona T. Diab, Irene Li, Ka Chung Ng | Long Papers |
| 2 | Harmful Factuality: LLMs Correcting What They Shouldn’t | Mingchen Li, Hanzhi Zhang, Heng Fan, Junhua Ding, Yunhe Feng | Findings |
| 3 | Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs | Sewon Kim, Jiwon Kim, SeungWoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon | Findings |
| 4 | When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech | Priyansh Singhal, Piyush Joshi | Short Papers |
| 5 | To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination | Jing Ke, Zheyong Xie, Shaosheng Cao, Tong Xu, Enhong Chen | Short Papers |
| 6 | BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok | Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross M. Sonnenblick, Magdalayna Curry, Laura DAdamo, Lindsay Young, Stuart Murray, Kristina Lerman | Long Papers |
5. Privacy & Data Security
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | Auditing Language Model Unlearning via Information Decomposition | Anmol Goel, Alan Ritter, Iryna Gurevych | Long Papers |
| 2 | Detecting Training Data of Large Language Models via Expectation Maximization | Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, William Yang Wang | Long Papers |
| 3 | Personal Information Parroting in Language Models | Nishant Subramani, Kshitish Ghate, Mona T. Diab | Findings |
| 4 | The Model’s Language Matters: A Comparative Privacy Analysis of LLMs | Abhishek Kumar Mishra, Antoine Boutet, Lucas Magnana | Findings |
| 5 | Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs | Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Lionel Ni, Jian Guo | Findings |
| 6 | OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions | Yu-Shin Huang, Peter Just, Hanyun Yin, Krishna Narayanan, Ruihong Huang, Chao Tian | Long Papers |
6. Bias & Fairness
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | How Quantization Shapes Bias in Large Language Models | Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych | Long Papers |
| 2 | Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs | Zara Siddique, Irtaza Khalid, Liam Turner, Luis Espinosa-Anke | Findings |
| 3 | Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models | David Guzman Piedrahita, Irene Strauss, Rada Mihalcea, Zhijing Jin | Long Papers |
| 4 | Do Political Opinions Transfer Between Western Languages? | Franziska Weeber, Tanise Ceron, Sebastian Padó | Long Papers |
| 5 | Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models | Sumanth Manduru, Carlotta Domeniconi | SRW |
| 6 | Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors | Adnan Al Ali, Jindřich Helcl, Jindřich Libovický | SRW |
| 7 | SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context | Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev | Short Papers |
| 8 | MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment | Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Jolie Nguyen, Shivank Garg, Kevin Zhu, Vasu Sharma | Short Papers |
| 9 | Common Sense or Ableism? Rethinking Commonsense Reasoning Through the Lens of Disability | Karina H Halevy, Kimi Wenzel, Seyun Kim, Kyle Dean Bauer, Bruno Neira, Mona T. Diab, Maarten Sap | Short Papers |
| 10 | On the Interplay between Human Label Variation and Model Fairness | Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau | Findings |
7. Misinformation & Manipulation
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media | Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa Cotovio, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo | Long Papers |
| 2 | Entity-aware Cross-lingual Claim Detection for Automated Fact-checking | Rrubaa Panchendrarajan, Arkaitz Zubiaga | Findings |
| 3 | ART: Adaptive Reasoning Trees for Explainable Claim Verification | Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu | Findings |
| 4 | Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels | Yuki Kishi, Yuji Arima, Hitoshi Iyatomi | SRW |
| 5 | Tailoring Rumor Debunking to You: Diversifying Chinese Rumor-Debunking Passages with an LLM-Driven Simulated Feedback-Enhanced Framework | Xinle Pang, Danding Wang, Qiang Sheng, Yifan Sun, Beizhe Hu, Juan Cao | Industry |
8. Agent & Multi-Agent Safety
| # | Title | Authors | Track |
|---|---|---|---|
| 1 | MAPS: A Multilingual Benchmark for Agent Performance and Security | Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein | Findings |
| 2 | The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems | Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi | Industry |
| 3 | Don’t Trust Generative Agents to Mimic Communication on Social Networks | (see ACL Anthology) | Long Papers |
Summary Statistics
| Category | Papers |
|---|---|
| Jailbreak Attacks | 3 |
| Adversarial Attacks & Vulnerabilities | 4 |
| Safety Defense & Alignment | 6 |
| Harmful Content Detection & Moderation | 6 |
| Privacy & Data Security | 6 |
| Bias & Fairness | 10 |
| Misinformation & Manipulation | 5 |
| Agent & Multi-Agent Safety | 3 |
| Total | 43 |
Note: Some papers span multiple categories; each is listed under its primary research focus. Track labels: Long Papers = main conference long papers (Vol. 1), Short Papers = main conference short papers (Vol. 2), Findings = Findings of EACL, SRW = Student Research Workshop (Vol. 4), Industry = Industry Track (Vol. 5).