news 2026/4/16 21:24:11

EACL 2026: A Roundup of Papers on Large Language Model Safety

张小明

Front-End Development Engineer


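Conference: EACL 2026 (19th Conference of the European Chapter of the Association for Computational Linguistics)
Dates: March 24-29, 2026
Location: Rabat, Morocco
Proceedings: ACL Anthology - EACL 2026
Compiled on: April 16, 2026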


1. Jailbreak Attacks

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball, Frauke Kreuter, Nina Panickssery | Long Papers |
| 2 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao, Zhe Li, Yige Li, Jun Sun | Findings |
| 3 | When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models | Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg | SRW |

2. Adversarial Attacks & Vulnerabilities

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | Teams of LLM Agents can Exploit Zero-Day Vulnerabilities | Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, Daniel Kang | Long Papers |
| 2 | VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy | Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo | Findings |
| 3 | Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents | Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao | Industry |
| 4 | Hacking Neural Evaluation Metrics with Single Hub Text | Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai | Short Papers |

3. Safety Defense & Alignment

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | Safety of Large Language Models Beyond English: A Systematic Literature Review of Risks, Biases, and Safeguards | Aleksandra Krasnodębska, Katarzyna Dziewulska, Karolina Seweryn, Maciej Chrabaszcz, Wojciech Kusa | Long Papers |
| 2 | The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs | Omar Mahmoud, Ali Khalil, Thommen George Karimpanal, Buddhika Laknath Semage, Santu Rana | Findings |
| 3 | CodeGuard: Improving LLM Guardrails in CS Education | Nishat Raihan, Noah Erdachew, Jayoti Devi, Joanna C. S. Santos, Marcos Zampieri | Findings |
| 4 | Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety | Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Paweł Walkowiak, Bartosz Żuk, Arkadiusz Janz | SRW |
| 5 | The Clinical Fingerprint: Comparing the Rhetorical Integrity and Epistemic Safety of Human Physicians and Large Language Models | Bayram Ayadi | SRW |
| 6 | Enhancing User Safety: Context-Aware Detection of Offensive Query-Ad Pairs in Multimodal Search Advertising | Gaurav Kumar, Qiangjian Xi, Tanmaya Shekhar Dabral, Hooshang Ghasemi, Abishek Krishnamoorthy, Danqing Fu, Rui Min, Emilio R. Antunez, Zhongli Ding, Pradyumna Narayana | Industry |

4. Harmful Content Detection & Moderation

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models’ Detection of Human Risky Health Behavior Content | Yunze Xiao, Tingyu He, Lionel Z. Wang, Yiming Ma, Xingyu Song, Xiaohang Xu, Mona T. Diab, Irene Li, Ka Chung Ng | Long Papers |
| 2 | Harmful Factuality: LLMs Correcting What They Shouldn’t | Mingchen Li, Hanzhi Zhang, Heng Fan, Junhua Ding, Yunhe Feng | Findings |
| 3 | Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs | Sewon Kim, Jiwon Kim, SeungWoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon | Findings |
| 4 | When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech | Priyansh Singhal, Piyush Joshi | Short Papers |
| 5 | To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination | Jing Ke, Zheyong Xie, Shaosheng Cao, Tong Xu, Enhong Chen | Short Papers |
| 6 | BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok | Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross M. Sonnenblick, Magdalayna Curry, Laura DAdamo, Lindsay Young, Stuart Murray, Kristina Lerman | Long Papers |

5. Privacy & Data Security

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | Auditing Language Model Unlearning via Information Decomposition | Anmol Goel, Alan Ritter, Iryna Gurevych | Long Papers |
| 2 | Detecting Training Data of Large Language Models via Expectation Maximization | Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, William Yang Wang | Long Papers |
| 3 | Personal Information Parroting in Language Models | Nishant Subramani, Kshitish Ghate, Mona T. Diab | Findings |
| 4 | The Model’s Language Matters: A Comparative Privacy Analysis of LLMs | Abhishek Kumar Mishra, Antoine Boutet, Lucas Magnana | Findings |
| 5 | Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs | Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Lionel Ni, Jian Guo | Findings |
| 6 | OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions | Yu-Shin Huang, Peter Just, Hanyun Yin, Krishna Narayanan, Ruihong Huang, Chao Tian | Long Papers |

6. Bias & Fairness

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | How Quantization Shapes Bias in Large Language Models | Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych | Long Papers |
| 2 | Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs | Zara Siddique, Irtaza Khalid, Liam Turner, Luis Espinosa-Anke | Findings |
| 3 | Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models | David Guzman Piedrahita, Irene Strauss, Rada Mihalcea, Zhijing Jin | Long Papers |
| 4 | Do Political Opinions Transfer Between Western Languages? | Franziska Weeber, Tanise Ceron, Sebastian Padó | Long Papers |
| 5 | Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models | Sumanth Manduru, Carlotta Domeniconi | SRW |
| 6 | Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors | Adnan Al Ali, Jindřich Helcl, Jindřich Libovický | SRW |
| 7 | SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context | Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev | Short Papers |
| 8 | MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment | Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Jolie Nguyen, Shivank Garg, Kevin Zhu, Vasu Sharma | Short Papers |
| 9 | Common Sense or Ableism? Rethinking Commonsense Reasoning Through the Lens of Disability | Karina H Halevy, Kimi Wenzel, Seyun Kim, Kyle Dean Bauer, Bruno Neira, Mona T. Diab, Maarten Sap | Short Papers |
| 10 | On the Interplay between Human Label Variation and Model Fairness | Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau | Findings |

7. Misinformation & Manipulation

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media | Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa Cotovio, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo | Long Papers |
| 2 | Entity-aware Cross-lingual Claim Detection for Automated Fact-checking | Rrubaa Panchendrarajan, Arkaitz Zubiaga | Findings |
| 3 | ART: Adaptive Reasoning Trees for Explainable Claim Verification | Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu | Findings |
| 4 | Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels | Yuki Kishi, Yuji Arima, Hitoshi Iyatomi | SRW |
| 5 | Tailoring Rumor Debunking to You: Diversifying Chinese Rumor-Debunking Passages with an LLM-Driven Simulated Feedback-Enhanced Framework | Xinle Pang, Danding Wang, Qiang Sheng, Yifan Sun, Beizhe Hu, Juan Cao | Industry |

8. Agent & Multi-Agent Safety

| # | Title | Authors | Source |
| --- | --- | --- | --- |
| 1 | MAPS: A Multilingual Benchmark for Agent Performance and Security | Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein | Findings |
| 2 | The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems | Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi | Industry |
| 3 | Don’t Trust Generative Agents to Mimic Communication on Social Networks | (see ACL Anthology) | Long Papers |

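Summary Statistics

| Category | Number of Papers |
| --- | --- |
| Jailbreak Attacks | 3 |
| Adversarial Attacks & Vulnerabilities | 4 |
| Safety Defense & Alignment | 6 |
| Harmful Content Detection & Moderation | 6 |
| Privacy & Data Security | 6 |
| Bias & Fairness | 10 |
| Misinformation & Manipulation | 5 |
| Agent & Multi-Agent Safety | 3 |
| Total | 43 |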

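Note: Some papers may span multiple categories; each is classified here by its primary research direction. Source labels: Long Papers = main-conference long papers (Vol. 1), Short Papers = main-conference short papers (Vol. 2), Findings = Findings of ACL, SRW = Student Research Workshop (Vol. 4), Industry = Industry Track (Vol. 5).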
