
Computer Engineering



LLM Jailbreaks Itself with Self-Retrieval and Re-Generation

• Published: 2025-08-15

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various domains. However, their security vulnerabilities, particularly jailbreak attacks, have raised significant concerns. In this work, we propose a novel and more practical indirect jailbreak attack method named Self-Retrieval Induced Self-Jailbreak (SRIS). SRIS leverages the LLM's own knowledge-retrieval capability, using internally generated information to induce the model to produce harmful responses. Because the approach does not rely on external knowledge, the attack is more feasible and easier to execute. We conducted extensive experiments on seven state-of-the-art LLMs. The results show that SRIS significantly outperforms existing methods in attack success rate, reaching up to 74.76% on GPT-3.5 and 56.8% on GPT-4. SRIS also leads by a clear margin in most question domains, demonstrating strong robustness and broad applicability. Our findings underscore the importance of careful training-data selection for LLMs, and we advocate further research into safer development practices to improve the security and reliability of LLMs in practical applications.
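The attack success rate (ASR) cited above is the standard metric in this line of work: the fraction of attack attempts for which the target model is judged to have produced a disallowed response. The following is a minimal sketch of that ratio, not the paper's evaluation code; the function name and the outcome labels are illustrative assumptions.

# Minimal sketch of how an attack-success-rate (ASR) figure is computed.
# Assumption (not from the paper): each attempt is labeled True if the
# target model was judged to have produced a disallowed response.

def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of attack attempts judged successful."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# e.g. 7,476 successes out of 10,000 attempts corresponds to the
# 74.76% figure reported for GPT-3.5 in the abstract.
print(f"ASR = {attack_success_rate([True] * 7476 + [False] * 2524):.2%}")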