
Computer Engineering



LLM Jailbreaks Itself with Self-Retrieval and Re-Generation

• Published: 2025-08-15

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various domains. However, their security vulnerabilities, particularly jailbreak attacks, have raised significant concerns. In this work, we propose a novel and more practical indirect jailbreak attack method named Self-Retrieval Induced Self-Jailbreak (SRIS). SRIS leverages the LLM's own knowledge-retrieval capability, using internally generated information to induce the model to produce harmful responses. Because the approach does not rely on external knowledge, the attack is more feasible and easier to execute. We conducted extensive experiments on seven state-of-the-art LLMs. The results show that SRIS significantly outperforms existing methods in attack success rate, reaching up to 74.76% on GPT-3.5 and 56.8% on GPT-4. SRIS also leads by a clear margin in most question domains, demonstrating strong robustness and broad applicability. Our findings underscore the importance of careful training-data selection for LLMs, and we advocate further research into safer development practices to improve the security and reliability of LLMs in practical applications.
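The attack success rate (ASR) cited above is the standard metric in this line of work: the fraction of attack attempts for which the target model is judged to have produced a disallowed response. The following is a minimal sketch of that ratio, not the paper's evaluation code; the function name and the outcome labels are illustrative assumptions.

# Minimal sketch of how an attack-success-rate (ASR) figure is computed.
# Assumption (not from the paper): each attempt is labeled True if the
# target model was judged to have produced a disallowed response.

def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of attack attempts judged successful."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# e.g. 7,476 successes out of 10,000 attempts corresponds to the
# 74.76% figure reported for GPT-3.5 in the abstract.
print(f"ASR = {attack_success_rate([True] * 7476 + [False] * 2524):.2%}")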