[1] WANG, Y., LI, Q., DAI, Z., et al.: A survey on the current status and trends of large language model research. Journal of Engineering Science 46(8), 1411–1425 (2024) (in Chinese)
[2] ACHIAM, J., ADLER, S., AGARWAL, S., AHMAD, L., AKKAYA, I., ALEMAN, F.L., ALMEIDA, D., ALTENSCHMIDT, J., ALTMAN, S., ANADKAT, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[3] TOUVRON, H., MARTIN, L., STONE, K., ALBERT, P., ALMAHAIRI, A., BABAEI, Y., BASHLYKOV, N., BATRA, S., BHARGAVA, P., BHOSALE, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[4] TEAM, G., ANIL, R., BORGEAUD, S., ALAYRAC, J.B., YU, J., SORICUT, R., SCHALKWYK, J., DAI, A.M., HAUTH, A., MILLICAN, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[5] STAHLBERG, F.: Neural machine translation: A review. Journal of Artificial Intelligence Research 69, 343–418 (2020)
[6] EGHBALI, A., PRADEL, M.: De-hallucinator: Iterative grounding for llm-based code completion. arXiv preprint arXiv:2401.01701 (2024)
[7] LIANG, T., JIN, C., WANG, L., FAN, W., XIA, C., CHEN, K., YIN, Y.: Llm-redial: A large-scale dataset for conversational recommender systems created from user behaviors with llms. In: Findings of the Association for Computational Linguistics ACL 2024. pp. 8926–8939 (2024)
[8] WANG, B., CHEN, W., PEI, H., XIE, C., KANG, M., ZHANG, C., XU, C., XIONG, Z., DUTTA, R., SCHAEFFER, R., et al.: Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In: NeurIPS (2023)
[9] DENG, G., LIU, Y., LI, Y., WANG, K., ZHANG, Y., LI, Z., WANG, H., ZHANG, T., LIU, Y.: Masterkey: Automated jailbreaking of large language model chatbots. In: Proc. ISOC NDSS (2024)
[10] LIU, Y., DENG, G., XU, Z., LI, Y., ZHENG, Y., ZHANG, Y., ZHAO, L., ZHANG, T., WANG, K., LIU, Y.: Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023)
[11] TAI, J., YANG, S., WANG, J., LI, Y., LIU, Q., JIA, X.: A survey on adversarial attacks and defenses of large language models. Journal of Computer Research and Development 62(3), 563–588 (2025). DOI: 10.7544/issn1000-1239.202440630 (in Chinese)
[12] HUANG, X., RUAN, W., HUANG, W., et al.: A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review 57(7), 175 (2024)
[13] WU, F., ZHANG, N., JHA, S., et al.: A new era in llm security: Exploring security concerns in real-world llm-based systems. arXiv preprint arXiv:2402.18649 (2024)
[14] YAO, Y., DUAN, J., XU, K., et al.: A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing p. 100211 (2024)
[15] ZHOU, W., ZHU, X., HAN, Q.L., et al.: The security of using large language models: A survey with emphasis on chatgpt. IEEE/CAA Journal of Automatica Sinica (2024)
[16] DENG, G., LIU, Y., WANG, K., LI, Y., ZHANG, T., LIU, Y.: Pandora: Jailbreak gpts by retrieval augmented generation poisoning. arXiv preprint arXiv:2402.08416 (2024)
[17] LIANG, S., HE, Y., LIU, A., et al.: A survey of jailbreak attacks and defenses targeting large language models. Journal of Information Security 9(5), 56–86 (2024) (in Chinese)
[18] TU, S., PAN, Z., WANG, W., ZHANG, Z., SUN, Y., YU, J., WANG, H., HOU, L., LI, J.: Knowledge-to-jailbreak: One knowledge point worth one attack. arXiv preprint arXiv:2406.11682 (2024)
[19] WANG, Z., LIU, J., ZHANG, S., YANG, Y.: Poisoned langchain: Jailbreak llms by langchain. arXiv preprint arXiv:2406.18122 (2024)
[20] YUAN, Y., JIAO, W., WANG, W., et al.: Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. In: The Twelfth International Conference on Learning Representations (2024)
[21] JIANG, F., XU, Z., NIU, L., et al.: Artprompt: Ascii art-based jailbreak attacks against aligned llms. In: ICLR 2024 Workshop on Secure and Trustworthy Large Language Models (2024)
[22] SHEN, X., WU, Y., BACKES, M., et al.: Voice jailbreak attacks against gpt-4o. arXiv preprint arXiv:2405.19103 (2024)
[23] CHAO, P., ROBEY, A., DOBRIBAN, E., et al.: Jailbreaking black box large language models in twenty queries. In: R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (2024)
[24] MEHROTRA, A., ZAMPETAKIS, M., KASSIANIK, P., et al.: Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119 (2023)
[25] TU, S., PAN, Z., WANG, W., ZHANG, Z., SUN, Y., YU, J., WANG, H., HOU, L., LI, J.: Knowledge-to-jailbreak: One knowledge point worth one attack. arXiv preprint arXiv:2406.11682 (2024)
[26] TU, S., PAN, Z., WANG, W., ZHANG, Z., SUN, Y., YU, J., WANG, H., HOU, L., LI, J.: Knowledge-to-jailbreak: One knowledge point worth one attack. arXiv preprint arXiv:2406.11682 (2024)
[27] ZHANG, Q., ZENG, B., ZHOU, C., et al.: Human-imperceptible retrieval poisoning attacks in llm-powered applications. In: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. pp. 502–506 (2024)
[28] CHENG, P., DING, Y., JU, T., et al.: Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models. arXiv preprint (2024)
[29] XIE, Y., YI, J., SHAO, J., et al.: Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence 5(12), 1486–1496 (2023)
[30] ZHANG, Z., YANG, J., KE, P., et al.: Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096 (2023)
[31] PIET, J., ALRASHED, M., SITAWARIN, C., et al.: Jatmo: Prompt injection defense by task-specific finetuning. In: European Symposium on Research in Computer Security. pp. 105–124. Springer Nature Switzerland (2024)
[32] PISANO, M., LY, P., SANDERS, A., et al.: Bergeron: Combating adversarial attacks through a conscience-based alignment framework. arXiv preprint arXiv:2312.00029 (2023)
[33] HU, Z., WU, G., MITRA, S., et al.: Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv preprint arXiv:2311.11509 (2023)
[34] KUMAR, A., AGARWAL, C., SRINIVAS, S., et al.: Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023)
[35] ROBEY, A., WONG, E., HASSANI, H., et al.: Smoothllm: Defending large language models against jailbreaking attacks. In: R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (2024)
[36] MAZEIKA, M., PHAN, L., YIN, X., ZOU, A., WANG, Z., MU, N., SAKHAEE, E., LI, N., BASART, S., LI, B., et al.: Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)
[37] ZOU, A., WANG, Z., CARLINI, N., NASR, M., KOLTER, J.Z., FREDRIKSON, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)
[38] YUAN, Y., JIAO, W., WANG, W., et al.: Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463 (2023)
[39] ZOU, A., WANG, Z., KOLTER, J.Z., FREDRIKSON, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)
[40] QI, X., ZENG, Y., XIE, T., CHEN, P., JIA, R., MITTAL, P., HENDERSON, P.: Fine-tuning aligned language models compromises safety, even when users do not intend to! In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 (2024)