Reference-Based Phishing Detection Scheme Using Large Language Models

doi:10.19678/j.issn.1000-3428.0252140

Abstract

Abstract: Phishing attacks are becoming increasingly complex and frequent. Traditional reference-based phishing detection schemes rely on a predefined list that maps brands to their domains: they identify the brand a page intends to represent through visual feature matching, then verify that the page's domain is consistent with that brand, which yields explainable detection. Although these schemes can counter zero-day phishing attacks, the reference list must be updated continuously to cover emerging brands, which leads to high maintenance costs. To address this, the paper proposes Phish-RAGLLM, a novel reference-based phishing detection scheme that uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Phish-RAGLLM does not depend on a predefined reference list. It reframes the traditional visual problem as a language problem, draws on the extensive brand knowledge embedded in the LLM, and uses RAG with an external brand knowledge base to strengthen generation, which suppresses LLM hallucinations and improves detection precision and robustness. Experimental results show that, compared with the current state-of-the-art model PhishLLM, Phish-RAGLLM balances model performance, inference cost, and knowledge base completeness; with GPT-3.5-turbo-instruct as the backbone LLM, it improves the F1 score by 5.88% and runtime efficiency by 12.5%. It also shows strong robustness to dataset variations and prompt injection attacks. Owing to the characteristics of LLMs, Phish-RAGLLM adapts well to multilingual phishing websites and effectively detects phishing webpages in different language environments. In addition, a field evaluation shows that the scheme has broader detection coverage than VirusTotal, a threat intelligence source, which further supports its feasibility and effectiveness.
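The reference-based flow summarized in the abstract (recognize the brand a page claims to be, look up that brand's legitimate domains, and check the page's domain against them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy knowledge base `BRAND_KB`, all function names, and the keyword-based stand-ins for the retrieval step and the LLM call are assumptions made purely for illustration.

```python
from urllib.parse import urlparse

# Hypothetical toy knowledge base: brand name -> legitimate registered domains.
# The paper's external brand knowledge base is not described here; this is a stand-in.
BRAND_KB = {
    "paypal": {"paypal.com"},
    "microsoft": {"microsoft.com", "live.com"},
}


def retrieve_brand_entries(page_text, kb):
    """Stand-in for retrieval: return KB entries whose brand name occurs in the text."""
    text = page_text.lower()
    return {brand: domains for brand, domains in kb.items() if brand in text}


def identify_brand(page_text, retrieved):
    """Stand-in for the LLM call: pick the retrieved brand mentioned most often."""
    if not retrieved:
        return None
    text = page_text.lower()
    return max(retrieved, key=lambda brand: text.count(brand))


def registered_domain(url):
    """Naive approximation: the last two labels of the hostname.
    A real system would consult the Public Suffix List instead."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_phishing(url, page_text, kb=BRAND_KB):
    """Flag a page when it presents a known brand but is hosted on a domain
    that the knowledge base does not list for that brand."""
    retrieved = retrieve_brand_entries(page_text, kb)
    brand = identify_brand(page_text, retrieved)
    if brand is None:
        return False  # no brand intent recognized, so nothing to verify
    return registered_domain(url) not in kb[brand]
```

In this sketch a page that mentions PayPal but is served from an unrelated domain is flagged, while the same text on `paypal.com` is not. In the scheme the abstract describes, brand recognition is performed by an LLM grounded with retrieved knowledge rather than by keyword matching.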

XIE Qingqing, LIU Yuanyuan. Reference-based phishing detection scheme using LLM[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252140.

