Techniques and Implementation of High-Quality Threat Intelligence Acquisition from the Dark Web

Computer Engineering, 2026, Vol. 52, Issue (3): 211-221. doi: 10.19678/j.issn.1000-3428.0068805

WANG Yilei¹,*, SUN Xin¹, HAN Jiajia¹, GUO Shaohua², HU Yuelin², ZOU Futai²

1. State Grid Zhejiang Electric Power Co., Ltd. Electric Power Research Institute, Hangzhou 310011, Zhejiang, China
2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Received: 2024-02-02 Revised: 2024-09-14 Online: 2026-03-15 Published: 2024-11-19
Contact: WANG Yilei
About the authors:
WANG Yilei, male, engineer, M.S.; main research interest: network security
SUN Xin, professor-level senior engineer
HAN Jiajia, senior engineer, M.S.
GUO Shaohua, M.S.
HU Yuelin, M.S.
ZOU Futai, associate professor, Ph.D.
Funding:
State Grid Science and Technology Project (5700-202319297A-1-1-ZN)
Abstract

The dark web holds a large amount of hidden information about cyber attacks and cybercrime. Previous studies have mainly analyzed general open-source threat intelligence or addressed a single aspect of dark web threat intelligence; they lack a systematic method for processing and analyzing dark web information and overlook its characteristics. To analyze, filter, and extract the vast and heterogeneous content of the dark web and make use of the intelligence related to cybersecurity threats, this paper proposes a technique for acquiring high-quality threat intelligence from the dark web. It consists of four modules: information crawling, topic clustering, entity recognition, and novelty detection. Taking dark web forums as an example, a purpose-built crawler collects data from multiple forums. Top2Vec embeds forum titles and posts into the same vector space as words and documents, respectively, and the discussion topics of the posts are analyzed to filter threat intelligence-related content at a coarse granularity, removing noise from the crawled data. Named entity recognition then performs fine-grained filtering and extracts threat intelligence entity words from the posts. On this basis, the information content of each entity word on the open web is calculated to evaluate the importance of the extracted information, and high-quality cybersecurity-related dark web threat intelligence is finally selected. Experimental results show that the method is effective and can extract cyber threat intelligence from the collected dark web information.

Key words: dark web, threat intelligence, topic classification, named entity recognition, information content
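
The coarse topic filter and the final information-content ranking described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hand-made two-dimensional vectors stand in for the joint word/document embedding that Top2Vec would learn, the similarity threshold is arbitrary, and the abstract does not give the information-content formula, so smoothed Shannon self-information over open-web hit counts is assumed. All names, entities, and counts (`coarse_filter`, `information_content`, `entity-a`, `TOTAL_PAGES`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_filter(post_vectors, topic_vector, threshold=0.5):
    """Keep the ids of posts whose embedding lies close to a threat-related topic vector."""
    return [pid for pid, vec in post_vectors.items()
            if cosine(vec, topic_vector) >= threshold]


def information_content(hits, total):
    """Self-information in bits (assumed formula), smoothed so zero hits stay finite."""
    p = (hits + 1) / (total + 2)
    return -math.log2(p)


# Toy embeddings (assumption): in the paper these would come from Top2Vec.
post_vectors = {
    "post-1": [0.9, 0.1],  # close to the threat topic
    "post-2": [0.1, 0.9],  # off-topic chatter
}
threat_topic = [1.0, 0.0]
selected = coarse_filter(post_vectors, threat_topic)

# Hypothetical entity words from the NER stage with made-up open-web hit counts.
open_web_hits = {"entity-a": 3, "entity-b": 50000}
TOTAL_PAGES = 1_000_000
scores = {e: information_content(h, TOTAL_PAGES) for e, h in open_web_hits.items()}

# Entities that are rare on the open web carry more information and rank first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this assumed scoring, an entity word that is seldom mentioned on the open web ranks above one that is already widely discussed, which matches the stated goal of surfacing intelligence that is both relevant and new.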

WANG Yilei, SUN Xin, HAN Jiajia, GUO Shaohua, HU Yuelin, ZOU Futai. Techniques and Implementation of High-Quality Threat Intelligence Acquisition from the Dark Web[J]. Computer Engineering, 2026, 52(3): 211-221.

https://www.ecice06.com/CN/Y2026/V52/I3/211

Figures/Tables 19

Fig.1 Framework of the threat intelligence extraction system

Fig.2 Crawler process

Fig.3 Word2Vec semantic coordinates

Fig.4 Document and word vectors embedded in a common space

Fig.5 Entity extraction module

Fig.6 Distribution of text embeddings

Fig.7 Examples of unselected malware-related posts

Fig.8 Examples of selected malware-related posts
