Computer Engineering, 2026, Vol. 52, Issue (3): 211-221. doi: 10.19678/j.issn.1000-3428.0068805

• Cyberspace Security •

Techniques and Implementation of High-Quality Threat Intelligence Acquisition from the Dark Web

WANG Yilei1,*, SUN Xin1, HAN Jiajia1, GUO Shaohua2, HU Yuelin2, ZOU Futai2

  1. State Grid Zhejiang Electric Power Co., Ltd. Electric Power Research Institute, Hangzhou 310011, Zhejiang, China
  2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2024-02-02 Revised: 2024-09-14 Published: 2026-03-15 Released online: 2024-11-19
  • Corresponding author: WANG Yilei
  • About the authors:

    WANG Yilei, male, engineer, master's degree; main research interest: network security

    SUN Xin, professor-level senior engineer

    HAN Jiajia, senior engineer, master's degree

    GUO Shaohua, master's degree

    HU Yuelin, master's degree

    ZOU Futai, associate professor, Ph.D.

  • Funding:
    State Grid Science and Technology Project (5700-202319297A-1-1-ZN)

Abstract:

The dark web contains a large amount of hidden information about cyberattacks and cybercrime. Previous studies have mainly analyzed general open-source threat intelligence or addressed only a single aspect of dark web threat intelligence; they lack a systematic way to process and analyze dark web information and overlook its characteristics. To analyze, filter, and extract cybersecurity threat intelligence from the sprawling content of the dark web, this paper proposes a technique for acquiring high-quality threat intelligence from the dark web. It consists of four modules: information crawling, topic clustering, entity recognition, and novelty detection. Taking dark web forums as an example, a purpose-built crawler collects data from multiple forums. Top2Vec embeds the forum titles and posts into the same vector space as words and documents, respectively. The discussion topics of the posts are then analyzed to select threat-intelligence-related content at a coarse granularity and to remove noise from the crawled data. Next, named entity recognition performs fine-grained filtering, extracting threat intelligence entity words from the posts. On this basis, the information content of each entity word on the surface web (clearnet) is calculated to assess the importance of the extracted information, and high-quality cybersecurity-related dark web threat intelligence is finally selected. Experimental results show that the method is effective and can extract cyber threat intelligence from the collected dark web data.

Key words: dark web, threat intelligence, topic classification, named entity recognition, amount of information
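
The final filtering step described in the abstract scores each extracted entity word by its information content on the open (surface) web. The abstract does not give the formula, so the sketch below is only one plausible reading: it assumes the score is the self-information -log2(p), where p is the smoothed fraction of reference documents that mention the entity. The function names `information_content` and `rank_entities`, the add-one smoothing, and the sample counts are all hypothetical and are not taken from the paper.

```python
import math


def information_content(entity, doc_counts, total_docs):
    """Self-information, in bits, of an entity word.

    Assumption (not from the paper): p is the fraction of open-web
    reference documents mentioning the entity, with add-one smoothing
    so that unseen entities get a finite score.
    """
    p = (doc_counts.get(entity, 0) + 1) / (total_docs + 2)
    return -math.log2(p)


def rank_entities(entities, doc_counts, total_docs):
    """Rank entity words from most to least informative.

    Entities that are rare on the open web score highest; ties are
    broken alphabetically so the ordering is deterministic.
    """
    scored = [
        (e, information_content(e, doc_counts, total_docs))
        for e in set(entities)
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical document counts from an open-web reference corpus.
    open_web_counts = {"ransomware": 900, "phishing": 500}
    extracted = ["ransomware", "phishing", "zeroday-kit-x"]
    for entity, bits in rank_entities(extracted, open_web_counts, 1000):
        print(f"{entity}: {bits:.2f} bits")
```

Under this reading, an entity that appears in dark web posts but is almost absent from the open web ranks first, which matches the stated goal of surfacing intelligence that is not yet widely known. A real system would need an actual open-web frequency source in place of the hard-coded dictionary.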