Computer Engineering ›› 2023, Vol. 49 ›› Issue (11): 187-194, 210. doi: 10.19678/j.issn.1000-3428.0066805

• Cyberspace Security •

User Identity Information Aggregation Method for Darknet Web Page

Yuyan WANG1, Jiapeng ZHAO1, Jinqiao SHI1, Liyan SHEN1, Hongmeng LIU1, Yanyan YANG2

  1. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. School of Information Network Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-01-20 Published: 2023-11-15 Online: 2023-11-08
  • About the authors:

    Yuyan WANG (王雨燕, born 1997), female, master's student; main research interests: text information processing and knowledge graphs

    Jiapeng ZHAO (赵佳鹏), postdoctoral researcher

    Jinqiao SHI (时金桥), professor-level senior engineer

    Liyan SHEN (申立艳), postdoctoral researcher

    Hongmeng LIU (刘洪梦), master's student

    Yanyan YANG (杨燕燕), master's student

  • Funding:
    Key Research and Development Program of Guangdong Province (2019B010137003)

Abstract:

User identity information on darknet Web pages is sparsely and irregularly distributed, so current mainstream information aggregation techniques cannot be applied directly in this setting. This study proposes a user identity information aggregation model based on coreference relation extraction. The model takes a pair of user identity information items and their context as input and returns whether the pair holds a coreference relation; a corresponding user identity information dataset is also constructed for the aggregation experiments. To further improve the recognition ability of the model, entity category information is introduced into the baseline model, yielding an entity category-sensitive coreference relation extraction model. Because sufficient training samples cannot be obtained for some identity categories on the darknet, a few-shot learning task is introduced to build a multi-task user identity information aggregation model for low-resource conditions. Experimental results show that, under low-resource conditions, the F1 value of the optimized aggregation model reaches 87.03%, which is 11.98 percentage points higher than that of the baseline model.

Key words: darknet, user identity information, information aggregation, relation extraction, few-shot learning, multi-task learning
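
The abstract describes a model that decides, for one pair of identity information items at a time, whether the pair is coreferent. It does not say how those pairwise decisions are merged into per-user groups. The sketch below shows one plausible way to do that merging, using a union-find structure. It is an illustration under stated assumptions, not the authors' implementation: `aggregate_identities`, the `Mention` tuple layout, and the `is_coreferent` callback (a stand-in for a trained pairwise model) are all hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation: (entity category, surface string).
Mention = Tuple[str, str]


def aggregate_identities(
    mentions: List[Mention],
    candidate_pairs: Iterable[Tuple[int, int]],
    is_coreferent: Callable[[Mention, Mention], bool],
) -> List[List[Mention]]:
    """Group mentions into clusters from pairwise coreference decisions.

    `is_coreferent` stands in for a trained pairwise classifier; the
    union-find merging is an assumption, not taken from the paper.
    """
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in candidate_pairs:
        if is_coreferent(mentions[i], mentions[j]):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two clusters

    clusters = defaultdict(list)
    for index, mention in enumerate(mentions):
        clusters[find(index)].append(mention)
    return list(clusters.values())


# Toy usage with made-up data and a lookup table in place of a model.
toy_mentions = [
    ("username", "alice_dn"),
    ("email", "alice@example.invalid"),
    ("pgp_key", "0xDEADBEEF"),
    ("username", "bob77"),
]
toy_links = {
    frozenset({toy_mentions[0], toy_mentions[1]}),
    frozenset({toy_mentions[1], toy_mentions[2]}),
}
toy_clusters = aggregate_identities(
    toy_mentions,
    [(0, 1), (1, 2), (2, 3)],
    lambda a, b: frozenset({a, b}) in toy_links,
)
for cluster in toy_clusters:
    print(cluster)
```

Merging by transitive closure means a single false-positive pair joins two otherwise separate users, so a real pipeline would likely need a confidence threshold or a stricter clustering rule; that choice is outside what the abstract states.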