Computer Engineering ›› 2023, Vol. 49 ›› Issue (8): 13-19, 28. doi: 10.19678/j.issn.1000-3428.0067003

• Hot Topics and Reviews •

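Classification of Harmful Information on Internet Based on Long-Tailed Classification Algorithm

Jinshuo LIU, Daichen WANG, Juan DENG, Lina WANG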

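  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan University, Wuhan 430072, China
  • Received: 2023-02-22 Online: 2023-08-15 Published: 2023-04-18
  • About the authors:

    Jinshuo LIU (born 1973), female, professor, Ph.D., doctoral supervisor; main research interests: online public opinion monitoring, data mining, and high-performance computing

    Daichen WANG, master's student

    Juan DENG, associate professor, Ph.D.

    Lina WANG, professor, Ph.D., doctoral supervisor

  • Funding:
    National Natural Science Foundation of China (U193607); National Key Research and Development Program of China (2020YFA0607902)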

Abstract:

Most existing methods for classifying harmful information on the Internet overlook data imbalance and long-tailed distributions. As a result, models are biased toward classes with many samples and fail to reliably identify classes with few samples, which lowers overall recognition accuracy. To address this issue, a classification method for long-tailed harmful information datasets, LTIC, is proposed. LTIC combines few-shot learning with a knowledge transfer strategy: a BERT model learns the weights of the head classes, and a Prototyper network, designed specifically for few-shot learning, produces the head-class prototypes. Head and tail data are processed separately, which avoids the imbalance that arises when they are trained together. The method learns a mapping from prototypes to weights, uses this learned knowledge to convert the tail-class prototypes into weights, and then concatenates the head-class and tail-class weights to obtain the final classification result. In experiments, LTIC achieves classification accuracies of 82.7% and 83.5% on the Twitter and THUCNews datasets, respectively, and its F1 value improves significantly over non-long-tailed models. Compared with recent long-tailed classification methods such as BNN and OLTR, LTIC performs better, with an average accuracy improvement of 3%. When a new category of harmful information emerges, LTIC can predict it with only a small amount of computation, reaching an accuracy of 70%, which indicates good scalability.

Key words: classification of harmful information, data imbalance, long-tailed dataset, few-shot learning, knowledge transfer
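
The prototype-to-weight transfer described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the form of the mapping, so a per-dimension (diagonal) linear map fitted by least squares is assumed here purely for brevity. The 3-dimensional embeddings, the class names (`spam`, `abuse`, `fraud`), and the hard-coded head-class weights (standing in for weights a BERT classifier would learn) are all hypothetical.

```python
def prototype(embeddings):
    """Class prototype: element-wise mean of the class's sample embeddings."""
    return [sum(dim) / len(dim) for dim in zip(*embeddings)]


def fit_diagonal_map(prototypes, weights):
    """Fit w[d] ~= scale[d] * p[d] per dimension by least squares.

    ASSUMPTION: a diagonal linear map is used only to keep the sketch
    short; the paper's actual prototype-to-weight mapping may differ.
    """
    scales = []
    for d in range(len(prototypes[0])):
        num = sum(p[d] * w[d] for p, w in zip(prototypes, weights))
        den = sum(p[d] ** 2 for p in prototypes)
        scales.append(num / den if den else 0.0)
    return scales


def apply_map(scales, proto):
    """Turn a prototype into a classifier weight vector."""
    return [s * x for s, x in zip(scales, proto)]


def predict(x, class_weights):
    """Pick the class whose weight vector has the largest dot product with x."""
    scores = {
        label: sum(wi * xi for wi, xi in zip(w, x))
        for label, w in class_weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical head classes: sample embeddings plus already-learned weights.
head_samples = {
    "spam": [[1.0, 0.2, 0.1], [0.8, 0.0, 0.3]],
    "abuse": [[0.1, 1.0, 0.2], [0.3, 0.8, 0.0]],
}
head_weights = {  # placeholder for weights learned on head data
    "spam": [1.8, 0.2, 0.4],
    "abuse": [0.4, 1.8, 0.2],
}

# Hypothetical tail class with only two labeled samples.
tail_samples = {"fraud": [[0.1, 0.2, 1.0], [0.3, 0.0, 0.8]]}

# 1. Head prototypes. 2. Learn prototype -> weight mapping on head classes.
head_protos = {c: prototype(s) for c, s in head_samples.items()}
labels = list(head_samples)
scales = fit_diagonal_map(
    [head_protos[c] for c in labels], [head_weights[c] for c in labels]
)

# 3. Convert tail prototypes to weights. 4. Join head and tail weights.
tail_weights = {c: apply_map(scales, prototype(s)) for c, s in tail_samples.items()}
all_weights = {**head_weights, **tail_weights}

print(predict([0.2, 0.1, 1.1], all_weights))
print(predict([1.0, 0.1, 0.1], all_weights))
```

In this sketch, adding a further tail class only requires averaging its few samples and applying the already-fitted map, which mirrors the low-cost extension to new categories mentioned in the abstract.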