
Computer Engineering ›› 2020, Vol. 46 ›› Issue (8): 271-276. doi: 10.19678/j.issn.1000-3428.0055414

• Development Research and Engineering Application •

Junk Text Filtering Model Based on Feature Matrix Construction and BP Neural Network

FANG Rui1, YU Junyang1, DONG Lifeng2   

  1. School of Software, Henan University, Kaifeng, Henan 475000, China;
    2. Henan Jiuyu Tenglong Information Engineering Co., Ltd., Zhengzhou 450000, China
  • Received: 2019-07-08  Revised: 2019-08-21  Published: 2019-08-26
  • About the authors: FANG Rui (born 1995), male, M.S. candidate; his main research interests are natural language processing and text classification. YU Junyang (corresponding author), associate professor, Ph.D. DONG Lifeng, B.S.
  • Funding:
    National Natural Science Foundation of China (61602525); Science and Technology Development Program of Henan Province (182102210229).


Abstract: Online social platforms carry massive volumes of text, a large portion of which is junk text whose wide spread hinders people's normal social interaction. To address this problem, this paper proposes a junk text filtering model. The model uses the BERT model to extract the sentence encoding of each text, constructs features from the sentence encodings with the B-Feature method, and further organizes the obtained features into a feature matrix according to the relationship between the features and the texts. The feature matrix is then processed by a BP neural network classifier, which detects junk texts so that they can be filtered out. Experimental results show that the accuracy of the proposed model on long, medium, and short text datasets is 7.8%, 3.8%, and 11.7% higher, respectively, than that of the TFIDF-BP model, and that its accuracy on medium and short text datasets is 2.1% and 13.7% higher, respectively, than that of the naive Bayes model, demonstrating that the model can effectively classify and filter junk texts.
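For readers who want to prototype the pipeline described in the abstract, the following is a minimal sketch, not the authors' implementation. It obtains sentence encodings from a pretrained Chinese BERT (via the Hugging Face transformers library, an assumed toolkit since the abstract does not name one), leaves the paper's B-Feature construction as an identity placeholder because its details are not given here, stacks the results into a feature matrix, and trains a BP (back-propagation) feedforward classifier using scikit-learn's MLPClassifier. All function names and the toy data are illustrative.

```python
# Minimal sketch of the described pipeline (illustrative only, not the authors' code).
# Assumed dependencies: transformers, torch, scikit-learn, numpy.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.neural_network import MLPClassifier

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def sentence_encoding(texts):
    """Return BERT [CLS] sentence encodings, one 768-dim vector per text."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        out = bert(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

def b_feature(encodings):
    """Placeholder for the paper's B-Feature construction step.

    The exact transformation is not described in the abstract, so this
    sketch simply passes the sentence encodings through unchanged.
    """
    return encodings

# Hypothetical toy data: label 1 = junk text, 0 = normal text.
texts = ["限时优惠,点击链接立即领取现金红包",   # spam-like ad message
         "明天下午的组会改到三点开始"]           # normal message
labels = [1, 0]

# Build the feature matrix (one row per text) and train a BP neural network classifier.
feature_matrix = b_feature(sentence_encoding(texts))
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(feature_matrix, labels)

# Filtering: keep only texts predicted as normal.
kept = [t for t, p in zip(texts, clf.predict(feature_matrix)) if p == 0]
print(kept)
```

In practice the classifier would be trained on a labeled corpus of long, medium, and short texts as in the paper's experiments; the snippet above only illustrates how the sentence-encoding, feature-matrix, and BP-classification stages fit together.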

Key words: BERT model, feature construction, BP neural network, junk text filtering, text classification, sentence coding
