Computer Engineering ›› 2020, Vol. 46 ›› Issue (8): 271-276. doi: 10.19678/j.issn.1000-3428.0055414

• Development Research and Engineering Application •

Junk Text Filtering Model Based on Feature Matrix Construction and BP Neural Network

FANG Rui1, YU Junyang1, DONG Lifeng2   

  1. School of Software, Henan University, Kaifeng, Henan 475000, China;
    2. Henan Jiuyu Tenglong Information Engineering Co., Ltd., Zhengzhou 450000, China
  • Received: 2019-07-08 Revised: 2019-08-21 Published: 2019-08-26

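  • About the authors: FANG Rui (b. 1995), male, master's student, whose main research interests are natural language processing and text classification; YU Junyang (corresponding author), associate professor, Ph.D.; DONG Lifeng, bachelor's degree.
  • Supported by:
    National Natural Science Foundation of China (61602525); Henan Province Science and Technology Development Plan Project (182102210229).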

Abstract: Online social platforms carry large volumes of junk text, and its wide spread disrupts normal social interaction. To address this problem, this paper proposes a junk text filtering model. The model first uses the BERT model to extract a sentence encoding for each text. It then applies the B-Feature method to construct features from the sentence encoding, and arranges those features into a feature matrix according to the relationship between the features and the text. A BP (back-propagation) neural network classifier processes the feature matrix to detect and filter junk text. Experimental results show that, on long, medium, and short text datasets, the accuracy of the proposed model is 7.8%, 3.8%, and 11.7% higher, respectively, than that of the TFIDF-BP model; on medium and short text datasets, it is 2.1% and 13.7% higher, respectively, than that of the naive Bayes model. The model can therefore classify and filter junk text effectively.

Key words: BERT model, feature construction, BP neural network, junk text filtering, text classification, sentence coding
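
The final stage of the pipeline described in the abstract, a BP neural network classifier applied to per-text feature vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the B-Feature method and the feature matrix construction are not specified on this page and are not reproduced here. The sketch assumes each text has already been turned into a fixed-length vector, and it substitutes synthetic Gaussian vectors for real BERT sentence encodings. All names (`make_toy_encodings`, `train_bp`, `predict`), the layer sizes, and the learning rate are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_toy_encodings(n_per_class=100, dim=16, seed=0):
    """Synthetic stand-in for sentence encodings (assumption, not BERT output).

    Label 1 = junk text, label 0 = normal text.
    """
    rng = np.random.default_rng(seed)
    junk = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, dim))
    normal = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, dim))
    X = np.vstack([junk, normal])
    y = np.concatenate([np.ones(n_per_class, dtype=int),
                        np.zeros(n_per_class, dtype=int)])
    return X, y


def train_bp(X, y, hidden=8, lr=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network with full-batch backpropagation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, size=(hidden, 1))
    b2 = np.zeros(1)
    t = y.reshape(-1, 1).astype(float)
    eps = 1e-12
    losses = []
    for _ in range(epochs):
        # Forward pass: tanh hidden layer, sigmoid output.
        h = np.tanh(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        loss = -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
        losses.append(float(loss))
        # Backward pass: propagate the cross-entropy error layer by layer.
        dz2 = (p - t) / n
        dW2 = h.T @ dz2
        db2 = dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * (1.0 - h ** 2)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        # Gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    """Return 1 (junk) or 0 (normal) for each row of X."""
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return (sigmoid(h @ W2 + b2).ravel() >= 0.5).astype(int)
```

To use the sketch, one would call `make_toy_encodings()`, pass the result to `train_bp`, and filter out the texts for which `predict` returns 1. In a real setting the rows of `X` would come from the feature construction step rather than from a random generator.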
