
Computer Engineering ›› 2020, Vol. 46 ›› Issue (8): 271-276. doi: 10.19678/j.issn.1000-3428.0055414

• Development Research and Engineering Application •

Junk Text Filtering Model Based on Feature Matrix Construction and BP Neural Network

FANG Rui1, YU Junyang1, DONG Lifeng2   

  1. School of Software, Henan University, Kaifeng, Henan 475000, China;
    2. Henan Jiuyu Tenglong Information Engineering Co., Ltd., Zhengzhou 450000, China
  • Received: 2019-07-08  Revised: 2019-08-21  Published: 2019-08-26
  • About the authors: FANG Rui (born 1995), male, M.S. candidate; his main research interests are natural language processing and text classification. YU Junyang (corresponding author), associate professor, Ph.D. DONG Lifeng, B.S.
  • Funding:
    National Natural Science Foundation of China (61602525); Science and Technology Development Program of Henan Province (182102210229).


Abstract: Online social platforms carry massive volumes of text, a large portion of which is junk text whose wide spread hinders people's normal social interaction. To address this problem, this paper proposes a junk text filtering model. The model uses the BERT model to extract the sentence encoding of each text, constructs features from the sentence encodings with the B-Feature method, and further organizes the obtained features into a feature matrix according to the relationship between the features and the texts. The feature matrix is then processed by a BP neural network classifier, which detects junk texts so that they can be filtered out. Experimental results show that the accuracy of the proposed model on long, medium, and short text datasets is 7.8%, 3.8%, and 11.7% higher, respectively, than that of the TFIDF-BP model, and that its accuracy on medium and short text datasets is 2.1% and 13.7% higher, respectively, than that of the naive Bayes model, demonstrating that the model can effectively classify and filter junk texts.
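For readers who want to prototype the pipeline described in the abstract, the following is a minimal sketch, not the authors' implementation. It obtains sentence encodings from a pretrained Chinese BERT (via the Hugging Face transformers library, an assumed toolkit since the abstract does not name one), leaves the paper's B-Feature construction as an identity placeholder because its details are not given here, stacks the results into a feature matrix, and trains a BP (back-propagation) feedforward classifier using scikit-learn's MLPClassifier. All function names and the toy data are illustrative.

```python
# Minimal sketch of the described pipeline (illustrative only, not the authors' code).
# Assumed dependencies: transformers, torch, scikit-learn, numpy.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.neural_network import MLPClassifier

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def sentence_encoding(texts):
    """Return BERT [CLS] sentence encodings, one 768-dim vector per text."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        out = bert(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

def b_feature(encodings):
    """Placeholder for the paper's B-Feature construction step.

    The exact transformation is not described in the abstract, so this
    sketch simply passes the sentence encodings through unchanged.
    """
    return encodings

# Hypothetical toy data: label 1 = junk text, 0 = normal text.
texts = ["限时优惠,点击链接立即领取现金红包",   # spam-like ad message
         "明天下午的组会改到三点开始"]           # normal message
labels = [1, 0]

# Build the feature matrix (one row per text) and train a BP neural network classifier.
feature_matrix = b_feature(sentence_encoding(texts))
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(feature_matrix, labels)

# Filtering: keep only texts predicted as normal.
kept = [t for t, p in zip(texts, clf.predict(feature_matrix)) if p == 0]
print(kept)
```

In practice the classifier would be trained on a labeled corpus of long, medium, and short texts as in the paper's experiments; the snippet above only illustrates how the sentence-encoding, feature-matrix, and BP-classification stages fit together.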

Key words: BERT model, feature construction, BP neural network, junk text filtering, text classification, sentence coding
