Recognition Method of Microblogging Spam Comment Based on Co-Training

doi:10.19678/j.issn.1000-3428.0047259

Computer Engineering, 2018, Vol. 44, Issue (7): 212-218.

LI Zhixin^1,2, LAN Danmei^1,2, ZHANG Canlong^1,2, TANG Suqin^1,2

1. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China; 2. Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Guilin, Guangxi 541004, China

Received: 2017-05-18 Online: 2018-07-15 Published: 2018-07-15
About the authors: LI Zhixin (born 1971), male, professor, Ph.D.; main research interests: data mining, image understanding, and machine learning. LAN Danmei, master's student. ZHANG Canlong, associate professor, Ph.D. TANG Suqin, professor, Ph.D.
Funding:
National Natural Science Foundation of China (61663004, 61363035, 61365009); Guangxi Natural Science Foundation (2016GXNSFAA380146, 2017GXNSFAA198365); Director's Fund of the Guangxi Key Lab of Multi-source Information Mining and Security (16-A-03-02); Guangxi Special Project on Degree and Postgraduate Education Reform (JGY2015031).

Abstract:

The large volume of spam comments on microblogs can harm individuals, society, and even the country. To identify spam comments in microblogs, a recognition method based on co-training is proposed. A rule-based recognition method is first defined to filter out explicit spam comments, and the remaining comments are treated as related comments. An AdaBoost classifier and a Support Vector Machine (SVM) classifier are then constructed and trained collaboratively with the Co-Training algorithm to judge whether each related comment is spam, which improves classification accuracy and reduces sample-labeling effort. Experimental results show that the method achieves better recognition performance than the spam comment recognition method based on similarity calculation and the spam comment recognition method based on multiple comment features.

Key words: microblogging spam comment, co-training (collaborative training), Tongyici Cilin (synonym thesaurus), Support Vector Machine (SVM), similarity computation
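The co-training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the features, confidence measure, or round settings, so everything below is an assumption made for illustration. In particular, `CentroidClassifier` is a toy nearest-centroid learner standing in for the AdaBoost and SVM classifiers, the two features (a link count and a comment-to-post similarity score) are hypothetical, and `rounds` and `per_round` are arbitrary. The rule-based filtering of explicit spam is assumed to have happened already.

```python
from statistics import mean


class CentroidClassifier:
    """Toy stand-in learner (NOT AdaBoost or SVM): nearest class centroid on one feature view."""

    def __init__(self, view):
        self.view = view          # indices of the features this learner sees
        self.centroids = {}

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [[x[i] for i in self.view] for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        # Distance from x to every class centroid, restricted to this learner's view.
        dist = {
            label: sum((x[i] - c) ** 2 for i, c in zip(self.view, centroid)) ** 0.5
            for label, centroid in self.centroids.items()
        }
        ranked = sorted(dist, key=dist.get)
        if len(ranked) < 2:
            return ranked[0], 0.0
        # Confidence (an assumed measure): margin between the two nearest centroids.
        return ranked[0], dist[ranked[1]] - dist[ranked[0]]


def co_train(clf_a, clf_b, X_lab, y_lab, X_unlab, rounds=5, per_round=2):
    """Each learner labels its most confident unlabeled samples for the other learner."""
    X_a, y_a = list(X_lab), list(y_lab)
    X_b, y_b = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        if not pool:
            break
        clf_a.fit(X_a, y_a)
        clf_b.fit(X_b, y_b)
        # (teacher, student training set): a teaches b, then b teaches a.
        for teacher, X_s, y_s in ((clf_a, X_b, y_b), (clf_b, X_a, y_a)):
            ranked = sorted(
                pool, key=lambda x: teacher.predict_with_confidence(x)[1], reverse=True
            )
            for x in ranked[:per_round]:
                X_s.append(x)
                y_s.append(teacher.predict_with_confidence(x)[0])
                pool.remove(x)
    clf_a.fit(X_a, y_a)
    clf_b.fit(X_b, y_b)
    return clf_a, clf_b


def predict(clf_a, clf_b, x):
    """Crude combination (an assumption): take the label of the more confident learner."""
    results = (clf_a.predict_with_confidence(x), clf_b.predict_with_confidence(x))
    return max(results, key=lambda r: r[1])[0]


# Hypothetical features: [number of links in the comment, similarity to the post].
X_labeled = [[3, 0.1], [4, 0.2], [0, 0.8], [0, 0.9]]
y_labeled = ["spam", "spam", "normal", "normal"]
X_unlabeled = [[5, 0.15], [0, 0.7], [3, 0.05], [1, 0.85]]

model_a, model_b = co_train(
    CentroidClassifier(view=(0,)), CentroidClassifier(view=(1,)),
    X_labeled, y_labeled, X_unlabeled,
)
print(predict(model_a, model_b, [4, 0.1]))
print(predict(model_a, model_b, [0, 0.9]))
```

Only four comments are hand-labeled here; the remaining four receive labels from the learners themselves, which is how co-training reduces labeling effort. Swapping the toy learners for real AdaBoost and SVM models would require a comparable confidence score from each, for example class probabilities.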

CLC Number:

TP391

LI Zhixin, LAN Danmei, ZHANG Canlong, TANG Suqin. Recognition Method of Microblogging Spam Comment Based on Co-Training[J]. Computer Engineering, 2018, 44(7): 212-218.

http://www.ecice06.com/CN/Y2018/V44/I7/212

