Computer Engineering ›› 2018, Vol. 44 ›› Issue (7): 212-218. doi: 10.19678/j.issn.1000-3428.0047259

• Artificial Intelligence and Recognition Technology •

Recognition Method of Microblogging Spam Comment Based on Co-Training

LI Zhixin 1,2, LAN Danmei 1,2, ZHANG Canlong 1,2, TANG Suqin 1,2

  1. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China
  2. Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Guilin, Guangxi 541004, China
  • Received: 2017-05-18 Online: 2018-07-15 Published: 2018-07-15
  • About the authors: LI Zhixin (born 1971), male, professor, Ph.D.; main research interests are data mining, image understanding, and machine learning. LAN Danmei, master's student. ZHANG Canlong, associate professor, Ph.D. TANG Suqin, professor, Ph.D.
  • Supported by:

    National Natural Science Foundation of China (61663004, 61363035, 61365009); Guangxi Natural Science Foundation (2016GXNSFAA380146, 2017GXNSFAA198365); Director's Fund of the Guangxi Key Lab of Multi-source Information Mining and Security (16-A-03-02); Guangxi Degree and Postgraduate Education Reform Project (JGY2015031).

Abstract:

The large volume of spam comments on microblogs can harm individuals, society, and even the country. To identify spam comments on microblogs, a recognition method based on co-training is proposed. A rule-based recognition method is first defined to filter out explicit spam comments, and the remaining comments are categorized as relevant comments. An AdaBoost classifier and a Support Vector Machine (SVM) classifier are then constructed and trained collaboratively with the Co-Training algorithm to determine whether each remaining comment is spam, which improves classification accuracy and reduces sample labeling work. Experimental results show that this method recognizes spam comments better than the method based on similarity calculation and the method based on multiple comment features.

Key words: microblogging spam comment, co-training, Tongyici Cilin (synonym thesaurus), Support Vector Machine (SVM), similarity computation
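
The co-training step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the filtering rules, the features, or the exact exchange scheme, so everything below is an assumption for illustration: `CentroidClassifier` is a hypothetical stand-in learner (not AdaBoost or SVM), the two learners differ only in which feature column they see, the toy 2-D feature vectors are invented, and labels are 0 (relevant) and 1 (spam). In each round, each learner pseudo-labels the unlabeled comments it is most confident about and hands them to the other learner's training pool.

```python
from statistics import mean


class CentroidClassifier:
    """Stand-in binary learner (nearest class centroid on chosen columns).

    Hypothetical placeholder, NOT the AdaBoost or SVM classifier of the paper.
    """

    def __init__(self, cols):
        self.cols = cols  # feature columns this learner is allowed to see

    def _view(self, x):
        return [x[j] for j in self.cols]

    def fit(self, X, y):
        self.classes_ = sorted(set(y))  # assumes exactly two labels
        self.centroids_ = []
        for c in self.classes_:
            rows = [self._view(x) for x, t in zip(X, y) if t == c]
            self.centroids_.append([mean(col) for col in zip(*rows)])
        return self

    def decision_function(self, X):
        # Positive score: closer to classes_[1]; magnitude serves as confidence.
        def dist2(x, c):
            return sum((a - b) ** 2 for a, b in zip(x, c))

        c0, c1 = self.centroids_
        return [dist2(self._view(x), c0) - dist2(self._view(x), c1) for x in X]

    def predict(self, X):
        return [self.classes_[1] if s > 0 else self.classes_[0]
                for s in self.decision_function(X)]


def co_train(clf_a, clf_b, X_lab, y_lab, X_unlab, rounds=5, per_round=2):
    """Co-training loop: each learner teaches the other with its most
    confident pseudo-labels, shrinking the unlabeled set every round."""
    pool_a = (list(X_lab), list(y_lab))  # training pool of clf_a
    pool_b = (list(X_lab), list(y_lab))  # training pool of clf_b
    unlabeled = list(X_unlab)
    for _ in range(rounds):
        clf_a.fit(*pool_a)
        clf_b.fit(*pool_b)
        for teacher, student_pool in ((clf_a, pool_b), (clf_b, pool_a)):
            if not unlabeled:
                break
            scores = teacher.decision_function(unlabeled)
            labels = teacher.predict(unlabeled)
            # Indices of the samples the teacher is most confident about.
            order = sorted(range(len(unlabeled)),
                           key=lambda i: -abs(scores[i]))[:per_round]
            for i in order:
                student_pool[0].append(unlabeled[i])
                student_pool[1].append(labels[i])
            chosen = set(order)
            unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen]
    # Refit so both learners see every pseudo-label they received.
    clf_a.fit(*pool_a)
    clf_b.fit(*pool_b)
    return clf_a, clf_b


# Toy data (invented): label 0 = relevant comment, label 1 = spam comment.
X_lab = [[0, 0], [1, 0], [5, 5], [4, 5]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0, 1], [1, 1], [4, 4], [5, 4], [0.5, 0.5], [4.5, 4.5]]

clf_a, clf_b = co_train(CentroidClassifier(cols=(0,)),
                        CentroidClassifier(cols=(1,)),
                        X_lab, y_lab, X_unlab)
print(clf_a.predict([[0.2, 0.3], [4.8, 4.9]]),
      clf_b.predict([[0.2, 0.3], [4.8, 4.9]]))
```

The loop only requires `fit`, `predict`, and `decision_function`, so binary classifiers such as scikit-learn's `AdaBoostClassifier` and `SVC`, which provide these methods, could be passed in place of the stand-ins. Only four comments are hand-labeled here while six are labeled by the learners themselves, which is the sense in which co-training reduces labeling work.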

CLC number: