Computer Engineering ›› 2009, Vol. 35 ›› Issue (6): 88-90. doi: 10.3969/j.issn.1000-3428.2009.06.030

• Software Technology and Database •

Relevancy Coefficient and Its Application Adapted to Short Texts

HE Hai-jiang

  1. (Computer Center, Changsha University, Changsha 410003)
  • Online: 2009-03-20 Published: 2009-03-20

Abstract: To address the Web spam that floods blog communities and BBS forums, a relevancy coefficient vector space model, cVSM, is proposed. Its components serve as features of comments, and a support vector machine classification algorithm is used to identify comment spam automatically. The cVSM includes a relevancy coefficient suited to short texts, which measures the degree of semantic relevance between a comment and the post it is attached to. Experiments on a Chinese blog test set and a Chinese BBS test set show that, compared with methods that use only the text features of the comments, applying the model improves F1 by at least 6%.

Key words: blog, comment spam, support vector machine, text mining, relevancy coefficient
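
The abstract does not give the formula for the relevancy coefficient, so the following is only a rough sketch of the general idea: score how closely a short comment relates to its post, then evaluate a spam classifier with F1. It assumes cosine similarity over character bigrams as a stand-in relevance score, which is not the paper's coefficient. The names `char_bigrams`, `cosine_relevance`, and `f1_score` are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_relevance(post: str, comment: str) -> float:
    """Stand-in relevance score in [0, 1]; NOT the paper's coefficient."""
    p, c = char_bigrams(post), char_bigrams(comment)
    # Counter returns 0 for bigrams missing from the post.
    dot = sum(p[g] * c[g] for g in c)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a pipeline like the one the abstract describes, a score of this kind would be one component of each comment's feature vector, which a support vector machine then classifies as spam or not spam.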
