
Computer Engineering ›› 2012, Vol. 38 ›› Issue (20): 41-44. doi: 10.3969/j.issn.1000-3428.2012.20.011

• Software Technology and Database •

Comparative Study of Chinese Microblog Data Cleansing Algorithm

ZOU Hong-cheng, ZHOU Gang, YANG Ya-qiang, LI Xu-dong

  1. (Institute of Information Engineering, Information Engineering University, Zhengzhou 450002, China)
  • Received: 2011-12-13 Revised: 2012-02-09 Online: 2012-10-20 Published: 2012-10-17
  • About the authors: ZOU Hong-cheng (born 1985), male, master's student, main research interest: data mining; ZHOU Gang, associate professor, Ph.D.; YANG Ya-qiang, undergraduate student; LI Xu-dong, master's degree
  • Supported by:

    National "863" Program of China (2009AA043303); Open Project Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2011KF-06)

Abstract: Microblog language is colloquial and non-standard, which lowers the quality of microblog data. To address this problem, this paper cleanses microblog topic data with three algorithms (centroid, degree centrality, and eigenvector centrality) in order to improve data quality. The performance of the algorithms is analyzed by comparing attribute metrics of the topic posts, such as normativity, relevance, and helpfulness, before and after cleansing. Experimental results show that, after processing by the three cleansing algorithms, the overall quality of the topic posts improves, especially on the normativity metric. The centroid algorithm cleanses well with respect to the helpfulness metric, while the degree centrality and eigenvector centrality algorithms help to obtain topic posts with strong similarity.

Key words: microblog, quality indicator, filtering, centrality, data cleansing
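
The abstract names the three scoring approaches but not their details, so the following is only an illustrative sketch of how centroid, degree, and eigenvector scores might be computed for the posts of one topic. Everything specific here is an assumption rather than the authors' method: posts are taken to be already word-segmented, each post is a bag-of-words term-frequency vector, similarity is cosine similarity, the graph links posts whose similarity reaches a hypothetical threshold of 0.4, and the function names are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid_scores(vectors):
    """Score each post by its similarity to the topic centroid.

    The summed vector is used instead of the mean: cosine similarity
    ignores scale, so the ranking is the same.
    """
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    return [cosine(v, centroid) for v in vectors]


def similarity_graph(vectors, threshold):
    """Adjacency matrix: 1.0 where two distinct posts are similar enough."""
    n = len(vectors)
    return [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def degree_scores(adj):
    """Degree score: number of sufficiently similar neighbours."""
    return [sum(row) for row in adj]


def eigenvector_scores(adj, iterations=100):
    """Eigenvector score via power iteration.

    Iterates on (A + I), which has the same eigenvectors as A but
    does not oscillate on bipartite graphs.
    """
    n = len(adj)
    scores = [1.0] * n
    for _ in range(iterations):
        new = [scores[i] + sum(adj[i][j] * scores[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            return new
        scores = [x / norm for x in new]
    return scores


# Hypothetical, already-tokenized posts for one topic (assumed data).
posts = [
    ["earthquake", "rescue", "team", "arrives"],
    ["earthquake", "rescue", "donation", "update"],
    ["rescue", "team", "donation", "earthquake"],
    ["lol", "so", "bored", "today"],
]
vectors = [Counter(p) for p in posts]
adj = similarity_graph(vectors, threshold=0.4)  # threshold is an assumption

print("centroid:   ", centroid_scores(vectors))
print("degree:     ", degree_scores(adj))
print("eigenvector:", eigenvector_scores(adj))
```

Under these assumptions, cleansing would amount to keeping the highest-scoring posts and discarding the rest. In the toy data the off-topic fourth post shares no terms with the others, so it receives the lowest score under all three measures.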
