Abstract:
Colloquial and non-standard language lowers the quality of microblog data. To obtain higher-quality data, this paper cleanses microblog topic data with three algorithms: the centroid algorithm, the degree-centrality (degree-epicenter value) algorithm, and the eigenvector-centrality (eigenvector-epicenter value) algorithm. The performance of the three algorithms is analyzed by comparing three attribute metrics of topic posts, namely normativity, relevance, and helpfulness, before and after cleansing. Experimental results show that all three algorithms improve the overall quality of the posts, especially the normativity metric; the centroid algorithm performs better on the helpfulness attribute, while the degree-centrality and eigenvector-centrality algorithms help to obtain posts with strong similarity.
Key words: microblog, quality indicator, filtering, epicenter value, data cleansing
摘要: 针对微博语言口语化和不规范导致微博数据质量低下的问题,利用质心、度-中心值和特征向量-中心值3种算法对微博话题数据进行净化,从而提高数据质量。通过比较净化前后话题帖子的规范性、相关性和有益性等属性指标分析算法性能。实验结果表明,经过 3种净化算法处理,话题帖子的整体质量尤其是规范性指标均有所提高,质心算法对于有益性指标有较好的净化效果,度-中心值和特征向量-中心值算法有助于得到强相似度的话题帖子。
关键词: 微博, 质量指标, 过滤, 中心值, 数据净化
ZOU Hong-Cheng, ZHOU Gang, YANG Ya-Qiang, LI Xu-Dong. Comparative Study of Chinese Microblog Data Cleansing Algorithm[J]. Computer Engineering, 2012, 38(20): 41-44.
邹鸿程, 周刚, 杨亚强, 李旭东. 中文微博数据净化算法比较研究[J]. 计算机工程, 2012, 38(20): 41-44.