Computer Engineering

• Artificial Intelligence and Recognition Technology •

Semi-supervised Microblog Clustering Algorithm Fused with Term Correlation Relationship

MA Huifang, JIA Meihuizi, YUAN Yuan, ZHANG Zhichang

  1. (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
  • Received: 2014-06-03 Online: 2015-05-15 Published: 2015-05-15
  • About the authors: MA Huifang (1981 - ), female, associate professor, Ph.D.; main research interests: artificial intelligence, data mining, machine learning. JIA Meihuizi and YUAN Yuan, master's students. ZHANG Zhichang, associate professor, Ph.D.
  • Supported by:
    National Natural Science Foundation of China (61163039, 61363058); Foundation of the Education Department of Gansu Province (2013A-016).

Abstract: Microblog text is short, sparse, and high-dimensional. To compensate for the limited message length, this paper presents an improved semi-supervised microblog clustering algorithm that exploits the semantic information inside the text itself. The key idea is to enrich text features with term correlation data, which captures semantic information for term weighting and provides greater context for short texts. Inter-document and intra-document term correlations are defined, with direct and indirect dependency weights between terms, to reveal how strongly terms are semantically related, and labeled data are generated automatically from them to guide clustering. This prior knowledge about terms is encoded as pair-wise must-link and cannot-link constraints, and microblog clustering is formulated as a semi-supervised non-negative matrix tri-factorization co-clustering model built on these constraints. Experiments on two real-world microblog datasets show that the algorithm reduces the labor-intensive manual labeling process, exploits the hidden information in the microblogs themselves, and clusters microblogs efficiently.

Key words: microblog, term correlation relationship, pair-wise constraints, semi-supervised clustering, non-negative matrix factorization

CLC Number:
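The pipeline outlined in the abstract (term correlations, then pair-wise term constraints, then a constrained tri-factorization) can be illustrated with a minimal sketch. It is not the authors' model. Cosine similarity between term rows stands in for the paper's correlation weights, the thresholds `must_thr` and `cannot_thr` and the weight `alpha` are hypothetical parameters, and the penalty is a generic graph-Laplacian term on must-link pairs plus a co-assignment penalty on cannot-link pairs.

```python
import numpy as np

def term_constraints(X, must_thr=0.8, cannot_thr=0.0):
    """Build must-link / cannot-link matrices from a term-by-document matrix X.

    Assumption: cosine similarity of term rows replaces the paper's
    inter-/intra-document correlation weights.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)
    corr = unit @ unit.T
    must = (corr >= must_thr).astype(float)      # strongly related terms
    cannot = (corr <= cannot_thr).astype(float)  # terms that never co-occur
    np.fill_diagonal(must, 0.0)
    np.fill_diagonal(cannot, 0.0)
    return must, cannot

def constrained_tri_nmf(X, k_terms, k_docs, must, cannot,
                        alpha=0.1, iters=200, seed=0):
    """Approximate X by F @ S @ G.T with non-negative factors.

    Multiplicative updates; must-link pairs pull term rows of F together
    (Laplacian term), cannot-link pairs are penalised for sharing a cluster.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k_terms)) + 0.1
    S = rng.random((k_terms, k_docs)) + 0.1
    G = rng.random((n, k_docs)) + 0.1
    degree = must.sum(axis=1, keepdims=True)  # diagonal of the degree matrix
    eps = 1e-12
    for _ in range(iters):
        num_F = X @ G @ S.T + alpha * (must @ F)
        den_F = F @ S @ G.T @ G @ S.T + alpha * (degree * F + cannot @ F) + eps
        F *= num_F / den_F
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

# Toy data: 4 terms x 5 microblogs, two topical blocks (illustrative only).
X = np.array([[3, 2, 1, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 0, 3, 2],
              [0, 0, 0, 2, 3]], dtype=float)
must, cannot = term_constraints(X)
F, S, G = constrained_tri_nmf(X, k_terms=2, k_docs=2, must=must, cannot=cannot)
doc_labels = G.argmax(axis=1)  # cluster index per microblog
```

Each microblog is assigned to the column of `G` with the largest entry. The constraints are generated from the data rather than labeled by hand, which is the point the abstract makes about reducing manual labeling.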