Computer Engineering ›› 2007, Vol. 33 ›› Issue (14): 24-26,3.

• Degree Paper

Text Clustering Approach Based on Content Characteristics

LI Xiaoguang 1, SONG Baoyan 1, YU Ge 2, WANG Daling 2   

  1. School of Information Science and Technology, Liaoning University, Shenyang 110036; 2. School of Information Science and Engineering, Northeastern University, Shenyang 110004
  • Online: 2007-07-20 Published: 2007-07-20

一种基于内容特性的文本聚类方法

李晓光1,宋宝燕1,于 戈2,王大玲2   

  1. 辽宁大学信息科学与技术学院,沈阳 110036;2. 东北大学信息科学与工程学院,沈阳 110004

Abstract: In probabilistic-model-based clustering, how well the cluster model fits the data distribution directly affects clustering quality. Because the content-based distribution of documents is complex, a single-component cluster model cannot capture it accurately. This paper argues that the content characteristics of a document are shaped mainly by two components, its topic and a general writing style. It proposes a content-based cluster model that mixes a topic model with a general model, and gives a document clustering algorithm based on that cluster model. Experimental results indicate that the content-based cluster model fits the data better than a single-component model and yields better clustering quality.

Key words: clustering, probabilistic-model-based clustering, mixture model, EM algorithm

摘要: 在基于概率模型的聚类中,簇模型对数据分布的拟合性直接影响着聚类质量。基于内容的文本数据分布的复杂性导致单一因素的簇模型无法准确拟合文本数据的分布特征。该文认为文本基于内容的分布特性主要受主题内容和通用写作方式影响,给出了一种基于主题模型和通用模型的混合簇模型和基于该簇模型的文本聚类方法。实验表明该聚类方法较单一因素的簇模型具有更好的拟合性,聚类质量更好。

关键词: 聚类, 基于概率模型的聚类, 混合模型, EM算法
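
The abstract describes a cluster model that mixes a topic model with a general model and is fitted with the EM algorithm, but it does not give the model's exact form. The following is a minimal sketch of one way such a model could look, not the authors' algorithm. It assumes unigram (multinomial) word distributions, a fixed mixing weight `lam`, a general model estimated once from the whole corpus, and simple additive smoothing `alpha`; the function name `cluster_documents` and all parameter names are hypothetical.

```python
import math
from collections import Counter


def cluster_documents(docs, k, lam=0.5, iters=30, alpha=0.1):
    """Illustrative EM clustering (assumed model, not taken from the paper).

    Each cluster j generates a word w with probability
        lam * topics[j][w] + (1 - lam) * general[w].
    """
    counts = [Counter(d) for d in docs]
    vocab = sorted(set(w for c in counts for w in c))
    total = Counter()
    for c in counts:
        total.update(c)
    n_tokens = sum(total.values())
    # General model: add-one smoothed word frequencies over the whole corpus.
    general = {w: (total[w] + 1) / (n_tokens + len(vocab)) for w in vocab}

    def normalise(weights):
        z = sum(weights.values())
        return {w: v / z for w, v in weights.items()}

    # Deterministic initialisation: seed each topic model from one
    # evenly spaced document (an assumption made for reproducibility).
    topics = [
        normalise({w: counts[i * len(docs) // k][w] + alpha for w in vocab})
        for i in range(k)
    ]
    priors = [1.0 / k] * k

    for _ in range(iters):
        # E-step: posterior probability of each cluster for each document,
        # computed in log space to avoid underflow.
        resp = []
        for c in counts:
            logp = [
                math.log(priors[j])
                + sum(
                    n * math.log(lam * topics[j][w] + (1 - lam) * general[w])
                    for w, n in c.items()
                )
                for j in range(k)
            ]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            s = sum(p)
            resp.append([v / s for v in p])

        # M-step: re-estimate each topic model from the share of every word
        # occurrence attributed to the topic component rather than the
        # general component, weighted by the document's responsibility.
        for j in range(k):
            weights = {w: alpha for w in vocab}
            for c, r in zip(counts, resp):
                for w, n in c.items():
                    t = lam * topics[j][w]
                    weights[w] += r[j] * n * t / (t + (1 - lam) * general[w])
            topics[j] = normalise(weights)
            priors[j] = (sum(r[j] for r in resp) + alpha) / (len(docs) + k * alpha)

    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, topics


# Toy usage: two sports-like and two finance-like documents (made-up data).
docs = [
    "the ball the team the goal".split(),
    "the team the match the goal".split(),
    "the stock the market the bank".split(),
    "the bank the market the trade".split(),
]
labels, topics = cluster_documents(docs, k=2)
```

In this sketch the general model absorbs words that are frequent everywhere (here "the"), so each topic model concentrates on words that distinguish its cluster. That is one reading of why a two-component cluster model might fit document data better than a single-component one.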