作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程

• 人工智能及识别技术 • 上一篇    下一篇

一种基于谱分割的短文本聚类算法

李晓红1,谢蒙1,马慧芳1,何廷年1,2   

  1. (1.西北师范大学 计算机科学与工程学院,兰州 730070; 2.北京师范大学 信息科学与技术学院,北京 100875)
  • 收稿日期:2015-07-13 出版日期:2016-08-15 发布日期:2016-08-15
  • 作者简介:李晓红(1978-),女,讲师,主研方向为数据挖掘;谢蒙,硕士研究生;马慧芳,副教授、博士;何廷年,副教授、博士研究生。
  • 基金资助:
    国家自然科学基金资助项目(61163039,61363058);甘肃省青年科技基金资助项目(1308TJY085,145RJYA259);中国科学院计算技术研究所智能信息处理重点实验室开放基金资助项目(IIP2014-4)。

A Short Text Clustering Algorithm Based on Spectral Cut

LI Xiaohong  1,XIE Meng  1,MA Huifang  1,HE Tingnian  1,2   

  1. (1.College of Computer Science and Engineering,Northwest Normal University,Lanzhou 730070,China; 2.College of Information Science and Technology,Beijing Normal University,Beijing 100875,China)
  • Received:2015-07-13 Online:2016-08-15 Published:2016-08-15

摘要: 短文本具有稀疏高维的特点,现有聚类算法在大规模短文本上的聚类精度较低且效率低下。针对该问题,提出一种以谱聚类理论作支撑,基于谱分割准则RMcut的新聚类算法。依据谱聚类理论,将短文本集合构建成一张带权无向图,并计算得到文档-文档的相似度矩阵,为聚类算法提供信息。不断迭代地用2-way方式划分该图,划分过程中使用RMcut值作为划分是否终止的条件,利用Prim算法将原图中的顶点加入到聚族中,以得到质量较高的聚类结果。实验结果表明,该算法具有较高的时间性能,与K-means算法、词共现聚类算法及基于免疫的聚类算法相比,聚类结果更准确。

关键词: 短文本, 相似度矩阵, 无向带权图, RMcut准则, 聚类算法

Abstract: Short text has the characteristics of sparsity and high dimension,and the existing clustering algorithm for the large-scale short text has low accuracy and efficiency.Aiming at this problem,a novel clustering method based on spectral clustering theory and spectral cut standard RMcut is proposed.According to spectral clustering theory,short text collection is constructed into a weighted undirected graph,and a document similarity matrix is constructed by calculating the similarity,which provides all information for the clustering algorithm.Two-way method is used to partition the graph into two parts iteratively.RMcut is used as the termination condition in the process of partitioning,and Prim algorithm is utilized to add nodes in the original graph into clusters for the purpose of obtaining high-quality clustering results.Experimental results demonstrate that this algorithm has high time performance and shows better clustering results than other algorithms,such as K-means algorithm,word co-occurrence clustering algorithm and immunity-based clustering algorithm.

Key words: short text, similarity matrix, undirected weighted graph, RMcut criterion, clustering algorithm

中图分类号: