
Computer Engineering (计算机工程) ›› 2008, Vol. 34 ›› Issue (18): 39-41. doi: 10.3969/j.issn.1000-3428.2008.18.014

• Software Technology and Database •

VSM-based Text Clustering Algorithm

YAO Qing-yun, LIU Gong-shen, LI Xiang

  1. (School of Information Security Engineering, Shanghai Jiaotong University, Shanghai 200240)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-09-20 Published: 2008-09-20

Abstract: Text clustering, an important research branch of clustering, is the application of clustering methods to text processing. This paper discusses Vector Space Model (VSM)-based text clustering methods and presents an improved text clustering algorithm, the Level-Panel (LP) algorithm. In addition, based on the clustering results actually obtained on a corpus, it proposes optimizations in areas such as dimension determination and feature selection. Experiments show that the LP algorithm effectively reduces the time consumed by clustering and offers high practicability and flexibility.

Key words: Vector Space Model (VSM), text clustering, corpus
