
Computer Engineering ›› 2007, Vol. 33 ›› Issue (02): 65-67. doi: 10.3969/j.issn.1000-3428.2007.02.022

• Software Technology and Database •

Algorithm of Repeats-based Term Extraction and Its Application in Text Clustering

HU Jixiang1,2, XU Hongbo1, LIU Yue1, CHENG Xueqi1

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
  2. Graduate School, Chinese Academy of Sciences, Beijing 100039
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-01-20 Published: 2007-01-20

Abstract: To address the high dimensionality of Web documents and the challenges that emerging Internet language poses for existing word segmentation systems, this paper proposes a term extraction method based on repeats (repeated strings). The method extracts meaningful terms from text and, for Chinese, requires no word segmentation. Experimental results show that it markedly reduces the dimensionality of the feature space and effectively improves the performance of traditional clustering algorithms that use words as features.

Key words: Text clustering, Term extraction, Repeats
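
The abstract does not give the extraction algorithm itself, so the following is only a minimal illustrative sketch of the general idea of using repeated strings as features without word segmentation. It is not the authors' method. The function names `extract_repeats` and `drop_subsumed`, the length bounds, the document-frequency threshold, and the sample documents are all assumptions made for illustration, and the brute-force n-gram counting stands in for whatever data structure the paper actually uses.

```python
from collections import Counter


def extract_repeats(docs, min_len=2, max_len=4, min_docs=2):
    """Return character substrings that occur in at least `min_docs` documents.

    Hypothetical helper: brute-force n-gram counting, chosen for clarity only.
    """
    doc_freq = Counter()
    for doc in docs:
        seen = set()  # count each substring once per document
        for n in range(min_len, max_len + 1):
            for i in range(len(doc) - n + 1):
                seen.add(doc[i:i + n])
        doc_freq.update(seen)
    return {s: df for s, df in doc_freq.items() if df >= min_docs}


def drop_subsumed(repeats):
    """Drop a repeat when a longer repeat containing it has the same frequency.

    Such a substring carries no information beyond the longer one
    (an assumed pruning rule, not taken from the paper).
    """
    kept = {}
    for s, df in repeats.items():
        subsumed = any(
            s != t and s in t and df == repeats[t] for t in repeats
        )
        if not subsumed:
            kept[s] = df
    return kept


# Made-up sample documents; no word segmentation is applied to the Chinese text.
docs = ["文本聚类算法", "文本聚类方法", "特征提取"]
features = drop_subsumed(extract_repeats(docs))
print(features)  # {'文本聚类': 2}
```

In this toy run the six substrings shared by the first two documents collapse to the single feature 文本聚类 ("text clustering"), which illustrates how repeat-based features can yield a smaller feature space than raw n-grams.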