Abstract:
This paper proposes a novel term extraction method based on repeats (repeated strings), which extracts meaningful terms from text. For Chinese text, it requires no word segmentation. Experimental results show that the proposed approach can markedly reduce the dimensionality of the feature space and effectively improve the performance of traditional clustering algorithms.
Key words:
Text clustering,
Term extraction,
Repeats
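Abstract (translated from Chinese): To address the high dimensionality of Web documents and the challenges that new Internet language poses to existing word segmentation systems, this paper proposes a feature extraction method based on repeated strings. The method extracts meaningful features from text and requires no word segmentation for Chinese. Experiments show that it reduces the dimensionality of the feature space and effectively improves the performance of traditional clustering algorithms that use words as features.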
Key words (translated from Chinese):
Text clustering,
Feature extraction,
Repeated strings
HU Jixiang; XU Hongbo; LIU Yue; CHENG Xueqi. Algorithm of Repeats-based Term Extraction and Its Application in Text Clustering[J]. Computer Engineering, 2007, 33(02): 65-67.
胡吉祥;许洪波;刘 悦;程学旗. 重复串特征提取算法及其在文本聚类中的应用[J]. 计算机工程, 2007, 33(02): 65-67.