Abstract: This paper proposes an automatic text clustering method. A Genetic Algorithm (GA) is applied to select text feature items globally and quickly, which reduces the dimensionality of the document representation. Borrowing the idea of probabilistic anonymity, the method uses the combinations of the weights of different feature items in the documents to design an optimized polynomial-time clustering algorithm based on dynamic programming: the document set is divided into an appropriate number of partitions, and each partition is clustered to form the initial clusters. The same method is then applied to re-cluster all initial clusters, producing the final text clusters. Experimental results show that the method selects feature items effectively, improves text clustering precision, and reduces clustering time.
Key words: text clustering, Genetic Algorithm (GA), feature item selection, feature item weight partition
CLC number:
YU Yong-Hong (余永红), BAI Wen-Yang (柏文阳). Text Clustering Based on Automatic Partition of Feature Item Weight[J]. Computer Engineering (计算机工程), 2011, 37(11): 25-27.