Abstract: Existing data stream clustering algorithms cannot handle high-dimensional data streams with heterogeneous (mixed numeric and categorical) attributes. To address this problem, this paper improves both the off-line and the on-line clustering processes of the HPStream algorithm: a frequency matrix is used to handle the categorical attributes, and an information-entropy-based method for selecting categorical attributes reduces the dimensionality of the data. Experimental results show that the algorithm handles data sets with heterogeneous attributes and relatively high dimensionality effectively. Compared with the HPStream algorithm, its clustering precision is 5% to 15% higher.
Key words:
data stream mining,
heterogeneous attributes,
frequency matrix,
information entropy,
dimension reduction
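The abstract names two ingredients, a frequency matrix for categorical attributes and entropy-based attribute selection, without giving their details. The following is a minimal sketch of how such a step might look, not the authors' implementation. It assumes that the frequency matrix is a per-attribute count of value occurrences, that entropy is Shannon entropy in bits, and that the attributes with the lowest entropy are kept; the abstract does not state the selection direction or threshold. The function names and the toy records are hypothetical.

```python
import math
from collections import Counter


def frequency_table(records, attr_index):
    """Count how often each value of one categorical attribute occurs."""
    return Counter(record[attr_index] for record in records)


def entropy(freq):
    """Shannon entropy (in bits) of a value-frequency table."""
    total = sum(freq.values())
    return -sum((c / total) * math.log2(c / total) for c in freq.values())


def select_attributes(records, categorical_indices, k):
    """Return the k categorical attribute indices with the lowest entropy.

    Keeping the lowest-entropy attributes is an assumption made for this
    sketch; the abstract does not specify the selection rule.
    """
    scored = [
        (entropy(frequency_table(records, i)), i) for i in categorical_indices
    ]
    scored.sort()  # ascending entropy, ties broken by attribute index
    return [i for _, i in scored[:k]]


# Invented toy records with three categorical attributes each.
records = [
    ("a", "x", "p"),
    ("a", "x", "p"),
    ("a", "y", "p"),
    ("a", "z", "q"),
]

selected = select_attributes(records, categorical_indices=[0, 1, 2], k=2)
```

In this toy data, attribute 0 takes a single value and attribute 2 is nearly constant, so under the lowest-entropy assumption those two are kept and attribute 1, whose values are the most spread out, is dropped.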
TAN Jian-Jian, ZHENG Hong-Yuan, DING Qiu-Lin. Clustering Algorithm for Data Stream with Heterogeneous Attributes Based on Information Entropy Dimension Reduction[J]. Computer Engineering, 2011, 37(19): 82-84, 87.