
Computer Engineering ›› 2011, Vol. 37 ›› Issue (19): 82-84,87. doi: 10.3969/j.issn.1000-3428.2011.19.026

• Software Technology and Database •

Clustering Algorithm for Data Stream with Heterogeneous Attributes Based on Information Entropy Dimension Reduction

TAN Jian-jian, ZHENG Hong-yuan, DING Qiu-lin

  1. (College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)
  • Received: 2011-03-01 Online: 2011-10-05 Published: 2011-10-05
  • About the authors: TAN Jian-jian (born 1985), male, master's student, research interests: data mining and information security; ZHENG Hong-yuan, associate professor, Ph.D.; DING Qiu-lin, professor, doctoral supervisor

Abstract: Existing data stream clustering algorithms cannot handle high-dimensional data streams with heterogeneous attributes. To address this problem, this paper improves the offline and online clustering processes of the HPStream algorithm: a frequency matrix handles the categorical attributes, and an information-entropy-based selection method for categorical attributes reduces the data dimensionality. Experimental results show that the algorithm effectively handles data sets with heterogeneous attributes and relatively high dimensionality, and that its clustering accuracy is 5% to 15% higher than that of the HPStream algorithm.

Key words: data stream mining, heterogeneous attributes, frequency matrix, information entropy, dimension reduction
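
The abstract names two ingredients, frequency counts for categorical attributes and an entropy-based selection of those attributes, without giving their details. The sketch below is only an illustration of the general idea, not the paper's method: it counts category frequencies per attribute and ranks attributes by the Shannon entropy of those frequencies. The function names, the toy records, and the choice to sort from lowest to highest entropy are all assumptions; the paper's actual frequency matrix layout and selection rule are not stated here.

```python
import math
from collections import Counter


def frequency_table(values):
    """Count how often each category occurs in one categorical attribute."""
    return Counter(values)


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical category distribution."""
    counts = frequency_table(values)
    total = sum(counts.values())
    # p * log2(1 / p) summed over all observed categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_attributes_by_entropy(records, attribute_names):
    """Return (name, entropy) pairs sorted from lowest to highest entropy.

    Which end of the ranking to keep is an assumption left to the caller.
    """
    scored = []
    for index, name in enumerate(attribute_names):
        column = [record[index] for record in records]
        scored.append((name, shannon_entropy(column)))
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy records: (protocol, flag, service)
records = [
    ("tcp", "SF", "http"),
    ("tcp", "SF", "ftp"),
    ("udp", "SF", "dns"),
    ("tcp", "SF", "smtp"),
]
ranking = rank_attributes_by_entropy(records, ["protocol", "flag", "service"])
```

In this toy data the `flag` attribute takes a single value and so has zero entropy, while `service` takes a different value in every record and has the highest entropy. A selection step could then drop attributes at whichever end of the ranking carries less useful information for clustering.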
