
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Data Stream Clustering Algorithm with Variable Flow Rate Based on Skew Distribution

XING Chang-zheng, HU Quan-bo

  1. (College of Electronics and Information Engineering, Liaoning Technical University, Huludao 125105, China)
  • Received: 2013-03-07 Online: 2013-12-15 Published: 2013-12-13
  • About the authors: XING Chang-zheng (born 1967), male, professor, Ph.D., research interests: artificial intelligence and data mining; HU Quan-bo, master's student

Abstract: TDCA, a clustering algorithm for data streams with skewed distributions, falls short in clustering speed and memory utilization, and a data stream environment with a variable flow rate seriously degrades the quality of its clustering results. To address these problems, a data stream clustering algorithm named GR-Stream is proposed. It uses grid cells as the aggregation form for data points and an extended data structure based on the R-tree as the index that organizes the grid cells. On top of this structure it introduces a pruning strategy and adjusts the way data points enter the tree. Tests on the real dataset KDD-CUP99 show that, compared with the TDCA algorithm, the proposed algorithm improves access speed during clustering by 40%, saves at least half of the memory usage through the pruning strategy, and keeps the average purity of the clustering results above 90% in a data stream environment with a variable flow rate.

Key words: data stream, clustering, temporal density, skew distribution, pruning, variable flow rate
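
The general idea of summarizing stream points in grid cells, tracking a time-decayed density per cell, and pruning sparse cells can be illustrated with a minimal sketch. It is not the GR-Stream implementation: a plain dictionary stands in for the R-tree-based index, and the exponential decay form, the class `GridSummary`, and the parameters `cell_width`, `decay_rate`, and `prune_threshold` are all assumptions introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    density: float = 0.0      # time-decayed count of points in the cell
    last_update: float = 0.0  # timestamp of the most recent insertion


class GridSummary:
    """Hypothetical grid-cell summary of a data stream (illustrative only)."""

    def __init__(self, cell_width, decay_rate, prune_threshold):
        self.cell_width = cell_width
        self.decay_rate = decay_rate            # assumed decay parameter
        self.prune_threshold = prune_threshold  # assumed sparsity cutoff
        self.cells = {}  # dict used here in place of an R-tree-based index

    def cell_key(self, point):
        # Map each coordinate to the integer index of its grid cell.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def _decayed(self, cell, now):
        # Assumed decay: density halves every 1 / decay_rate time units.
        return cell.density * 2.0 ** (-self.decay_rate * (now - cell.last_update))

    def insert(self, point, now):
        # Decay the cell's density to the current time, then count the point.
        key = self.cell_key(point)
        cell = self.cells.setdefault(key, Cell(0.0, now))
        cell.density = self._decayed(cell, now) + 1.0
        cell.last_update = now
        return key

    def prune(self, now):
        # Drop cells whose decayed density is below the threshold.
        stale = [k for k, c in self.cells.items()
                 if self._decayed(c, now) < self.prune_threshold]
        for k in stale:
            del self.cells[k]
        return len(stale)
```

Because each cell stores only a density and a timestamp, the cost of an insertion does not depend on how many points the cell has absorbed, and pruning bounds memory by discarding cells that have stopped receiving points. How GR-Stream itself defines density and chooses what to prune is specified in the paper, not here.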

CLC Number: