Data Stream Clustering Algorithm with Variable Flow Rate Based on Skew Distribution

doi:10.3969/j.issn.1000-3428.2013.12.052

Computer Engineering

XING Chang-zheng, HU Quan-bo

(College of Electronics and Information Engineering, Liaoning Technical University, Huludao 125105, China)

Received: 2013-03-07 Online: 2013-12-15 Published: 2013-12-13
About the authors: XING Chang-zheng (born 1967), male, professor, Ph.D., main research interests: artificial intelligence and data mining; HU Quan-bo, master's student

Abstract

Abstract: TDCA, a clustering algorithm for data streams with skewed distributions, falls short in clustering speed and memory utilization, and a variable flow rate in the data stream environment seriously degrades the quality of the clustering results. To address these problems, a data stream clustering algorithm named GR-Stream is proposed. It uses grid cells as the aggregation form for data points and an extended R-tree-based data structure as the index that organizes the grid cells. On this basis, it introduces a pruning strategy and adjusts the way data points enter the tree. Tests on the real dataset KDD-CUP99 show that, compared with TDCA, the algorithm improves access speed during clustering by 40%, saves at least half of the memory usage by applying the pruning strategy, and keeps the average purity of the clustering results above 90% in a variable-flow-rate data stream environment.

Key words: data stream, clustering, temporal density, skew distribution, pruning, variable flow rate
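
The abstract names three ingredients: grid cells that aggregate data points, a temporal density, and pruning. The Python sketch below illustrates only those general ideas and is not the authors' GR-Stream implementation. It assumes an exponential decay of cell density over time, and it stores cells in a plain dictionary rather than the extended R-tree index the paper describes. The class name `DecayedGrid` and the parameters `cell_width`, `decay`, and `prune_below` are hypothetical.

```python
import math


class DecayedGrid:
    """Toy grid summary: each cell keeps a time-decayed point count.

    Illustrative only. A dict stands in for the paper's R-tree-based index,
    and the exponential decay model is an assumption.
    """

    def __init__(self, cell_width=1.0, decay=0.01, prune_below=0.1):
        # All three parameters are hypothetical, not values from the paper.
        self.cell_width = cell_width
        self.decay = decay              # rate in exp(-decay * elapsed_time)
        self.prune_below = prune_below  # density threshold for pruning
        self.cells = {}                 # cell key -> (density, last_update_time)

    def _key(self, point):
        # Map a point to the integer coordinates of its enclosing cell.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def _decayed(self, density, last, now):
        # Density fades with elapsed time since the cell was last updated.
        return density * math.exp(-self.decay * (now - last))

    def insert(self, point, now):
        """Add one point at time `now` and return the key of its cell."""
        key = self._key(point)
        density, last = self.cells.get(key, (0.0, now))
        self.cells[key] = (self._decayed(density, last, now) + 1.0, now)
        return key

    def prune(self, now):
        """Delete cells whose decayed density is below the threshold."""
        stale = [k for k, (d, t) in self.cells.items()
                 if self._decayed(d, t, now) < self.prune_below]
        for k in stale:
            del self.cells[k]
        return len(stale)


grid = DecayedGrid(cell_width=1.0, decay=0.5, prune_below=0.1)
grid.insert((0.2, 0.7), now=0.0)
grid.insert((0.9, 0.1), now=0.0)    # falls in the same cell, (0, 0)
grid.insert((5.5, 5.5), now=0.0)    # isolated point in cell (5, 5)
grid.insert((0.4, 0.4), now=10.0)   # refreshes cell (0, 0) much later
removed = grid.prune(now=10.0)      # the isolated cell has decayed away
```

Because every update carries an explicit timestamp, the decay depends on elapsed time rather than on the number of arrivals. Whether GR-Stream treats variable flow rates this way is not stated in the abstract.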

CLC number: TP18

XING Chang-zheng, HU Quan-bo. Data Stream Clustering Algorithm with Variable Flow Rate Based on Skew Distribution[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2013.12.052.

http://www.ecice06.com/CN/Y2013/V39/I12/247


