DBSCAN Optimization Algorithm Based on KD-tree Partitioning in Cloud Computing

doi:10.3969/j.issn.1000-3428.2017.04.004

Computer Engineering

CHEN Guangsheng ^1,2, CHENG Yiqun ^1,2, JING Weipeng ^1,2

(1. College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; 2. Heilongjiang Province Engineering Technology Research Center for Forestry Ecological Big Data Storage and High Performance (Cloud) Computing, Harbin 150040, China)

Received: 2016-03-18 Online: 2017-04-15 Published: 2017-04-14
About the authors: CHEN Guangsheng (born 1969), male, professor, Ph.D., whose main research interests are big data storage and cloud computing; CHENG Yiqun, master's student; JING Weipeng, associate professor, Ph.D.
Funding: Key Program of the Natural Science Foundation of Heilongjiang Province (ZD201403); Special Fund for Forestry Scientific Research in the Public Welfare (201504307).

Abstract

Abstract: In the parallel RDD-DBSCAN algorithm, the data partitioning and region query steps access the data set repeatedly, which reduces the efficiency of the algorithm. To address this problem, a parallel DBSCAN algorithm based on a data partitioning and fusion strategy (DBSCAN-PSM) is proposed. It uses a KD-tree to partition the data and merges the partitioning and region query steps, which reduces the number of accesses to the data set and the impact of I/O on the algorithm's efficiency. Data labeled as boundary points are fused by examining the attributes of each point itself, which avoids the extra time overhead of global labeling. Experimental results show that DBSCAN-PSM reduces running time by about 18% compared with RDD-DBSCAN and is suitable for clustering massive data sets.

Key words: clustering, DBSCAN algorithm, Spark platform, data partitioning, data fusion
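
The partitioning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' DBSCAN-PSM implementation and does not run on Spark. It assumes median splits on alternating axes, a hypothetical leaf size `max_points`, and a halo of width `eps` around each leaf's bounding box so that neighbourhood queries near a partition border could be answered locally. All function names here are invented for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


def kd_partition(points: List[Point], max_points: int, depth: int = 0) -> List[List[Point]]:
    """Split points into leaves of at most max_points using median cuts."""
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if len(points) <= max_points:
        return [points]
    axis = depth % len(points[0])            # cycle through the dimensions
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                  # median cut keeps leaves balanced
    return (kd_partition(ordered[:mid], max_points, depth + 1)
            + kd_partition(ordered[mid:], max_points, depth + 1))


def bounding_box(points: List[Point]) -> Box:
    """Axis-aligned bounding box of a non-empty list of points."""
    dims = range(len(points[0]))
    lower = tuple(min(p[d] for p in points) for d in dims)
    upper = tuple(max(p[d] for p in points) for d in dims)
    return lower, upper


def near_box(p: Point, box: Box, eps: float) -> bool:
    """True if p lies inside the box after growing it by eps on every side."""
    lower, upper = box
    return all(lower[d] - eps <= p[d] <= upper[d] + eps for d in range(len(p)))


def with_halo(leaves: List[List[Point]], eps: float) -> List[Tuple[List[Point], List[Point]]]:
    """Pair each leaf with the points of other leaves within eps of its box."""
    result = []
    for i, leaf in enumerate(leaves):
        box = bounding_box(leaf)
        halo = [p for j, other in enumerate(leaves) if j != i
                for p in other if near_box(p, box, eps)]
        result.append((leaf, halo))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
           (10.0, 0.0), (11.0, 0.0), (12.0, 0.0), (13.0, 0.0)]
    for leaf, halo in with_halo(kd_partition(pts, max_points=4), eps=1.0):
        print(len(leaf), "own points,", len(halo), "halo points")
```

In this sketch the halo points are the candidates that a later fusion step would have to reconcile across partitions. How DBSCAN-PSM actually labels and fuses boundary points is described only at a high level in the abstract, so that step is left out.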

CLC number: TP311

CHEN Guangsheng, CHENG Yiqun, JING Weipeng. DBSCAN Optimization Algorithm Based on KD-tree Partitioning in Cloud Computing[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2017/V43/I4/21


