摘要: 在并行RDD-DBSCAN算法的数据划分和区域查询过程中会对数据集进行重复访问,降低了算法效率。为此,提出基于数据划分和融合策略的并行DBSCAN算法(DBSCAN-PSM)。利用KD树进行数据划分,实现数据分区与区域查询步骤的合并,从而减少数据集的访问次数以及降低I/O过
程对算法效率的影响。采用判定数据点自身属性的方式,对标注为边缘点的数据进行融合,避免全局标记的额外时间开销。实验结果表明,DBSCAN-PSM算法相比RDD-DBSCAN算法可节省18%左右的运行时间,适用于处理海量数据聚类问题。
关键词:
聚类,
DBSCAN算法,
Spark平台,
数据划分,
数据融合
Abstract: The parallel RDD-DBSCAN algorithm has a repeated access to the data set in the data partition and region query steps,which reduces the efficiency of the algorithm.Aiming at the above problems,a parallel DBSCAN algorithm based on data partitioning and fusion stragy(DBSCAN-PSM) is proposed.It imports the KD-tree to partition the data,merges the partition and region query steps,reduces the number of access to the data set and decreases the influence of I/O on the algorithm.Data fusion method is realized by determining the clustering characteristics of the spatial boundary points,which avoids the time overhead of global markup.Experimental results show that DBSCAN-PSM algorithm runs faster than RDD-DBSCAN by 18%.It can deal with mass data clustering problem more effectively.
Key words:
clustering,
DBSCAN algorithm,
Spark platform,
data partitioning,
data fusion
中图分类号: