Density Peak Clustering Algorithm Based on Relative Density

doi:10.19678/j.issn.1000-3428.0064368

Abstract

When the density peak clustering algorithm is applied to datasets with uneven density, it tends to merge low-density clusters into high-density clusters or to split high-density clusters into multiple sub-clusters, and errors propagate during the allocation of sample points. To solve these problems, a density peak clustering algorithm based on relative density is proposed. The proposed algorithm introduces information about the sample points in the natural nearest neighborhood, provides a new method for calculating local density, and computes the relative density. After a decision diagram is drawn to determine the cluster centers, a density factor that accounts for the density differences between clusters is used to calculate the clustering distance of each cluster, and the remaining sample points are assigned according to this clustering distance, so that datasets with different shapes and densities can be clustered. Experiments on synthetic and real datasets compare the proposed algorithm with the classical density peak clustering algorithm and three other clustering algorithms. The results show that the proposed algorithm improves the Fowlkes and Mallows Index (FMI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI) by an average of approximately 14, 26, and 21 percentage points, respectively. On datasets with large density differences between clusters, the proposed algorithm also accurately identifies the cluster centers and assigns the remaining sample points.
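For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the classical density peak clustering procedure of Rodriguez and Laio (2014), not the relative-density algorithm proposed in the paper, whose formulas (natural nearest neighborhood density, density factor, clustering distance) are not given in this abstract. The function names `dpc_quantities` and `dpc_cluster`, the Gaussian-kernel density, the cutoff `d_c`, and choosing the `k` points with the largest product of density and distance as centers (instead of reading them off a decision diagram) are all assumptions made for the example.

```python
import math


def dpc_quantities(points, d_c):
    """Local density rho and distance delta of classical DPC (illustrative)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Gaussian-kernel local density with cutoff distance d_c (assumed variant).
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    # Points sorted by decreasing density; ties are broken by index.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n  # nearest point that ranks higher in density
    for pos, i in enumerate(order):
        if pos == 0:
            # The densest point has no denser neighbor: use its largest distance.
            delta[i] = max(dist[i])
        else:
            j = min(order[:pos], key=lambda h: dist[i][h])
            delta[i], parent[i] = dist[i][j], j
    return rho, delta, parent, order, dist


def dpc_cluster(points, d_c, k):
    """Pick k centers by rho * delta, then assign the remaining points."""
    rho, delta, parent, order, dist = dpc_quantities(points, d_c)
    n = len(points)
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:k]
    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    # Each remaining point inherits the label of its nearest denser point.
    # Processing in density order guarantees that the parent is labeled first.
    for i in order:
        if labels[i] != -1:
            continue
        if parent[i] == -1:
            # Fallback if the densest point was not selected as a center.
            nearest = min(centers, key=lambda c: dist[i][c])
            labels[i] = labels[nearest]
        else:
            labels[i] = labels[parent[i]]
    return centers, labels
```

Calling `dpc_cluster(points, d_c, k)` on a list of coordinate tuples returns the indices of the chosen centers and one label per point. Because every non-center point simply inherits the label of its nearest denser neighbor, a single wrong assignment is passed on to all points that depend on it, which is the error propagation the abstract describes.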

Key words: clustering, density peak, relative density, density factor, clustering distance, natural nearest neighborhood

CLC Number: TP301.6

WEI Ya, ZHANG Zhengjun, HE Kailin, TANG Li. Density Peak Clustering Algorithm Based on Relative Density[J]. Computer Engineering, 2023, 49(6): 53-61.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0064368

http://www.ecice06.com/EN/Y2023/V49/I6/53
