
Computer Engineering ›› 2020, Vol. 46 ›› Issue (9): 68-75. doi: 10.19678/j.issn.1000-3428.0055574

• Artificial Intelligence and Pattern Recognition •

PBK-means Algorithm Based on Distance and Density

WEI Wenhao (魏文浩), TANG Zekun (唐泽坤), LIU Gang (刘刚)

  1. School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
  • Received: 2019-07-24  Revised: 2019-09-05  Published: 2019-09-12
  • About the authors: WEI Wenhao (born 1993), male, M.S. candidate, whose main research interests are data mining and big data processing; TANG Zekun, M.S. candidate; LIU Gang (corresponding author), lecturer.
  • Funding:
    Key Project of the Fundamental Research Funds for the Central Universities, "Research on Early Warning of Urban Public Safety Risk Based on Big Data" (17LZUJBWZD012); Major Program of Philosophy and Social Science Research of the Ministry of Education, "Research on Big Data-Driven Urban Public Safety Risk" (16JZD023).



Abstract: The randomness of the initial center point selection in the K-means algorithm, together with its sensitivity to noise points, makes the clustering result prone to falling into a local optimum. To obtain the best initial clustering centers, this paper proposes a parallel bisecting K-means algorithm based on distance and density. The algorithm first computes the average distance between samples in the dataset. Based on the distances between data points, a weight is calculated for each point, and the point with the largest weight is chosen as the first center. Points whose distance from the first center is less than the average sample distance do not participate in the next round of selection. The weights of the remaining points are multiplied by their distance from the selected center, and the point with the largest product is chosen as the next center. Once the two centers are obtained, the data are assigned to them by distance, so that the class represented by each center is divided into two, and the above steps are repeated on each resulting class. By splitting the data in a way that mimics cell division, the algorithm constructs a full binary tree. When the number of leaf nodes exceeds the number of categories k, clustering stops, the leaf nodes are merged to obtain k initial clustering centers, and the K-means algorithm is then executed. Test results on UCI public datasets show that the proposed algorithm is more efficient and achieves better clustering performance than the traditional K-means algorithm, the Canopy-Kmeans algorithm, the bisecting K-means algorithm, the WK-means algorithm, the MWK-means algorithm and the DCK-means algorithm.

Key words: bisecting K-means algorithm, clustering center, initial center point, weight, data mining
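
The initialization procedure described in the abstract can be illustrated with a short sketch. The code below is only a minimal, non-parallel reading of that description: the weight of a point is assumed here to be the number of neighbours within the average sample distance, and surplus leaves are reduced to k by keeping the k largest ones, since neither the exact weight formula nor the leaf-merging rule is spelled out in the abstract. The helper names split_in_two and pbk_means_init are hypothetical, and NumPy/scikit-learn are used purely for illustration, not as the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def pairwise_distances(X):
        """All pairwise Euclidean distances of the data matrix X (n x d)."""
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    def split_in_two(X):
        """One bisection step, following the description in the abstract.

        Returns local index arrays of the two sub-clusters of X. The weight of
        a point (its number of neighbours within the average sample distance)
        is an assumed, density-style definition, not the paper's formula."""
        d = pairwise_distances(X)
        n = len(X)
        avg_dist = d.sum() / (n * (n - 1))            # average sample distance
        weights = (d < avg_dist).sum(axis=1)          # assumed density weight

        first = int(np.argmax(weights))               # max-weight point -> first centre
        # points closer to the first centre than the average distance sit out
        candidates = np.where(d[first] >= avg_dist)[0]
        if len(candidates) == 0:                      # degenerate fallback: farthest point
            candidates = np.array([int(np.argmax(d[first]))])
        # remaining weights multiplied by the distance to the first centre
        scores = weights[candidates] * d[first, candidates]
        second = int(candidates[np.argmax(scores)])

        to_first = d[first] <= d[second]              # assign each point to the nearer centre
        return np.where(to_first)[0], np.where(~to_first)[0]

    def pbk_means_init(X, k):
        """Grow a binary tree of clusters by repeated bisection and return k
        initial centres (the means of the k largest leaves; the abstract's
        exact leaf-merging rule is not specified)."""
        leaves = [np.arange(len(X))]
        while len(leaves) < k:
            next_level = []
            for idx in leaves:                        # split every leaf, like cell division
                if len(idx) < 2:
                    next_level.append(idx)
                    continue
                left, right = split_in_two(X[idx])
                if len(left) == 0 or len(right) == 0:
                    next_level.append(idx)            # unsplittable leaf stays as it is
                else:
                    next_level += [idx[left], idx[right]]
            if len(next_level) == len(leaves):        # no leaf could be split any further
                break
            leaves = next_level
        leaves = sorted(leaves, key=len, reverse=True)[:k]
        return np.vstack([X[idx].mean(axis=0) for idx in leaves])

    if __name__ == "__main__":
        # three well-separated synthetic blobs; the returned centres seed K-means
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                       for c in ((0, 0), (3, 3), (0, 3))])
        centers = pbk_means_init(X, k=3)
        labels = KMeans(n_clusters=3, init=centers, n_init=1).fit_predict(X)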
