• Artificial Intelligence and Pattern Recognition •

### PBK-means Algorithm Based on Distance and Density

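1. School of Information Science and Engineering, Lanzhou University, Lanzhou 730000
• Received: 2019-07-24 Revised: 2019-09-05 Published: 2019-09-12
• About the authors: WEI Wenhao (b. 1993), male, master's student, whose main research interests are data mining and big data processing; TANG Zekun, master's student; LIU Gang (corresponding author), lecturer.
• Funding:
Key Project of the Fundamental Research Funds for the Central Universities, "Research on Urban Public Safety Risk Early Warning Based on Big Data" (17LZUJBWZD012); Major Project of Philosophy and Social Sciences Research of the Ministry of Education, "Research on Big-Data-Driven Urban Public Safety Risk" (16JZD023).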

### PBK-means Algorithm Based on Distance and Density

WEI Wenhao, TANG Zekun, LIU Gang

1. School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
• Received: 2019-07-24 Revised: 2019-09-05 Published: 2019-09-12

Abstract: The K-means algorithm selects its initial center points at random and is sensitive to noise points, so its clustering results easily fall into a local optimum. To obtain better initial cluster centers, this paper proposes a parallel bisecting K-means (PBK-means) algorithm based on distance and density. The algorithm first calculates the average distance between the samples in the dataset. It then computes a weight for each data point from the distances between data points and chooses the point with the largest weight as the first center point. Data points whose distance from the first center point is less than the average sample distance do not participate in the next round of clustering. For each remaining data point, the weight is multiplied by the distance from the selected center point, and the point with the largest product is chosen as the next center point. Once the two center points are obtained, the data are assigned to them according to distance. The class represented by each center point is in turn divided into two categories, and the above steps are repeated on each category. In this way the algorithm segments the data in a manner that resembles cell division and constructs a full binary tree. When the number of leaf nodes exceeds the number of categories, k, splitting stops, and the leaf nodes are merged to obtain k initial cluster centers with which the K-means algorithm is executed. Test results on the UCI public dataset show that the proposed algorithm achieves higher efficiency and better clustering performance than the traditional K-means, Canopy-Kmeans, Bisecting K-means, WK-means, MWK-means, and DCK-means algorithms.
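
The center-selection and splitting steps described in the abstract can be illustrated with a minimal Python sketch of a single bisection. The abstract does not give the weight formula, so the sketch assumes a simple density weight: the number of other samples lying within the average pairwise distance. That weight, the function names, and the tie-breaking rule (first maximum wins) are illustrative assumptions, not the authors' definitions. The parallel execution, the recursive tree construction, and the leaf-merging step are left out.

```python
from itertools import combinations
from math import dist


def average_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of samples."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def density_weights(points, radius):
    """ASSUMED weight: how many other samples lie within `radius` of each sample.

    The paper's actual weight formula is not stated in the abstract.
    """
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def pick_two_centers(points):
    """Choose two initial centers for one bisection step."""
    avg = average_pairwise_distance(points)
    weights = density_weights(points, avg)

    # First center: the most heavily weighted sample (first one on ties).
    first = max(range(len(points)), key=lambda i: weights[i])

    # Samples closer to the first center than the average distance are excluded.
    candidates = [
        i for i in range(len(points))
        if i != first and dist(points[i], points[first]) >= avg
    ]
    if not candidates:
        raise ValueError("no sample lies outside the average-distance radius")

    # Second center: largest product of weight and distance to the first center.
    second = max(
        candidates,
        key=lambda i: weights[i] * dist(points[i], points[first]),
    )
    return points[first], points[second]


def split(points, c1, c2):
    """Assign every sample to the nearer of the two centers."""
    left = [p for p in points if dist(p, c1) <= dist(p, c2)]
    right = [p for p in points if dist(p, c1) > dist(p, c2)]
    return left, right
```

Calling `pick_two_centers` on a list of coordinate tuples and passing the result to `split` yields the two child nodes of one tree level; repeating the same two calls on each child would grow the binary tree until it has more than k leaves. The all-pairs distance computation is quadratic in the number of samples, which is acceptable for a sketch but is presumably where a parallel implementation would matter.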