
Computer Engineering ›› 2020, Vol. 46 ›› Issue (9): 68-75. doi: 10.19678/j.issn.1000-3428.0055574

• Artificial Intelligence and Pattern Recognition •

PBK-means Algorithm Based on Distance and Density

WEI Wenhao (魏文浩), TANG Zekun (唐泽坤), LIU Gang (刘刚)

  1. School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
  • Received: 2019-07-24  Revised: 2019-09-05  Published: 2019-09-12
  • About the authors: WEI Wenhao (born 1993), male, M.S. candidate, whose main research interests are data mining and big data processing; TANG Zekun, M.S. candidate; LIU Gang (corresponding author), lecturer.
  • Funding:
    Key Project of the Fundamental Research Funds for the Central Universities, "Research on Early Warning of Urban Public Safety Risk Based on Big Data" (17LZUJBWZD012); Major Program of Philosophy and Social Science Research of the Ministry of Education, "Research on Big Data-Driven Urban Public Safety Risk" (16JZD023).



Abstract: The randomness of the initial center point selection in the K-means algorithm, together with its sensitivity to noise points, makes the clustering result prone to falling into a local optimum. To obtain the best initial clustering centers, this paper proposes a parallel bisecting K-means algorithm based on distance and density. The algorithm first computes the average distance between samples in the dataset. Based on the distances between data points, a weight is calculated for each point, and the point with the largest weight is chosen as the first center. Points whose distance from the first center is less than the average sample distance do not participate in the next round of selection. The weights of the remaining points are multiplied by their distance from the selected center, and the point with the largest product is chosen as the next center. Once the two centers are obtained, the data are assigned to them by distance, so that the class represented by each center is divided into two, and the above steps are repeated on each resulting class. By splitting the data in a way that mimics cell division, the algorithm constructs a full binary tree. When the number of leaf nodes exceeds the number of categories k, clustering stops, the leaf nodes are merged to obtain k initial clustering centers, and the K-means algorithm is then executed. Test results on UCI public datasets show that the proposed algorithm is more efficient and achieves better clustering performance than the traditional K-means algorithm, the Canopy-Kmeans algorithm, the bisecting K-means algorithm, the WK-means algorithm, the MWK-means algorithm and the DCK-means algorithm.

Key words: bisecting K-means algorithm, clustering center, initial center point, weight, data mining
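
The initialization procedure described in the abstract can be illustrated with a short sketch. The code below is only a minimal, non-parallel reading of that description: the weight of a point is assumed here to be the number of neighbours within the average sample distance, and surplus leaves are reduced to k by keeping the k largest ones, since neither the exact weight formula nor the leaf-merging rule is spelled out in the abstract. The helper names split_in_two and pbk_means_init are hypothetical, and NumPy/scikit-learn are used purely for illustration, not as the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def pairwise_distances(X):
        """All pairwise Euclidean distances of the data matrix X (n x d)."""
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    def split_in_two(X):
        """One bisection step, following the description in the abstract.

        Returns local index arrays of the two sub-clusters of X. The weight of
        a point (its number of neighbours within the average sample distance)
        is an assumed, density-style definition, not the paper's formula."""
        d = pairwise_distances(X)
        n = len(X)
        avg_dist = d.sum() / (n * (n - 1))            # average sample distance
        weights = (d < avg_dist).sum(axis=1)          # assumed density weight

        first = int(np.argmax(weights))               # max-weight point -> first centre
        # points closer to the first centre than the average distance sit out
        candidates = np.where(d[first] >= avg_dist)[0]
        if len(candidates) == 0:                      # degenerate fallback: farthest point
            candidates = np.array([int(np.argmax(d[first]))])
        # remaining weights multiplied by the distance to the first centre
        scores = weights[candidates] * d[first, candidates]
        second = int(candidates[np.argmax(scores)])

        to_first = d[first] <= d[second]              # assign each point to the nearer centre
        return np.where(to_first)[0], np.where(~to_first)[0]

    def pbk_means_init(X, k):
        """Grow a binary tree of clusters by repeated bisection and return k
        initial centres (the means of the k largest leaves; the abstract's
        exact leaf-merging rule is not specified)."""
        leaves = [np.arange(len(X))]
        while len(leaves) < k:
            next_level = []
            for idx in leaves:                        # split every leaf, like cell division
                if len(idx) < 2:
                    next_level.append(idx)
                    continue
                left, right = split_in_two(X[idx])
                if len(left) == 0 or len(right) == 0:
                    next_level.append(idx)            # unsplittable leaf stays as it is
                else:
                    next_level += [idx[left], idx[right]]
            if len(next_level) == len(leaves):        # no leaf could be split any further
                break
            leaves = next_level
        leaves = sorted(leaves, key=len, reverse=True)[:k]
        return np.vstack([X[idx].mean(axis=0) for idx in leaves])

    if __name__ == "__main__":
        # three well-separated synthetic blobs; the returned centres seed K-means
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                       for c in ((0, 0), (3, 3), (0, 3))])
        centers = pbk_means_init(X, k=3)
        labels = KMeans(n_clusters=3, init=centers, n_init=1).fit_predict(X)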
