
Computer Engineering ›› 2023, Vol. 49 ›› Issue (11): 85-93. doi: 10.19678/j.issn.1000-3428.0065422

• Artificial Intelligence and Pattern Recognition •

Improved Method for Partitioning Clustering k-values and Center Initialization

Fengrui SU, Weiwei MU, Xuanming ZHAO, Zhifeng QIU*

  1. School of Automation, Central South University, Changsha 410000, China
  • Received:2022-08-03 Online:2023-11-15 Published:2023-02-20
  • Contact: Zhifeng QIU
  • About the authors:

    Fengrui SU (born 1998), female, master's student; main research interests: clustering and data mining

    Weiwei MU, master's student

    Xuanming ZHAO, doctoral student

  • Supported by:
    General Program of the National Natural Science Foundation of China (62073345)

Abstract:

Partition clustering methods are widely used because of their clear structure and high time efficiency. In real industrial processes that lack prior knowledge, however, it is difficult to choose the number of clusters and the initial centers sensibly, which significantly degrades clustering performance. To address the indistinct elbow point produced by the sum of squared errors (SSE) method, the method of Proportional Principal Deviation with Sum of Squared Errors (PPD-SSE) is proposed. A principal deviation term is added to the SSE to strengthen the trend near the elbow point, and a proportion value is introduced to avoid abrupt changes in that trend, so that the number of clusters can be selected more accurately. To address the excessive randomness of k-means++ when it selects initial centers for high-dimensional data, the Roulette Wheel Reconstruction-k-means++ (RWR-kmeans++) method is proposed. The probability roulette wheel is reconstructed from the squared distances to the already selected centers, combined with a lower limit on probability. This raises the probability of selecting dissimilar data points, reduces the randomness of initial value selection, and makes the clustering results better and more stable. Experimental results on the UEA & UCR public datasets show that PPD-SSE effectively increases the elbow deflection angle and the accuracy of cluster number prediction, and that RWR-kmeans++ effectively improves the dissimilarity of the selected initial values and the clustering performance.

Key words: partition clustering, proportional principal deviation, roulette wheel reconstruction, cluster number selection, initial value selection
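
The abstract does not give the formulas behind RWR-kmeans++, so the following is only a minimal sketch of one possible reading, not the authors' method. It starts from standard k-means++ seeding, where each point is drawn with probability proportional to its squared distance to the nearest selected center, and then rebuilds the roulette wheel by dropping every point whose probability falls below a lower limit and renormalizing. The function name `thresholded_d2_seeding`, the parameter `p_min`, and the drop-and-renormalize rule are all assumptions made for illustration.

```python
import numpy as np


def thresholded_d2_seeding(X, k, p_min=0.0, seed=None):
    """Return indices of k initial centers chosen by thresholded D^2 sampling.

    p_min is a hypothetical lower limit (an assumption, not from the paper):
    points whose D^2 probability is below it are removed from the wheel.
    With p_min=0 this reduces to plain k-means++ seeding.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # First center: uniform at random, as in standard k-means++.
    chosen = [int(rng.integers(n))]
    # Squared distance from each point to its nearest selected center.
    d2 = np.sum((X - X[chosen[0]]) ** 2, axis=1)
    while len(chosen) < k:
        total = d2.sum()
        if total > 0:
            probs = d2 / total  # standard D^2 roulette wheel
            keep = probs >= p_min
            if not keep.any():  # limit too strict: keep only the farthest points
                keep = probs == probs.max()
            probs = np.where(keep, probs, 0.0)
        else:
            # Every point coincides with a center: uniform over unselected points.
            probs = np.ones(n)
            probs[chosen] = 0.0
        probs = probs / probs.sum()  # renormalize the rebuilt wheel
        idx = int(rng.choice(n, p=probs))
        chosen.append(idx)
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(chosen)


# Toy data: three tight, well-separated groups of 10 points each.
data_rng = np.random.default_rng(0)
group_means = np.array([[0.0, 0.0], [100.0, 0.0], [200.0, 0.0]])
X = np.repeat(group_means, 10, axis=0) + data_rng.normal(scale=0.5, size=(30, 2))
centers_idx = thresholded_d2_seeding(X, k=3, p_min=1.0 / len(X), seed=1)
groups_hit = set((centers_idx // 10).tolist())
```

On this toy data, a limit of `1 / len(X)` removes all points close to an existing center from the wheel, so each draw comes from a group that has no center yet. Whether the paper applies its lower limit in this way cannot be determined from the abstract alone.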