作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2023, Vol. 49 ›› Issue (6): 90-98. doi: 10.19678/j.issn.1000-3428.0064925

• 人工智能与模式识别 • 上一篇    下一篇

面向相交多流形聚类的标签传播算法

高小方, 原玉梁, 温静, 白雪飞   

  1. 山西大学 计算机与信息技术学院, 太原 030006
  • 收稿日期:2022-06-08 修回日期:2022-08-02 发布日期:2022-09-30
  • 作者简介:高小方(1978-),女,副教授、博士,主研方向为数据挖掘、机器学习;原玉梁,硕士研究生;温静、白雪飞,副教授、博士。
  • 基金资助:
    国家自然科学基金(61772323);山西省自然科学基金(201701D121053)。

Label Propagation Algorithm for Intersecting Multi-manifolds Clustering

GAO Xiaofang, YUAN Yuliang, WEN Jing, BAI Xuefei   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  • Received:2022-06-08 Revised:2022-08-02 Published:2022-09-30

摘要: 经典的流形学习算法假设样本数据位于高维单流形上,但在现实生活中的真实数据通常位于高维多流形上,且这些数据往往相互交叠,导致流形学习算法效果不佳。传统的标签传播算法通过相似性矩阵构建连接矩阵,实现良好分离数据的聚类,但不能有效聚类相互交叠的多流形数据。针对该问题,提出一种面向相交多流形的标签传播算法LPAMMC。采用局部主成分分析算法确定相交多流形数据的相交区域,并基于混合概率主成分分析(MPPCA)模型和多流形的拓扑结构划分相互交叠的子流形,构建“must-link”和“cannot-link”聚类约束,通过约束构建适合相交多流形数据的传播矩阵,实现标签传播算法。LPAMMC算法通过MPPCA模型和多流形拓扑结构划分出子流形,提高相交多流形数据的聚类精度,且MPPCA模型仅用于多流形数据的相交区域,降低了计算复杂度。实验结果表明,LPAMMC算法不仅具有标签传播算法速度快的特点,且能有效聚类相交多流形数据。在Two spirals数据集上的聚类精度、标准互信息和调整兰德系数取得了与SMMC算法相同的性能,运行时间缩短86.7个百分点。

关键词: 流形学习, 多流形聚类, 切空间, 相交, 标签传播

Abstract: The classical manifold learning algorithm assumes that the sample data is located on a high-dimensional single manifold;however,the real data in real life is located on a high-dimensional multi-manifold,and these data often overlap,resulting in poor manifold learning algorithms.The conventional label propagation algorithm builds the connection matrix through the similarity matrix to achieve clustering of well-separated data;however,it cannot effectively cluster over-lapping multi-manifold data.To address this problem,a label propagation algorithm,LPAMMC,for intersecting multi-manifold is proposed.The algorithm first uses the local Principal Component Analysis(PCA) algorithm to determine the intersection area of the intersecting multi-manifold data,and then divides the overlapping sub-manifolds based on the Mixed Probability Principal Component Analysis(MPPCA) model and the multi-manifold topology. Thereafter,it constructs "must-link" and "cannot-link" clustering constraints,and then uses the constraints to construct a propagation matrix suitable for intersecting multi-manifold data to implement the label propagation algorithm.The LPAMMC divides the sub-manifolds through the MPPCA model and the multi-manifold topology,which improves the clustering accuracy of intersecting multi-manifold data,and the MPPCA model is used for the intersection area of the multi-manifold data,reducing the computational complexity.The experimental results demonstrate that the LPAMMC algorithm has the characteristics of fast label propagation algorithm and can effectively cluster intersecting multi-manifold data,clustering accuracy,standard mutual information,and adjusted rand coefficients on the Two spirals dataset to achieve the same performance as the SMMC algorithm with an 86.7 percent points reduction in running time.

Key words: manifold learning, multi-manifolds clustering, tangent space, intersecting, label propagation

中图分类号: