
Computer Engineering, 2023, Vol. 49, Issue (11): 131-142. doi: 10.19678/j.issn.1000-3428.0067397

• Artificial Intelligence and Pattern Recognition •

Self-Paced Semi-Supervised Dimensionality Reduction for Data with Noisy Labels

Nannan GU

  1. School of Statistics, Capital University of Economics and Business, Beijing 100070, China
  • Received: 2023-04-13 Published: 2023-11-15 Online: 2023-11-08
  • About the author:

    Nannan GU (born 1985), female, professor, Ph.D., CCF member; main research interests: semi-supervised dimensionality reduction and classification

  • Supported by:
    Special Fund for Fundamental Research Funds of Beijing Municipal Universities, Capital University of Economics and Business (QNTD202109)

Abstract:

Data labeling is time-consuming and laborious, and label quality directly affects the predictive performance of a model. Based on the Self-Paced Learning (SPL) mechanism, a Self-Paced Semi-Supervised Dimensionality Reduction (SPSSDR) framework is constructed that gradually incorporates samples into training, from simple to complex. An SPSSDR algorithm is designed under this framework; following an alternating optimization strategy, it iterates between updating the dimensionality-reduction mapping and computing sample importance. On the one hand, the mapping is obtained by minimizing the weighted intra-class dispersion of the low-dimensional labeled data, together with a function-complexity regularization term in the reproducing kernel Hilbert space and a smoothness regularization term on the sparse structure graph of the data. On the other hand, following the SPL mechanism, the distance between the low-dimensional representation of each labeled sample and the anchor of its class is computed to set the importance of that sample in the next iteration. The proposed framework and algorithm are robust to label noise and adaptively yield both the importance of the labeled samples and an explicit nonlinear dimensionality-reduction mapping; the resulting low-dimensional representations are strongly separable and discriminative. On five experimental datasets with noisy labels, the nearest neighbor classification accuracy of the low-dimensional representations obtained by the proposed algorithm exceeds that of the second-best algorithm by up to 2.2, 5.6, 5.0, 11.3, and 2.7 percentage points, respectively, which verifies the effectiveness and robustness of the proposed algorithm.

Key words: semi-supervised dimensionality reduction, Self-Paced Learning (SPL), mapping, sparse representation, feature extraction
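
The sample-importance step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how class anchors or importance values are defined, so this sketch assumes that each anchor is the weighted mean of its class in the low-dimensional space and that importance follows the classic hard-threshold self-paced rule (weight 1 if the squared distance to the anchor is below a threshold, else 0). The function name `self_paced_weights` and the parameter `lam` are hypothetical.

```python
import numpy as np

def self_paced_weights(Z, y, v, lam):
    """Recompute 0/1 importance weights for labeled samples.

    Z   : (n, d) array, low-dimensional representations of labeled samples
    y   : (n,) array of integer class labels (possibly noisy)
    v   : (n,) array of current importance weights
    lam : threshold on the squared distance to the class anchor
          (assumed hard-threshold self-paced rule, not from the paper)
    """
    new_v = np.zeros(len(y))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        w = v[idx]
        if w.sum() == 0:
            # Fall back to uniform weights if every sample in the class was dropped.
            w = np.ones_like(w)
        # Assumed anchor: weighted mean of the class in the low-dimensional space.
        anchor = np.average(Z[idx], axis=0, weights=w)
        # Squared Euclidean distance of each class member to its anchor.
        loss = np.sum((Z[idx] - anchor) ** 2, axis=1)
        # "Easy" samples (close to the anchor) get weight 1, the rest 0.
        new_v[idx] = (loss < lam).astype(float)
    return new_v
```

In a full alternating scheme, these weights would be passed to the next update of the dimensionality-reduction mapping, and the threshold would typically be raised over iterations so that harder samples are admitted gradually. A sample whose label is wrong tends to lie far from the anchor of its assigned class and therefore keeps a low weight.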