
Computer Engineering ›› 2022, Vol. 48 ›› Issue (6): 57-64. doi: 10.19678/j.issn.1000-3428.0061524

• Artificial Intelligence and Pattern Recognition •

Feature Selection Method for Incomplete Data Sets Based on Probability Matrix Decomposition

FAN Linge, WU Xinrong, TONG Wei, ZENG Weijun

  1. College of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2021-04-30 Revised: 2021-07-08 Published: 2021-07-13
  • About the authors: FAN Linge (born 1997), female, master's student, whose main research interests are machine learning and data quality; WU Xinrong and TONG Wei, associate professors with master's degrees; ZENG Weijun (corresponding author), lecturer, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61802425).

Abstract: In machine learning theory and applications, feature selection is a common way to reduce the feature dimension of high-dimensional data. Most traditional feature selection methods assume complete data sets, and few studies address the missing data that are widespread in practical applications. Because incomplete data contain unobserved information and outliers, this study proposes a robust feature selection method based on probabilistic matrix factorization (PMF). First, a clustering-based PMF model is used to approximate the missing values in the data set; the model effectively measures the similarity of data between adjacent clusters, reduces the scale of the problem, and lowers the imputation error. Second, for the case in which the data with missing values contain a small number of outliers, features are selected with a method based on the l2,1 loss function. Finally, the overall feature selection procedure for incomplete data sets is given, and its convergence is analyzed theoretically. The method uses all the information in an incomplete data set and effectively handles the influence of the outliers it contains. Experimental results show that, compared with traditional feature selection methods, the proposed method selects fewer irrelevant features on synthetic data sets and reduces the influence of outliers. On real data sets, it achieves higher classification accuracy and selects more accurate features.

Key words: matrix decomposition, missing value imputation, Robust Feature Selection (RFS), incomplete data, l2,1 norm
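
The two stages named in the abstract (matrix-factorization imputation, then feature selection under an l2,1 loss) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain, unclustered PMF fitted by alternating least squares on the observed entries, and an iteratively reweighted solver for an objective of the form ||XW - Y||_{2,1} / gamma + ||W||_{2,1}. The function names `pmf_impute` and `rfs_l21`, and the parameters `rank`, `lam`, `gamma`, `iters` and `eps`, are hypothetical choices made for illustration only.

```python
import numpy as np

def pmf_impute(X, mask, rank=2, lam=0.1, iters=50, seed=0):
    """Fill missing entries of X (mask is True where observed).

    Assumption: MAP estimate of a basic PMF model, i.e. an L2-regularized
    low-rank factorization X ~ U V^T fitted on observed entries only.
    The clustering step described in the abstract is omitted here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.standard_normal((n, rank))
    V = rng.standard_normal((d, rank))
    reg = lam * np.eye(rank)
    for _ in range(iters):
        for i in range(n):            # update each row factor
            o = mask[i]
            U[i] = np.linalg.solve(V[o].T @ V[o] + reg, V[o].T @ X[i, o])
        for j in range(d):            # update each column factor
            o = mask[:, j]
            V[j] = np.linalg.solve(U[o].T @ U[o] + reg, U[o].T @ X[o, j])
    # Keep observed values, use the low-rank estimate elsewhere.
    return np.where(mask, X, U @ V.T)

def rfs_l21(X, Y, gamma=1.0, iters=30, eps=1e-8):
    """Rank features with an l2,1-norm loss and l2,1-norm regularizer.

    Assumption: iteratively reweighted solver for
        min_W ||X W - Y||_{2,1} / gamma + ||W||_{2,1},
    rewritten as min ||U||_{2,1} s.t. A U = Y with A = [X, gamma * I].
    X is (n_samples, n_features); Y is a one-hot label matrix.
    """
    n, d = X.shape
    A = np.hstack([X, gamma * np.eye(n)])
    d_inv = np.ones(d + n)            # diagonal of D^{-1}
    for _ in range(iters):
        AD = A * d_inv                # A @ diag(d_inv)
        U = AD.T @ np.linalg.solve(AD @ A.T, Y)
        d_inv = 2.0 * np.sqrt((U ** 2).sum(axis=1) + eps)
    W = U[:d]
    scores = np.linalg.norm(W, axis=1)  # row norms rank the features
    return W, scores
```

In this sketch a data matrix with missing entries would first be passed through `pmf_impute`, and the completed matrix then given to `rfs_l21`; features with the largest row norms in `W` are the ones retained. Summing row norms rather than squared errors in the loss is what limits the weight a single outlying sample can carry.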

CLC number: