
计算机工程 ›› 2023, Vol. 49 ›› Issue (9): 79-88, 98. doi: 10.19678/j.issn.1000-3428.0065611

• Artificial Intelligence and Pattern Recognition •

基于多模型融合的不完整数据分数插补算法

邵良杉1,2, 赵松泽1,*   

  1. 辽宁工程技术大学 软件学院, 辽宁 葫芦岛 125105
    2. 辽宁工程技术大学 系统工程研究所, 辽宁 葫芦岛 125105
  • 收稿日期:2022-08-28 出版日期:2023-09-15 发布日期:2023-09-14
  • 通讯作者: 赵松泽
  • About the author:

    Liangshan SHAO (born 1961), male, professor, Ph.D., doctoral supervisor; his main research interests include mine systems engineering and data mining

  • Supported by:
    National Natural Science Foundation of China (71771111)

Fractional Imputation Algorithm for Incomplete Data Based on Multi-Model Fusion

Liangshan SHAO1,2, Songze ZHAO1,*   

  1. School of Software, Liaoning Technical University, Huludao 125105, Liaoning, China
    2. Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China
  • Received:2022-08-28 Online:2023-09-15 Published:2023-09-14
  • Contact: Songze ZHAO

摘要:

缺失数据插补是从不完整数据集中进行数据挖掘的重要步骤,现有插补算法无法有效利用高缺失率的样本,存在等效处理缺失率不同的样本、假设缺失数据与完整数据同分布问题。构建基于多模型融合的不完整数据分数插补算法FIB。根据噪声标签学习,提出新的样本评分方式,以输出样本分数,通过建立机器学习模型将该分数作为分数样本权重,减小不可靠样本对模型性能的影响,并借鉴伪标签技术,使用高缺失率样本生成伪标签数据。将伪标签数据扩充至插补结果,形成待合并的单元插补结果,利用多个插补算法将单元插补结果融合生成最终插补结果。在12个公开UCI数据集上的实验结果表明,相比传统插补算法,使用样本评分、生成伪标签数据及多模型融合这3种新技术使插补效果分别平均相对提升2.35%、5.89%及7.78%,相比DIM,FIB的平均准确率相对提升8.39%。此外,随着模型个数的增加, 插补效果也会相应增加,对于分类任务,5个模型融合的插补效果比2个模型的准确率平均相对提升11%,对于回归任务,R2得分平均相对提升15%。

关键词: 缺失数据插补, 多模型融合, 伪标签, 噪声标签学习, 数据挖掘

Abstract:

Missing data imputation is an important step in data mining from incomplete datasets. Existing imputation algorithms cannot effectively utilize samples with high missing rates: they treat samples with different missing rates equivalently and assume that missing data and complete data follow the same distribution. To address these problems, a fractional imputation algorithm for incomplete data based on multi-model fusion, termed FIB, is constructed. Drawing on noisy-label learning, a new sample scoring method is proposed to output a score for each sample; a machine learning model then uses this score as a fractional sample weight, reducing the influence of unreliable samples on model performance. Borrowing from pseudo-labeling techniques, samples with high missing rates are used to generate pseudo-label data. The pseudo-label data are appended to the imputation results to form unit imputation results to be merged, and the unit imputation results of multiple imputation algorithms are fused to produce the final imputation result. Experimental results on 12 public UCI datasets show that, compared with traditional imputation algorithms, the three new techniques, namely sample scoring, pseudo-label data generation, and multi-model fusion, yield average relative improvements of 2.35%, 5.89%, and 7.78%, respectively, and that FIB improves the average accuracy by 8.39% relative to DIM. In addition, the imputation performance improves as the number of fused models increases: for classification tasks, five-model fusion improves the average accuracy by 11% relative to two-model fusion, and for regression tasks, the R² score improves by 15% on average.
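The per-sample side of the pipeline, that is, scoring samples, weighting a learner by those scores, and pseudo-labeling high-missing-rate samples, can be illustrated with a minimal sketch. The missing-rate-based score, the 0.5 reliability threshold, and the mean pre-imputation below are stand-in assumptions for illustration only; the paper derives its scores from noisy-label learning, and this is not the authors' code.

```python
# Hedged sketch: score samples by a reliability proxy, train on the reliable
# ones with the scores as sample_weight, and pseudo-label the rest.
# The scoring rule and threshold are illustrative assumptions, not FIB itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # labels before injecting misses
X[rng.random(X.shape) < 0.3] = np.nan        # ~30% missing values

scores = 1.0 - np.isnan(X).mean(axis=1)      # fractional sample scores
reliable = scores >= 0.5                     # low-missing-rate samples

X_filled = SimpleImputer(strategy="mean").fit_transform(X)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_filled[reliable], y[reliable], sample_weight=scores[reliable])

# High-missing-rate samples receive pseudo-labels instead of being discarded.
pseudo_labels = clf.predict(X_filled[~reliable])
```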

Key words: missing data imputation, multi-model fusion, pseudo-label, noisy label learning, data mining
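The fusion side of the abstract, producing unit imputation results with several imputers and merging them into a final result, is sketched below under similarly loose assumptions. The equal-weight averaging rule and the helper names score_samples and fuse_imputations are ours, not the authors'; the paper additionally expands the pool with pseudo-label data before fusing, which this sketch omits.

```python
# Hedged sketch of multi-model fusion imputation: each imputer produces a
# unit imputation result, and the results are fused by (weighted) averaging.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

def score_samples(X):
    """Fractional sample scores in (0, 1]: lower for higher missing rates."""
    return 1.0 - np.isnan(X).mean(axis=1)

def fuse_imputations(X, imputers, weights=None):
    """Impute X with each model and fuse the unit results for missing cells."""
    unit_results = [imp.fit_transform(X) for imp in imputers]
    fused = np.average(np.stack(unit_results), axis=0, weights=weights)
    # Keep originally observed entries untouched; only missing cells are fused.
    return np.where(np.isnan(X), fused, X)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.2] = np.nan    # inject ~20% missing values

    scores = score_samples(X)                # fractional sample weights
    imputers = [SimpleImputer(strategy="mean"),
                KNNImputer(n_neighbors=5),
                IterativeImputer(random_state=0)]
    X_imputed = fuse_imputations(X, imputers)
    print(np.isnan(X_imputed).sum(), "missing values remain")
```

In the paper's setting, the fused models would also see the fractional sample weights and the pseudo-label data described above; a per-model reliability vector could likewise be passed through the weights argument rather than the equal weighting used here.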