
计算机工程 ›› 2023, Vol. 49 ›› Issue (9): 79-88, 98. doi: 10.19678/j.issn.1000-3428.0065611

• Artificial Intelligence and Pattern Recognition •

基于多模型融合的不完整数据分数插补算法

邵良杉1,2, 赵松泽1,*   

  1. 辽宁工程技术大学 软件学院, 辽宁 葫芦岛 125105
    2. 辽宁工程技术大学 系统工程研究所, 辽宁 葫芦岛 125105
  • 收稿日期:2022-08-28 出版日期:2023-09-15 发布日期:2023-09-14
  • 通讯作者: 赵松泽
  • About the author:

    Liangshan SHAO (born 1961), male, professor, Ph.D., doctoral supervisor; his main research interests include mine systems engineering and data mining

  • Supported by:
    National Natural Science Foundation of China (71771111)

Fractional Imputation Algorithm for Incomplete Data Based on Multi-Model Fusion

Liangshan SHAO1,2, Songze ZHAO1,*   

  1. School of Software, Liaoning Technical University, Huludao 125105, Liaoning, China
    2. Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China
  • Received:2022-08-28 Online:2023-09-15 Published:2023-09-14
  • Contact: Songze ZHAO

摘要:

缺失数据插补是从不完整数据集中进行数据挖掘的重要步骤,现有插补算法无法有效利用高缺失率的样本,存在等效处理缺失率不同的样本、假设缺失数据与完整数据同分布问题。构建基于多模型融合的不完整数据分数插补算法FIB。根据噪声标签学习,提出新的样本评分方式,以输出样本分数,通过建立机器学习模型将该分数作为分数样本权重,减小不可靠样本对模型性能的影响,并借鉴伪标签技术,使用高缺失率样本生成伪标签数据。将伪标签数据扩充至插补结果,形成待合并的单元插补结果,利用多个插补算法将单元插补结果融合生成最终插补结果。在12个公开UCI数据集上的实验结果表明,相比传统插补算法,使用样本评分、生成伪标签数据及多模型融合这3种新技术使插补效果分别平均相对提升2.35%、5.89%及7.78%,相比DIM,FIB的平均准确率相对提升8.39%。此外,随着模型个数的增加, 插补效果也会相应增加,对于分类任务,5个模型融合的插补效果比2个模型的准确率平均相对提升11%,对于回归任务,R2得分平均相对提升15%。

关键词: 缺失数据插补, 多模型融合, 伪标签, 噪声标签学习, 数据挖掘

Abstract:

Missing data imputation is an important step in data mining from incomplete datasets. Existing imputation algorithms cannot effectively utilize samples with high missing rates: they treat samples with different missing rates equivalently and assume that missing data and complete data follow the same distribution. To address these problems, a fractional imputation algorithm for incomplete data based on multi-model fusion, termed FIB, is constructed. Drawing on noisy-label learning, a new sample scoring method is proposed to output a score for each sample; a machine learning model then uses this score as a fractional sample weight, reducing the influence of unreliable samples on model performance. Borrowing from pseudo-labeling techniques, samples with high missing rates are used to generate pseudo-label data. The pseudo-label data are appended to the imputation results to form unit imputation results to be merged, and the unit imputation results of multiple imputation algorithms are fused to produce the final imputation result. Experimental results on 12 public UCI datasets show that, compared with traditional imputation algorithms, the three new techniques, namely sample scoring, pseudo-label data generation, and multi-model fusion, yield average relative improvements of 2.35%, 5.89%, and 7.78%, respectively, and that FIB improves the average accuracy by 8.39% relative to DIM. In addition, the imputation performance improves as the number of fused models increases: for classification tasks, five-model fusion improves the average accuracy by 11% relative to two-model fusion, and for regression tasks, the R² score improves by 15% on average.
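The per-sample side of the pipeline, that is, scoring samples, weighting a learner by those scores, and pseudo-labeling high-missing-rate samples, can be illustrated with a minimal sketch. The missing-rate-based score, the 0.5 reliability threshold, and the mean pre-imputation below are stand-in assumptions for illustration only; the paper derives its scores from noisy-label learning, and this is not the authors' code.

```python
# Hedged sketch: score samples by a reliability proxy, train on the reliable
# ones with the scores as sample_weight, and pseudo-label the rest.
# The scoring rule and threshold are illustrative assumptions, not FIB itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # labels before injecting misses
X[rng.random(X.shape) < 0.3] = np.nan        # ~30% missing values

scores = 1.0 - np.isnan(X).mean(axis=1)      # fractional sample scores
reliable = scores >= 0.5                     # low-missing-rate samples

X_filled = SimpleImputer(strategy="mean").fit_transform(X)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_filled[reliable], y[reliable], sample_weight=scores[reliable])

# High-missing-rate samples receive pseudo-labels instead of being discarded.
pseudo_labels = clf.predict(X_filled[~reliable])
```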

Key words: missing data imputation, multi-model fusion, pseudo-label, noisy label learning, data mining
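The fusion side of the abstract, producing unit imputation results with several imputers and merging them into a final result, is sketched below under similarly loose assumptions. The equal-weight averaging rule and the helper names score_samples and fuse_imputations are ours, not the authors'; the paper additionally expands the pool with pseudo-label data before fusing, which this sketch omits.

```python
# Hedged sketch of multi-model fusion imputation: each imputer produces a
# unit imputation result, and the results are fused by (weighted) averaging.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

def score_samples(X):
    """Fractional sample scores in (0, 1]: lower for higher missing rates."""
    return 1.0 - np.isnan(X).mean(axis=1)

def fuse_imputations(X, imputers, weights=None):
    """Impute X with each model and fuse the unit results for missing cells."""
    unit_results = [imp.fit_transform(X) for imp in imputers]
    fused = np.average(np.stack(unit_results), axis=0, weights=weights)
    # Keep originally observed entries untouched; only missing cells are fused.
    return np.where(np.isnan(X), fused, X)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.2] = np.nan    # inject ~20% missing values

    scores = score_samples(X)                # fractional sample weights
    imputers = [SimpleImputer(strategy="mean"),
                KNNImputer(n_neighbors=5),
                IterativeImputer(random_state=0)]
    X_imputed = fuse_imputations(X, imputers)
    print(np.isnan(X_imputed).sum(), "missing values remain")
```

In the paper's setting, the fused models would also see the fractional sample weights and the pseudo-label data described above; a per-model reliability vector could likewise be passed through the weights argument rather than the equal weighting used here.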