Clustering-based Missing Value Filling Method for Continuous Data

Computer Engineering, 2019, Vol. 45, Issue (9): 32-39. doi: 10.19678/j.issn.1000-3428.0053331

LI Guohe^1a,1b, YANG Shaowei^1a,1b, WU Weijiang^1a,1b, ZHENG Yifeng^1a,1b,2a,2b

1a. Beijing Key Lab of Petroleum Data Mining; 1b. College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249, China;
2a. Key Laboratory of Data Science and Intelligence Application; 2b. School of Computer Sciences, Minnan Normal University, Zhangzhou, Fujian 363000, China

Received: 2018-12-06  Revised: 2019-02-22  Online: 2019-09-15  Published: 2019-09-03
About the authors: LI Guohe (born 1965), male, professor, Ph.D., doctoral supervisor, whose main research interests are artificial intelligence, big data, and computer graphics; YANG Shaowei, master's student; WU Weijiang, associate professor, Ph.D. candidate; ZHENG Yifeng, Ph.D. candidate.
Funding:
National Natural Science Foundation of China (61701213); sub-project of the National Oil and Gas Key Special Project (G-5800-08-ZS-WX); Research Start-up Fund of China University of Petroleum (Beijing), Karamay Campus (RCYJ2016B-03-001); Young and Middle-aged Fund of the Fujian Provincial Department of Education (JA15300).
Supported by:
This work is supported by Science and Technology Project of SGCC (No.SGSHJY00BGJS1400221).

Abstract

In big data applications, most modeling methods assume a complete data set, but values are easily lost during data acquisition or storage, which can make modeling impossible. To address this, a clustering-based recursive filling method is proposed. The incomplete data are first pre-filled with the mean of the same cluster to form an initial complete data set. The completed data are then clustered, and the initial fill values are corrected using the mean of the same cluster. Filling stability is judged from the filling error, and the fill values are corrected through repeated recursive clustering until two consecutive fills are sufficiently stable or the number of iterations exceeds a threshold. Experimental results show that, compared with mean filling, K-nearest-neighbor filling, cluster filling, and rough-set-based incomplete data analysis, the method fills more precisely, so that the final filled values are closer to the real data.

Key words: missing value, pre-filling, clustering, recursive filling, square error
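
The loop described in the abstract (pre-fill, cluster, correct with cluster means, repeat until stable) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no details, so the sketch assumes plain column means for the pre-fill step (a stand-in for the cluster-based pre-fill), a simple k-means with deterministic farthest-point initialisation as the clustering step, and the squared change between two consecutive fills as the stability measure. The names `recursive_cluster_fill`, `kmeans`, `init_centroids`, `k`, `tol`, and `max_iter` are all hypothetical.

```python
import numpy as np


def init_centroids(X, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Squared distance of each row to its nearest already-chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = init_centroids(X, k)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids


def recursive_cluster_fill(X, k=2, tol=1e-6, max_iter=20):
    """Fill NaN entries by repeatedly clustering and using cluster means."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Pre-fill. Assumption: column means stand in for the paper's pre-fill.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    n_done = 0
    for n_done in range(1, max_iter + 1):
        labels, centroids = kmeans(filled, k)
        # Correct only the missing cells with the mean of the row's cluster.
        corrected = np.where(missing, centroids[labels], filled)
        # Stability measure (assumed): squared change between consecutive fills.
        change = ((corrected - filled)[missing] ** 2).sum()
        filled = corrected
        if change < tol:
            break
    return filled, n_done


if __name__ == "__main__":
    data = np.array([
        [1.0, 1.0, 1.0],
        [1.2, 0.8, 1.1],
        [0.9, np.nan, 1.0],
        [10.0, 10.0, 10.0],
        [10.2, 9.8, 10.1],
        [np.nan, 10.1, 9.9],
    ])
    result, iterations = recursive_cluster_fill(data, k=2)
    print(iterations)
    print(np.round(result, 3))
```

In this toy data set the two missing cells start at the column means, which lie between the two groups, and each pass pulls them toward the mean of the observed values in their own cluster. Observed cells are never modified. The number of clusters, the tolerance, and the iteration cap are illustrative choices, not values taken from the paper.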

CLC number: TP301.6

LI Guohe, YANG Shaowei, WU Weijiang, ZHENG Yifeng. Clustering-based Missing Value Filling Method for Continuous Data[J]. Computer Engineering, 2019, 45(9): 32-39.

https://www.ecice06.com/CN/Y2019/V45/I9/32

Figures/Tables: 13

