Computer Engineering ›› 2019, Vol. 45 ›› Issue (9): 32-39. doi: 10.19678/j.issn.1000-3428.0053331

• Advanced Computing and Data Processing •

Clustering-based Missing Value Filling Method for Continuous Data

LI Guohe1a,1b, YANG Shaowei1a,1b, WU Weijiang1a,1b, ZHENG Yifeng1a,1b,2a,2b

  1. China University of Petroleum (Beijing): a. Beijing Key Lab of Petroleum Data Mining; b. College of Geophysics and Information Engineering, Beijing 102249, China;
  2. Minnan Normal University: a. Key Laboratory of Data Science and Intelligence Application; b. School of Computer Sciences, Zhangzhou, Fujian 363000, China
  • Received: 2018-12-06 Revised: 2019-02-22 Online: 2019-09-15 Published: 2019-09-03
  • About the authors: LI Guohe (born 1965), male, professor, Ph.D., doctoral supervisor, whose main research interests are artificial intelligence, big data, and computer graphics technology; YANG Shaowei, master's student; WU Weijiang, associate professor, Ph.D. candidate; ZHENG Yifeng, Ph.D. candidate.
  • Funding:
    National Natural Science Foundation of China (61701213); sub-project of the National Key Oil and Gas Special Project (G-5800-08-ZS-WX); Research Start-up Fund of China University of Petroleum (Beijing), Karamay Campus (RCYJ2016B-03-001); Young and Middle-aged Fund of the Fujian Provincial Department of Education (JA15300).
  • Supported by:
    This work is supported by Science and Technology Project of SGCC (No.SGSHJY00BGJS1400221).

Abstract: In big data applications, most modeling methods assume a complete data set, but values are easily lost during data acquisition or storage, which makes modeling impossible. To address this, a clustering-based recursive filling method is proposed. The incomplete data are first pre-filled with the means of the same cluster to form an initial complete data set. The resulting complete data are then clustered, and the initial fill values are corrected using the means of the same cluster. Filling stability is judged from the deviation of the filling results, and the fill values are corrected through repeated recursive clustering until two consecutive fills are stable or the number of iterations exceeds a threshold. Experimental results show that, compared with mean filling, K-nearest-neighbor filling, cluster filling, and rough-set incomplete data analysis, the method fills more precisely, so that the final filled values are closer to the real data.

Key words: missing value, prefilling, clustering, recursive filling, square error
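
The iterative procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-point initialization is assumed; the pre-fill is simplified to column means of the observed values (the paper pre-fills with same-cluster means); and the stability test is taken to be the squared change between two successive fills. The names `init_centroids`, `kmeans`, `recursive_cluster_fill`, `k`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np


def init_centroids(X, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        # Squared distance from each row to its nearest chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means (assumed here); returns (labels, centroids)."""
    centroids = init_centroids(X, k)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def recursive_cluster_fill(X, k=2, tol=1e-6, max_iter=20):
    """Fill NaN entries by repeated clustering; returns (filled, iterations).

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Pre-fill (simplifying assumption): column means of observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for it in range(1, max_iter + 1):
        labels, centroids = kmeans(filled, k)
        # Correct only the missing entries with their cluster's mean.
        new_filled = np.where(missing, centroids[labels], filled)
        # Squared change between two successive fills.
        change = ((new_filled - filled) ** 2).sum()
        filled = new_filled
        if change < tol:
            break
    return filled, it
```

Calling `recursive_cluster_fill(X, k=2)` on an array with `np.nan` at the missing positions returns the filled array and the number of iterations used. Observed entries are never modified; only the missing positions are updated in each pass, and the loop stops once successive fills differ by less than `tol` or `max_iter` is reached.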

CLC Number: