Computer Engineering

• Architecture and Software Technology •

Missing Data Filling Algorithm Based on Incomplete Set Biclustering

HAN Fei a, SHEN Zhenlin b

  1. (a. College of Information Science and Technology; b. Office of Information Management, Jinan University, Guangzhou 510632, China)
  • Received: 2015-06-02 Online: 2016-04-15 Published: 2016-04-15
  • About the authors: HAN Fei (born 1990), male, master's degree, main research interests: big data analysis and data cleaning; SHEN Zhenlin, professor-level senior engineer.
  • Supported by:
    Guangdong Province High-Tech Industrialization Fund (2011B080701046).

Abstract: Missing data filling is an important problem in the field of data cleaning. Most local filling methods classify objects on the basis of all attributes and do not consider the correlations among object attributes. To address this, this paper proposes a missing data filling algorithm based on incomplete set biclustering. The algorithm fills missing data by exploiting two properties of a perfect bicluster: its mean squared residue is 0, and the attribute values within the cluster fluctuate consistently. Through mathematical analysis, the problem of finding the maximum perfect cluster that contains a missing value is transformed into the problem of finding the maximum similarity attribute set between the object with the missing value and the other objects; under the same maximum similarity attribute set, the mode of the values for the missing attribute is taken as the filling value. Experiments on 4 UCI data sets show that, compared with the ROUSTIDA algorithm, the proposed algorithm improves the accuracy of the filled values by 77.13% on average.

Key words: missing data filling, incomplete set, biclustering, maximum similarity attribute set, data cleaning, perfect cluster
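The perfect-cluster property named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard additive definition of the mean squared residue, in which each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean. The function names `mean_squared_residue`, `fill_candidates`, and `fill_missing` are hypothetical, and the sketch is not the paper's procedure: it does not search for the maximum similarity attribute set, it only shows why a zero residue lets a single missing entry be inferred from the other entries of the cluster.

```python
from statistics import mode


def mean_squared_residue(block):
    """Mean squared residue of a fully observed submatrix (list of rows)."""
    n_rows, n_cols = len(block), len(block[0])
    row_mean = [sum(row) / n_cols for row in block]
    col_mean = [sum(block[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = block[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


def fill_candidates(block, i, j):
    """Candidate values for the single missing entry block[i][j].

    Assumption: in a perfect additive cluster every 2x2 sub-block satisfies
    a[i][j] = a[i][k] + a[l][j] - a[l][k], so each other row l and other
    column k yields one candidate.
    """
    candidates = []
    for l in range(len(block)):
        if l == i:
            continue
        for k in range(len(block[0])):
            if k == j:
                continue
            candidates.append(block[i][k] + block[l][j] - block[l][k])
    return candidates


def fill_missing(block, i, j):
    """Fill block[i][j] with the mode of its candidates (illustrative choice)."""
    return mode(fill_candidates(block, i, j))
```

For the block `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` the residue is 0, and if the entry in the second row and third column is replaced by `None`, `fill_missing(block, 1, 2)` recovers 6, because every candidate agrees. On a block that is only approximately perfect the candidates would differ, which is where taking the mode becomes a meaningful choice.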

CLC number: