Missing Data Filling Algorithm Based on Incomplete Set Biclustering

doi:10.3969/j.issn.1000-3428.2016.04.004

Computer Engineering

HAN Fei^a, SHEN Zhenlin^b

(a. College of Information Science and Technology; b. Office of Information Management, Jinan University, Guangzhou 510632, China)

Received: 2015-06-02 Online: 2016-04-15 Published: 2016-04-15
About the authors: HAN Fei (born 1990), male, master's degree, main research interests: big data analysis and data cleaning; SHEN Zhenlin, professor-level senior engineer.
Funding: Guangdong Province High-Tech Industrialization Fund Project (2011B080701046).

Abstract

Missing data filling is an important problem in the field of data cleaning. The vast majority of local filling methods classify objects on the basis of all attributes and do not consider the correlation between object attributes. This paper therefore proposes a missing data filling algorithm based on incomplete set biclustering. The algorithm fills missing data by exploiting two properties of a perfect bicluster: its mean squared residue is 0, and the attribute values within the cluster fluctuate consistently. Through mathematical analysis, the problem of finding the maximum perfect cluster that contains missing values is transformed into the problem of finding the maximum similarity attribute set between the object with missing values and the other objects. Under the same maximum similarity attribute set, the mode of the missing values is taken as the filling value. Experiments on 4 UCI data sets show that, compared with the ROUSTIDA algorithm, the proposed algorithm improves the accuracy of the filled values by 77.13% on average.
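The two cluster properties named in the abstract can be illustrated with a small sketch. It assumes the standard Cheng and Church definition of mean squared residue and a purely additive cluster model, in which two rows of a perfect cluster differ by a constant offset. The function names `mean_squared_residue` and `fill_from_shift` are hypothetical, and the code does not reproduce the paper's search for the maximum similarity attribute set or its mode-based selection.

```python
import numpy as np


def mean_squared_residue(block: np.ndarray) -> float:
    """Mean squared residue of a complete numeric submatrix (Cheng-Church form)."""
    row_mean = block.mean(axis=1, keepdims=True)   # a_iJ for each row i
    col_mean = block.mean(axis=0, keepdims=True)   # a_Ij for each column j
    all_mean = block.mean()                        # a_IJ for the whole block
    residue = block - row_mean - col_mean + all_mean
    return float(np.mean(residue ** 2))


def fill_from_shift(target: np.ndarray, donor: np.ndarray, missing: int) -> float:
    """Fill target[missing] from a donor row of the same additive cluster.

    Assumption (illustration only): on the known attributes the target row
    equals the donor row plus one constant offset, i.e. both rows fluctuate
    consistently. If that does not hold, the donor is rejected.
    """
    known = [j for j in range(len(target)) if j != missing]
    offsets = target[known] - donor[known]
    if not np.allclose(offsets, offsets[0]):
        raise ValueError("rows do not fluctuate consistently on known attributes")
    return float(donor[missing] + offsets[0])


# A perfect additive cluster: row 2 = row 1 + 2, row 3 = row 1 - 1.
perfect = np.array([[1.0, 4.0, 2.0],
                    [3.0, 6.0, 4.0],
                    [0.0, 3.0, 1.0]])

# A block whose rows do not differ by a constant.
imperfect = np.array([[1.0, 2.0],
                      [3.0, 5.0]])

# Hide one entry of the second row and recover it from the first row.
target = np.array([3.0, np.nan, 4.0])
filled = fill_from_shift(target, perfect[0], missing=1)
```

In this toy case the mean squared residue of `perfect` is zero up to floating-point error, that of `imperfect` is positive, and the hidden entry is recovered from the constant offset between the two rows. With several donor rows, each would yield one candidate value, which is where a mode-based choice such as the one described in the abstract could apply.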

Key words: missing data filling, incomplete set, biclustering, maximum similarity attribute set, data cleaning, perfect cluster

CLC number: TP311

HAN Fei, SHEN Zhenlin. Missing Data Filling Algorithm Based on Incomplete Set Biclustering[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2016/V42/I4/20


