
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Instance Selection Based on Redundant Instance Pair Elimination Algorithm

LIU Lu, GAO Qiang, LIU Yan-heng, SUN Xin

  1. (College of Computer Science and Technology, Jilin University, Changchun 130012, China)
  • Received: 2013-01-15 Online: 2014-01-15 Published: 2014-01-13
  • About the authors: LIU Lu (born 1988), female, master's student, research interests: machine learning and network security; GAO Qiang, professor; LIU Yan-heng, professor, Ph.D., doctoral supervisor; SUN Xin, doctoral student
  • Funding:
    National Natural Science Foundation of China (60973136)

Abstract: Instance selection is an effective way to remove noise and redundant data, but existing methods struggle to improve generalization ability and reduce the dataset at the same time. To address this imbalance, this paper proposes a new instance selection method, the Redundant Instance Pair Elimination (RIPE) algorithm. It introduces the concept of the nearest similar instance pair (the nearest pair of instances of the same class), computes the nearest similar instance pairs in a dataset, and removes the instances that satisfy the elimination condition. Simulation experiments on 11 different datasets show that datasets processed by the algorithm achieve clearly better classification accuracy and storage compression ratio than the original datasets. Compared with the Edited Nearest Neighbor rule (ENN) algorithm, RIPE maintains classification accuracy while improving the average storage compression ratio by more than 35%, fully preserves the data distribution characteristics of the original datasets, and strikes a better compromise between classification accuracy and storage compression ratio.

Key words: instance selection, nearest similar instance pair, k nearest neighbor, Edited Nearest Neighbor rule (ENN) algorithm, data reduction, machine learning
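
The abstract does not define the nearest similar instance pair formally, nor does it state the condition under which an instance is removed. The following is therefore only a minimal sketch under stated assumptions, not the authors' RIPE implementation. It assumes a pair consists of two same-class instances that are each other's nearest same-class neighbor under Euclidean distance, and it deliberately stops at finding the pairs. For comparison it also sketches the ENN baseline in its commonly described form (an instance is kept only if the majority label among its k nearest neighbors matches its own). The function names and the definition of compression ratio as the fraction of removed instances are illustrative assumptions.

```python
from collections import Counter
from math import dist


def nearest_same_class_pairs(X, y):
    """Return sorted index pairs (i, j), i < j, of same-class instances that
    are each other's nearest same-class neighbor.
    ASSUMPTION: this mutual-nearest definition is illustrative only."""
    def nearest_same(i):
        candidates = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        if not candidates:
            return None  # the only instance of its class
        return min(candidates, key=lambda j: dist(X[i], X[j]))

    nearest = [nearest_same(i) for i in range(len(X))]
    return sorted(
        (i, j) for i, j in enumerate(nearest)
        if j is not None and i < j and nearest[j] == i
    )


def enn_filter(X, y, k=3):
    """ENN baseline: return indices of instances whose label agrees with the
    majority label of their k nearest neighbors (the instance itself excluded).
    Majority ties go to the label seen first among the nearest neighbors."""
    keep = []
    for i in range(len(X)):
        others = (j for j in range(len(X)) if j != i)
        neighbors = sorted(others, key=lambda j: dist(X[i], X[j]))[:k]
        majority = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep


def compression_ratio(n_original, n_kept):
    """ASSUMPTION: compression ratio taken as the fraction of removed instances."""
    return 1.0 - n_kept / n_original
```

A possible use is to call `enn_filter(X, y)` on a list of feature tuples `X` and labels `y`, then pass the original and retained counts to `compression_ratio`. Both functions use brute-force search with quadratic cost, which is adequate for illustration but not for large datasets.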

CLC number: