
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Oversampling Algorithm Oriented to Subdivision of Minority Class in Imbalanced Data Set

GU Ping, YANG Yang

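  1. (School of Computer Science, Chongqing University, Chongqing 400044)
  • Received: 2016-01-22 Online: 2017-02-15 Published: 2017-02-15
  • About the authors: GU Ping (born 1976), male, associate professor, main research interests are data mining and machine learning; YANG Yang, master's student.
  • Funding:
    Fundamental Research Funds for the Central Universities (106112013CDJZR180014); Natural Science Foundation of Chongqing (cstc2012jjA40002).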

Oversampling Algorithm Oriented to Subdivision of Minority Class in Imbalanced Data Set

GU Ping, YANG Yang

  1. (School of Computer Science, Chongqing University, Chongqing 400044, China)
  • Received: 2016-01-22 Online: 2017-02-15 Published: 2017-02-15

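Abstract: In imbalanced data sets, minority class samples differ in how they are distributed relative to the decision boundary, yet traditional oversampling algorithms usually do not treat these differences separately. To address this, an oversampling algorithm for imbalanced data sets, SD-ISMOTE, is proposed. Based on the k-nearest-neighbor distribution of each minority class sample, the algorithm subdivides the samples into three sets, DANGER, AL_SAFE, and SAFE, where the samples in DANGER and AL_SAFE lie closer to the decision boundary. Drawing on the idea of ISMOTE, it interpolates randomly within an n-dimensional ball to expand the oversampling range of these two kinds of samples, and it introduces a roulette wheel selection algorithm for sampling selection so that newly generated samples are not redundant. Experimental results show that, with the C4.5 and naive Bayes classifiers, SD-ISMOTE improves classification performance to varying degrees compared with Borderline-SMOTE and ISMOTE, and can effectively address imbalanced sample distribution in data sets.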

Keywords: imbalanced data set, decision boundary, classification, random interpolation, subdivision of minority class
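
The subdivision step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the thresholds `danger_min` and `al_safe_min`, and the rule of bucketing by the share of majority-class neighbors are all hypothetical, since the abstract says only that the split depends on each sample's k-nearest-neighbor distribution. Samples whose neighbors are all majority class are simply placed in DANGER here.

```python
import math

def k_nearest(idx, points, k):
    """Indices of the k nearest neighbours of points[idx] (Euclidean)."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def subdivide_minority(points, labels, minority=1, k=5,
                       danger_min=0.5, al_safe_min=0.2):
    """Assign each minority sample to DANGER, AL_SAFE or SAFE.

    danger_min and al_safe_min are hypothetical thresholds on the
    fraction of majority-class neighbours; the abstract gives no values.
    """
    groups = {"DANGER": [], "AL_SAFE": [], "SAFE": []}
    for i, label in enumerate(labels):
        if label != minority:
            continue  # only minority samples are subdivided
        neighbours = k_nearest(i, points, k)
        majority_share = sum(labels[j] != minority for j in neighbours) / k
        if majority_share >= danger_min:
            groups["DANGER"].append(i)    # mostly majority neighbours
        elif majority_share >= al_safe_min:
            groups["AL_SAFE"].append(i)   # some majority neighbours
        else:
            groups["SAFE"].append(i)      # (almost) no majority neighbours
    return groups
```

Calling `subdivide_minority(points, labels, k=3)` returns a dictionary that maps each set name to the indices of the minority samples assigned to it.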

Abstract: In an imbalanced data set, minority class samples are distributed differently relative to the decision boundary, but traditional oversampling algorithms do not treat these differences separately. To address this, this paper proposes SD-ISMOTE, an oversampling algorithm oriented to the subdivision of minority class samples in imbalanced data sets. The algorithm divides the minority class samples into three subsets, DANGER, AL_SAFE, and SAFE, according to the distribution of their k-nearest neighbors; samples in DANGER and AL_SAFE are closer to the decision boundary. Following the idea of ISMOTE, the algorithm performs random interpolation within an n-dimensional ball, which expands the sampling range of the samples in DANGER and AL_SAFE. In addition, it introduces roulette wheel selection to avoid redundancy among the generated samples. Experimental results show that, compared with the Borderline-SMOTE and ISMOTE algorithms, SD-ISMOTE effectively reduces the degree of imbalance in the data set distribution and yields better classification performance on imbalanced data sets with the C4.5 and naive Bayes classifiers.

Key words: imbalanced data set, decision boundary, classification, random interpolation, subdivision of minority class
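
The two sampling ideas named in the abstract, random interpolation within an n-dimensional ball and roulette wheel selection, can be illustrated with the sketch below. It is one possible reading rather than the published procedure: taking the ball radius as the distance from a seed sample to one chosen neighbor, drawing points uniformly over the ball's volume, and the per-seed `weights` are all assumptions, and every function name is hypothetical.

```python
import math
import random

def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the n-ball around `center`."""
    n = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    # u ** (1/n) makes the density uniform over the ball's volume.
    scale = radius * rng.random() ** (1.0 / n) / norm
    return [c + scale * d for c, d in zip(center, direction)]

def roulette_select(weights, rng):
    """Return index i with probability weights[i] / sum(weights)."""
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if threshold < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off

def oversample(seeds, neighbours, weights, n_new, seed=0):
    """Generate n_new synthetic points.

    seeds:      minority samples to oversample
    neighbours: one neighbour per seed; its distance sets the ball radius
    weights:    hypothetical per-seed selection weights for the roulette
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = roulette_select(weights, rng)
        radius = math.dist(seeds[i], neighbours[i])
        synthetic.append(sample_in_ball(seeds[i], radius, rng))
    return synthetic
```

Unlike interpolation along the line segment between a sample and its neighbor, every point inside the ball can be generated, which is one way to read the "expanded sampling range" in the abstract. Seeds with larger weights are picked more often, and a seed with zero weight is never picked.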

CLC Number: