Improved under-sampling method and its application in the classification of imbalanced data sets

Computer Engineering, 2019, Vol. 45, Issue (6): 218-224. doi: 10.19678/j.issn.1000-3428.0050618

NIU Zhuang, LI Fenglian, ZHANG Xueying, FAN Yuzhou, WEI Xin

College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China

Received: 2018-03-05 Online: 2019-06-15 Published: 2019-06-15
About the authors: NIU Zhuang (born 1991), male, master's student, main research interest: data mining; LI Fenglian (corresponding author), professor, Ph.D.; ZHANG Xueying, professor, Ph.D., doctoral supervisor; FAN Yuzhou and WEI Xin, master's students.
Funding:
Natural Science Foundation of Shanxi Province (201801D121138); Key Research and Development Program of Shanxi Province (201803D31045); Major Science and Technology Project of Shanxi Province (20181102008).

Abstract

When classifying imbalanced data sets, under-sampling methods do not fully account for the effect that changes in the data distribution have on classification results. To address this, an improved under-sampling method based on clustering fusion and redundancy removal is proposed. A clustering algorithm is used to obtain the cluster centers of the high-density regions of the majority-class samples, and the majority-class samples are partitioned into subsets. A similarity redundancy coefficient is then computed for each subset and used to delete redundant majority-class samples, which achieves the under-sampling. After under-sampling 15 data sets with different balance rates, a cost-sensitive hybrid-attribute multi-decision-tree model is used for classification. Experimental results show that, without reducing classification accuracy on the imbalanced data sets, the method improves both the true positive rate on minority-class samples and the G-mean of the prediction model.

Key words: imbalanced data sets, clustering algorithm, under-sampling, redundancy removal, multi-decision tree prediction model
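
The pipeline described in the abstract (cluster the majority class, remove redundant samples within each subset, then evaluate with G-mean) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define the similarity redundancy coefficient, so the sketch substitutes a simple stand-in: a majority-class sample is treated as redundant when it lies within a hypothetical distance threshold `min_dist` of a sample already kept from the same cluster. The plain k-means routine, the function names, and all parameters are likewise assumptions made for illustration. The `g_mean` helper uses the standard definition, the geometric mean of the true positive rate and the true negative rate.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def undersample_majority(majority, k=2, min_dist=0.5, seed=0):
    """Cluster the majority class, then drop near-duplicates in each cluster.

    ASSUMPTION: a sample is redundant when it lies within `min_dist` of a
    sample already kept from the same cluster. The paper's actual redundancy
    coefficient is not given in the abstract.
    """
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        kept_in_cluster = []
        for p, lab in zip(majority, labels):
            if lab == c and all(math.dist(p, q) >= min_dist
                                for q in kept_in_cluster):
                kept_in_cluster.append(p)
        kept.extend(kept_in_cluster)
    return kept


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and the true negative rate."""
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return math.sqrt((tp / pos) * (tn / neg))


if __name__ == "__main__":
    # Two dense blobs of majority-class samples (made-up toy data).
    majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    print(undersample_majority(majority))
    print(g_mean([1, 1, 0, 0], [1, 0, 0, 0]))
```

G-mean is a natural metric here because overall accuracy can stay high even when every minority-class sample is misclassified, whereas G-mean falls to zero in that case.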

CLC number:

TP391.4

NIU Zhuang, LI Fenglian, ZHANG Xueying, FAN Yuzhou, WEI Xin. Improved under-sampling method and its application in the classification of imbalanced data sets[J]. Computer Engineering, 2019, 45(6): 218-224.

http://www.ecice06.com/CN/Y2019/V45/I6/218

