Computer Engineering, 2019, Vol. 45, Issue (6): 218-224. doi: 10.19678/j.issn.1000-3428.0050618

• Artificial Intelligence and Recognition Technology •

Improved under-sampling method and its application in the classification of imbalanced data sets

NIU Zhuang, LI Fenglian, ZHANG Xueying, FAN Yuzhou, WEI Xin

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
  • Received: 2018-03-05 Online: 2019-06-15 Published: 2019-06-15
  • About the authors: NIU Zhuang (born 1991), male, master's student, main research interest: data mining; LI Fenglian (corresponding author), professor, Ph.D.; ZHANG Xueying, professor, Ph.D., doctoral supervisor; FAN Yuzhou and WEI Xin, master's students.
  • Funding:

    Natural Science Foundation of Shanxi Province (201801D121138); Key Research and Development Program of Shanxi Province (201803D31045); Science and Technology Major Project of Shanxi Province (20181102008).

Abstract:

When applied to imbalanced data sets, existing under-sampling methods do not fully account for the effect that changes in data distribution have on classification results. To address this, an improved under-sampling method based on clustering fusion and redundancy removal is proposed. A clustering algorithm is used to obtain the cluster centers of the high-density regions of the majority-class samples, and the majority-class samples are partitioned into subsets accordingly. A similarity redundancy coefficient is then computed for each subset and used to delete redundant majority-class samples, which achieves the under-sampling. After under-sampling 15 data sets with different balance rates, a cost-sensitive hybrid-attribute multi-decision tree model is used for classification. Experimental results show that, without reducing classification accuracy on the imbalanced data sets, the proposed method improves the true positive rate of the minority-class samples and the G-mean of the prediction model.

Key words: imbalanced data sets, clustering algorithm, under-sampling, redundancy removal, multi-decision tree prediction model
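
The abstract describes the pipeline only in outline, so the following is a minimal Python sketch under stated assumptions, not the authors' implementation. Plain k-means stands in for the clustering step, and the paper's similarity redundancy coefficient, which is not defined on this page, is replaced by a hypothetical distance threshold `eps`. The names `kmeans`, `undersample_majority`, and `g_mean` are illustrative. G-mean is computed in its usual form, the geometric mean of the true positive rate and the true negative rate.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Returns the cluster index given to each point in the last assignment step.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def undersample_majority(majority, k=2, eps=0.5, seed=0):
    """Cluster the majority class, then drop near-duplicates in each cluster.

    Hypothetical redundancy rule (not the paper's coefficient): a sample is
    redundant if it lies within `eps` of a sample already kept from the
    same cluster.
    """
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        cluster_kept = []
        for p, l in zip(majority, labels):
            if l == c and all(math.dist(p, q) > eps for q in cluster_kept):
                cluster_kept.append(p)
        kept.extend(cluster_kept)
    return kept


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of TPR and TNR; assumes both classes occur in y_true."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    return math.sqrt((tp / pos) * (tn / neg))


if __name__ == "__main__":
    # Toy majority class: two tight groups plus one isolated point.
    majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
    print(undersample_majority(majority, k=2, eps=0.5))
    # TPR = 1/2 and TNR = 2/2, so G-mean = sqrt(0.5), about 0.707.
    print(g_mean([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this sketch the minority class is left untouched and only the majority class is thinned, which matches the under-sampling setting described in the abstract. The cost-sensitive multi-decision tree classifier is outside the scope of the example.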

CLC number: