Computer Engineering (计算机工程), 2011, Vol. 37, Issue (15): 122-124. doi: 10.3969/j.issn.1000-3428.2011.15.038

• Artificial Intelligence and Recognition Technology •

An Improved Classification Method for Unbalanced Datasets

ZHAO Xiu-kuan 1, YANG Jian-hong 2, LI Min 2, XU Jin-wu 2

  1. Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing 100029, China
  2. School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Received: 2011-02-17 Online: 2011-08-05 Published: 2011-08-05
  • About the authors: ZHAO Xiu-kuan (b. 1982), male, engineer, Ph.D., research interests: pattern recognition and intelligent monitoring; YANG Jian-hong, associate professor, Ph.D.; LI Min, lecturer, Ph.D.; XU Jin-wu, professor, Ph.D.
  • Supported by:
    National Natural Science Foundation of China (50705069, 50905013, 50934007); Specialized Research Fund for the Doctoral Program of Higher Education (20090006120007); Fundamental Research Funds for the Central Universities (FRF-TP-09-014A)

Abstract: When applied to unbalanced classification problems, traditional machine learning methods yield strongly biased classifiers: the recognition rate of the minority class is far lower than that of the majority class. To address this, an improved method for unbalanced data, called deflection forest, is proposed on the basis of the rotation forest classification method. It over-samples the minority class to change the distribution of the training data and reduce the imbalance, and it uses random sampling to ensure that the samples used to generate each deflection matrix differ from one another, which improves the classification accuracy of the ensemble classifier. Experimental results show that the method achieves good classification performance, with a higher recognition rate for the minority class and a lower recognition error rate for the majority class.

Key words: unbalanced dataset, deflection forest, ensemble classifier, over-sampling

CLC Number:
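The abstract names two sampling steps: over-sampling the minority class, and drawing a different random sample for each deflection matrix. The abstract gives no algorithmic detail, so the following is only a rough sketch of those two steps under stated assumptions. It assumes plain random duplication for over-sampling and sampling without replacement for the per-matrix draws. The function names `oversample_minority` and `draw_matrix_samples`, the `fraction` parameter, and the toy data are all hypothetical and do not come from the paper. The PCA step and the decision-tree training of rotation forest are deliberately left out.

```python
import random
from collections import Counter


def oversample_minority(X, y, rng):
    """Randomly duplicate rows of smaller classes until all classes match the largest.

    Assumption: simple random duplication; the paper may use a different scheme.
    """
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def draw_matrix_samples(X, n_features, n_subsets, fraction, rng):
    """Split features into disjoint subsets and draw a fresh random row sample for each.

    Returns a list of (feature_indices, rows_restricted_to_those_features).
    In rotation forest each such sample would feed a PCA; that step is omitted.
    """
    features = list(range(n_features))
    rng.shuffle(features)
    subsets = [features[i::n_subsets] for i in range(n_subsets)]
    size = max(1, int(fraction * len(X)))
    samples = []
    for cols in subsets:
        rows = rng.sample(X, size)  # independent draw per subset, without replacement
        samples.append((cols, [[row[c] for c in cols] for row in rows]))
    return samples


# Hypothetical toy data: 4 minority rows (label 1) and 16 majority rows (label 0).
rng = random.Random(0)
X = [[float(i), float(i % 3), float(i % 5), float(i % 7)] for i in range(20)]
y = [1 if i < 4 else 0 for i in range(20)]

X_bal, y_bal = oversample_minority(X, y, rng)
class_counts = Counter(y_bal)
matrix_samples = draw_matrix_samples(
    X_bal, n_features=4, n_subsets=2, fraction=0.75, rng=rng
)
```

After balancing, both classes in the toy data have the same number of rows. Each feature subset receives its own random sample, which is the property the abstract credits with improving the accuracy of the ensemble.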