Abstract: Traditional machine learning methods, when applied to unbalanced classification problems, yield strongly biased classifiers: the recognition rate of the minority class is far lower than that of the majority class. To address this, the paper proposes deflection forest, an improved method for learning from unbalanced data that builds on the rotation forest classification method. The method over-samples the minority class to change the distribution of the training data and reduce the imbalance, and it draws samples at random so that the samples used to generate each deflection matrix differ from one another, which improves the classification accuracy of the ensemble classifier. Experimental results show that the method achieves good classification performance, with a higher recognition rate for the minority class and a lower recognition error rate for the majority class.
Key words: unbalanced dataset, deflection forest, ensemble classifier, over-sampling
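The over-sampling step and the per-class recognition rates mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which over-sampling scheme is used, so the code below assumes plain random duplication of minority samples until the classes are equal in size. The function names `random_oversample` and `per_class_recall` are hypothetical, and the construction of the deflection matrix is not covered.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    is as large as the largest one (an assumed scheme, not the paper's)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def per_class_recall(y_true, y_pred):
    """Fraction of each class's samples that were labelled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)

# A degenerate classifier that always predicts the majority class.
recalls = per_class_recall(y, [0] * len(y))
```

On the toy data, the always-majority classifier is 80% accurate overall yet recognizes none of the minority samples, which is the bias the abstract describes. After over-sampling, both classes contribute the same number of training samples.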
ZHAO Xiu-Kuan, YANG Jian-Hong, LI Min, XU Jin-Wu. Improved Unbalanced Dataset Classification Method[J]. Computer Engineering, 2011, 37(15): 122-124.