Abstract: Imbalanced data learning (IDL) is one of the research issues in machine learning. This paper presents a classification algorithm based on undersampling. The algorithm undersamples the majority class, retaining the majority examples that lie near the classification boundary. With AUC as the optimization objective, it chooses the most appropriate neighborhood radius so that the data set becomes balanced, and trains a Bayesian classifier on the examples that remain after undersampling. With AUC also used to evaluate classifier performance, experiments on simulated data and UCI data sets show that the algorithm is effective.
Key words: machine learning, classification algorithm, imbalanced data, undersampling, neighborhood
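The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact neighborhood rule, so the sketch assumes that a majority example is "near the boundary" when it lies within a given radius of at least one minority example, that the Bayesian classifier is a Gaussian naive Bayes model, and that the radius is picked from a small candidate list by AUC on a held-out validation set. All function names (`undersample`, `fit_gaussian_nb`, `score`, `auc`, `select_radius`) and the synthetic data are hypothetical.

```python
import math
import random

def undersample(majority, minority, radius):
    """Keep majority points within `radius` of at least one minority point (assumed rule)."""
    return [x for x in majority
            if any(math.dist(x, m) <= radius for m in minority)]

def fit_gaussian_nb(pos, neg):
    """Estimate log-prior, feature means and variances per class (Gaussian naive Bayes)."""
    model = {}
    total = len(pos) + len(neg)
    for label, rows in ((1, pos), (0, neg)):
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Small constant avoids zero variance.
        varis = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                 for c, m in zip(cols, means)]
        model[label] = (math.log(len(rows) / total), means, varis)
    return model

def score(model, x):
    """Log-odds of the minority (positive) class for point x."""
    def log_joint(label):
        prior, means, varis = model[label]
        return prior + sum(-0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                           for v, m, s in zip(x, means, varis))
    return log_joint(1) - log_joint(0)

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def select_radius(train_maj, train_min, val_maj, val_min, radii):
    """Return (best_auc, best_radius, kept_count) over the candidate radii."""
    best = None
    for r in radii:
        kept = undersample(train_maj, train_min, r)
        if len(kept) < 2:  # too few majority points left to fit the model
            continue
        model = fit_gaussian_nb(train_min, kept)
        a = auc([score(model, x) for x in val_min],
                [score(model, x) for x in val_maj])
        if best is None or a > best[0]:
            best = (a, r, len(kept))
    return best

# Synthetic two-dimensional data with a 10:1 class ratio (illustrative only).
rng = random.Random(0)

def sample(cx, cy, n):
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]

train_maj, train_min = sample(0, 0, 300), sample(2, 2, 30)
val_maj, val_min = sample(0, 0, 150), sample(2, 2, 15)
best_auc, best_r, kept = select_radius(train_maj, train_min, val_maj, val_min,
                                       [0.5, 1.0, 1.5, 2.0])
print(best_auc, best_r, kept)
```

A smaller radius discards more majority examples, so in this sketch the candidate list controls how close to balance the training set can get; the validation set is left at its original class ratio so that the AUC reflects the imbalanced setting.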
CHENG Xian-Feng, LI Jun, LI Xiong-Fei. Imbalanced Data Classification Algorithm Based on Undersampling[J]. Computer Engineering, 2011, 37(13): 147-149.