Computer Engineering ›› 2011, Vol. 37 ›› Issue (13): 147-149. doi: 10.3969/j.issn.1000-3428.2011.13.047

• Artificial Intelligence and Recognition Technology •

Imbalanced Data Classification Algorithm Based on Undersampling

CHENG Xian-feng 1, LI Jun 2,3, LI Xiong-fei 3

  1. Traffic Police Detachment, Changchun Public Security Bureau, Changchun 130011, China; 2. Dept. of Mathematics, Changchun University of Science and Technology, Changchun 130022, China; 3. Key Laboratory of Symbolic Computation and Knowledge Engineering for Ministry of Education, Jilin University, Changchun 130012, China
  • Received: 2011-02-25 Online: 2011-07-05 Published: 2011-07-05
  • About the authors: CHENG Xian-feng (born 1955), male, senior engineer, main research interests: intelligent transportation and data mining; LI Jun, associate professor, Ph.D.; LI Xiong-fei, professor, doctoral supervisor
  • Supported by:
    National Key Technology R&D Program of China (2006BAK01A33); Key Scientific Research Foundation of the Ministry of Public Security, Category B (20032252001); Science and Technology Development Program of Jilin Province (20070321, 20090704)

Abstract: Imbalanced data learning (IDL) is one of the research issues in machine learning. This paper presents a classification algorithm based on undersampling. The algorithm undersamples the majority class and retains the majority examples that lie near the classification boundary. With AUC as the optimization objective, it chooses the most appropriate neighborhood radius to balance the data set, and trains a Bayesian classifier on the examples that remain after undersampling. Classifier performance is evaluated with AUC, and experiments on simulated data and UCI data sets show that the algorithm is effective.

Key words: machine learning, classification algorithm, imbalanced data, undersampling, neighborhood
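
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: a majority example counts as "near the boundary" when it lies within the neighborhood radius of at least one minority example, the Bayesian classifier is a Gaussian naive Bayes model, and the radius is chosen from a fixed candidate list by AUC on a held-out validation set. The function names (`undersample`, `fit_gaussian_nb`, `select_radius`) and the synthetic data are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def undersample(majority, minority, radius):
    """Assumed rule: keep majority examples within `radius` of some minority example."""
    return [m for m in majority if any(dist(m, p) <= radius for p in minority)]

def fit_gaussian_nb(pos, neg):
    """Fit log-priors and per-feature means/variances for each class."""
    model, total = {}, len(pos) + len(neg)
    for label, rows in ((1, pos), (0, neg)):
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Small constant keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (math.log(len(rows) / total), means, variances)
    return model

def score(model, x):
    """Log-odds that x belongs to the positive (minority) class."""
    def log_joint(label):
        prior, means, variances = model[label]
        return prior + sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                           for xi, m, v in zip(x, means, variances))
    return log_joint(1) - log_joint(0)

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def select_radius(majority, minority, val_major, val_minor, radii):
    """Return (radius, validation AUC, model) for the best candidate radius, or None."""
    best = None
    for r in radii:
        kept = undersample(majority, minority, r)
        if len(kept) < 2:  # too few majority examples left to fit a model
            continue
        model = fit_gaussian_nb(minority, kept)
        a = auc([score(model, x) for x in val_minor],
                [score(model, x) for x in val_major])
        if best is None or a > best[1]:
            best = (r, a, model)
    return best

# Synthetic two-cluster data, purely illustrative.
random.seed(0)

def make_cluster(n, cx, cy):
    return [(random.gauss(cx, 1.0), random.gauss(cy, 1.0)) for _ in range(n)]

majority, minority = make_cluster(300, 0.0, 0.0), make_cluster(30, 2.5, 2.5)
val_major, val_minor = make_cluster(200, 0.0, 0.0), make_cluster(20, 2.5, 2.5)
best = select_radius(majority, minority, val_major, val_minor, [0.5, 1.0, 2.0, 4.0])
print(f"selected radius = {best[0]}, validation AUC = {best[1]:.3f}")
```

A smaller radius discards more majority examples, so sweeping the radius trades class balance against the amount of majority information retained; scoring each candidate by AUC mirrors the optimization objective named in the abstract.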
