
Computer Engineering ›› 2008, Vol. 34 ›› Issue (19): 1-3. doi: 10.3969/j.issn.1000-3428.2008.19.001

• Doctoral Dissertation •

Posterior-probability-based Feature Selection Algorithm for Imbalanced Datasets

CAO Su-qun (曹苏群)1,2, WANG Shi-tong (王士同)1, CHEN Xiao-feng (陈晓峰)1

  1. School of Information, Jiangnan University, Wuxi 214122
  2. Department of Mechanical Engineering, Huaiyin Institute of Technology, Huaian 223001
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-10-05 Published: 2008-10-05

Abstract: A posterior-probability-based feature selection algorithm is proposed for imbalanced datasets. The algorithm introduces an imbalance factor estimated by the Parzen-window method and iterates from the midpoints of Tomek links to find points on the decision boundary where the posterior probabilities of the classes are equal. Projecting the normal vectors at these boundary points yields a weight for each feature, which indicates that feature's importance. Experimental results on three real-world datasets show that, without degrading the overall performance of the classifier, the algorithm effectively reduces dimensionality and computational cost, and avoids the tendency of conventional feature selection algorithms to detect the majority class well while ignoring the minority class.

Key words: imbalanced datasets, feature selection, posterior probability
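
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch for two classes. It is not the authors' implementation, and several pieces are assumptions made for illustration only: all function names are hypothetical; class priors are taken as class frequencies in place of the paper's imbalance factor, whose definition is not given here; the boundary point is located by bisection along each Tomek link (the first point evaluated is the link midpoint) instead of the paper's iteration; normals are obtained by finite differences; and the "projection" is taken to be the mean absolute component of the unit normals.

```python
import numpy as np

def tomek_links(X, y):
    """Index pairs (i, j), i < j: mutual nearest neighbours with different labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    nn = d2.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def discriminant(x, X0, X1, h):
    """Prior-weighted Gaussian Parzen estimate of class 1 minus class 0.

    Proportional to P(1|x) - P(0|x), so it is zero where the posteriors are
    equal. Priors are class frequencies (an assumption); the shared kernel
    normalising constant is dropped because it does not move the zero set.
    """
    n = len(X0) + len(X1)
    k0 = np.exp(-((X0 - x) ** 2).sum(axis=1) / (2 * h * h)).sum()
    k1 = np.exp(-((X1 - x) ** 2).sum(axis=1) / (2 * h * h)).sum()
    return (k1 - k0) / n

def boundary_point(a, b, X0, X1, h, iters=50):
    """Bisect the segment a-b for discriminant == 0; None if no sign change."""
    ga, gb = discriminant(a, X0, X1, h), discriminant(b, X0, X1, h)
    if ga * gb > 0:
        return None
    for _ in range(iters):
        m = (a + b) / 2.0  # first iteration: the Tomek-link midpoint
        gm = discriminant(m, X0, X1, h)
        if ga * gm <= 0:
            b = m
        else:
            a, ga = m, gm
    return (a + b) / 2.0

def unit_normal(x, X0, X1, h, eps=1e-5):
    """Central-difference gradient of the discriminant at x, scaled to unit length."""
    grad = np.array([(discriminant(x + eps * e, X0, X1, h)
                      - discriminant(x - eps * e, X0, X1, h)) / (2 * eps)
                     for e in np.eye(len(x))])
    return grad / np.linalg.norm(grad)

def feature_weights(X, y, h=1.0):
    """Mean absolute component of the boundary normals, one weight per feature."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    normals = []
    for i, j in tomek_links(X, y):
        p = boundary_point(X[i], X[j], X0, X1, h)
        if p is not None:
            normals.append(unit_normal(p, X0, X1, h))
    if not normals:
        return np.zeros(X.shape[1])  # no usable link found
    return np.abs(normals).mean(axis=0)
```

Calling `feature_weights(X, y, h)` with labels 0 and 1 returns one non-negative weight per feature; features along which the estimated boundary normals point receive the larger weights. The bandwidth `h` is a free parameter of this sketch and would need tuning on real data.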
