Posterior-probability-based Feature Selection Algorithm for Imbalanced Datasets

doi:10.3969/j.issn.1000-3428.2008.19.001

Computer Engineering, 2008, Vol. 34, Issue 19: 1-3. doi: 10.3969/j.issn.1000-3428.2008.19.001

• Doctoral Thesis •

CAO Su-qun (1,2), WANG Shi-tong (1), CHEN Xiao-feng (1)

(1. School of Information, Jiangnan University, Wuxi 214122; 2. Department of Mechanical Engineering, Huaiyin Institute of Technology, Huaian 223001)

Received: 1900-01-01  Revised: 1900-01-01  Online: 2008-10-05  Published: 2008-10-05

Abstract

A posterior-probability-based feature selection algorithm is proposed for imbalanced datasets. The algorithm introduces an imbalance factor estimated with the Parzen-window method. Taking the midpoint of each Tomek link as the initial value, it iterates to find points on the decision boundary, that is, points at which the class posterior probabilities are equal. Projecting the normal vectors at these boundary points yields a weight for each feature, which indicates that feature's importance. Experimental results on three real-world datasets show that, for imbalanced data, the algorithm reduces dimensionality and computational cost without degrading the overall performance of the classifier, and avoids the tendency of conventional feature selection algorithms to detect the majority class well while ignoring the minority class.

Key words: imbalanced datasets, feature selection, posterior probability
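
The abstract gives only the outline of the method, so the Python sketch below is one possible reading of that outline and not the authors' implementation. Several points are assumptions: a Gaussian Parzen window of width `h`; class weights `w_pos` and `w_neg` that stand in for the priors and the imbalance factor, whose formula the abstract does not state; bisection along each Tomek link in place of the paper's iteration (the first bisection probe is the link's midpoint); a finite-difference gradient as the boundary normal; and "projection" read as averaging the absolute components of the unit normals. All function names are hypothetical.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def tomek_links(X, y):
    """Index pairs (i, j), i < j: mutual nearest neighbours with different labels."""
    n = len(X)
    nn = [min((k for k in range(n) if k != i), key=lambda k: dist2(X[i], X[k]))
          for i in range(n)]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]

def parzen(x, pts, h):
    """Gaussian Parzen-window density estimate at x from samples pts (width h)."""
    norm = len(pts) * (2 * math.pi * h * h) ** (len(x) / 2)
    return sum(math.exp(-dist2(x, p) / (2 * h * h)) for p in pts) / norm

def make_score(X, y, h, w_pos=1.0, w_neg=1.0):
    """g(x) = w_pos*p(x|1) - w_neg*p(x|0); g(x) = 0 where the weighted posteriors tie.
    The weights are assumed placeholders for the paper's imbalance factor."""
    pos = [x for x, c in zip(X, y) if c == 1]
    neg = [x for x, c in zip(X, y) if c == 0]
    return lambda x: w_pos * parzen(x, pos, h) - w_neg * parzen(x, neg, h)

def boundary_point(g, a, b, iters=50):
    """Bisection on the segment a-b for g = 0 (assumed stand-in for the paper's
    iteration). Returns None if g has the same sign at both endpoints."""
    ga = g(a)
    if ga * g(b) > 0:
        return None
    for _ in range(iters):
        m = [(u + v) / 2 for u, v in zip(a, b)]
        gm = g(m)
        if ga * gm <= 0:
            b = m            # root lies in [a, m]
        else:
            a, ga = m, gm    # root lies in [m, b]
    return [(u + v) / 2 for u, v in zip(a, b)]

def unit_normal(g, x, eps=1e-5):
    """Unit gradient of g at x by central differences (normal to the set g = 0)."""
    grad = []
    for k in range(len(x)):
        hi, lo = list(x), list(x)
        hi[k] += eps
        lo[k] -= eps
        grad.append((g(hi) - g(lo)) / (2 * eps))
    length = math.sqrt(sum(v * v for v in grad))
    return [v / length for v in grad] if length > 0 else None

def feature_weights(X, y, h=1.0):
    """Mean absolute component of the unit normals over all boundary points."""
    g = make_score(X, y, h)
    normals = []
    for i, j in tomek_links(X, y):
        p = boundary_point(g, X[i], X[j])
        n = unit_normal(g, p) if p is not None else None
        if n is not None:
            normals.append(n)
    d = len(X[0])
    if not normals:
        return [0.0] * d
    return [sum(abs(n[k]) for n in normals) / len(normals) for k in range(d)]

# Toy imbalanced data: 8 majority points (label 0), 2 minority points (label 1).
X = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3), (1.9, 0), (1.9, 3)]
y = [0] * 8 + [1] * 2
weights = feature_weights(X, y, h=0.5)
print(weights)
```

On this toy set the two classes separate along feature 0 only, so feature 0 should receive the larger weight. Features with small weights would be the candidates for removal.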

CLC Number: TP181

CAO Su-qun; WANG Shi-tong; CHEN Xiao-feng. Posterior-probability-based Feature Selection Algorithm for Imbalanced Datasets[J]. Computer Engineering, 2008, 34(19): 1-3.

http://www.ecice06.com/CN/Y2008/V34/I19/1

