Computer Engineering ›› 2019, Vol. 45 ›› Issue (11): 54-61. doi: 10.19678/j.issn.1000-3428.0054988

• Advanced Computing and Data Processing •

Imbalanced Data Classification Algorithm Based on CSD-ELM

WANG Dafei^a, XIE Wujie^b, DONG Wenhan^b

  1. a. Graduate School; b. College of Aeronautics Engineering, Air Force Engineering University, Xi'an 710038, China
  • Received: 2019-05-22 Revised: 2019-06-25 Published: 2019-06-29
  • About the authors: WANG Dafei (born 1984), male, master's student; main research interests: data mining and machine learning. XIE Wujie and DONG Wenhan, professors, Ph.D.
  • Funding:
    Aeronautical Science Foundation of China (20141396012).

Abstract: When applied to imbalanced data classification, Extreme Learning Machine (ELM) algorithms based on cost-sensitive learning do not consider the distribution characteristics of samples in different classes or the importance of individual samples within the same class, both of which affect the classification results. To address this, we propose a method for setting the misclassification penalty factor based on the ratio of sample counts, and we design a scheme for determining intra-class sample weights based on Mini-batch k-means clustering and a distance measure. On this basis, we construct a hidden-layer output matrix that distinguishes the positive and negative classes. Depending on the relationship between the number of training samples and the number of ELM hidden-layer nodes, the connection weights between the hidden layer and the output layer are computed in one of two cases, which reduces the time complexity of the algorithm. Experimental results show that, compared with ELM, WELM, and other algorithms, the proposed algorithm achieves higher G-mean and F1 scores.
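The two-case computation of the output weights can be illustrated with the standard weighted ELM (WELM) closed form. The sketch below is not the authors' CSD-ELM: the abstract does not give the exact penalty-factor formula, the k-means-based intra-class weights, or the class-specific hidden-layer matrix, so the code assumes a simple inverse-class-size weighting, a sigmoid hidden layer, and labels in {+1, -1}. The function names and the symbols `H` (hidden-layer output), `w` (diagonal of the weight matrix), `t` (targets), and `C` (regularization constant) are all illustrative assumptions.

```python
import numpy as np

def class_ratio_weights(t):
    """Per-sample penalty weights, inversely proportional to class size.
    Assumption: WELM-style 1/(class count); not the paper's exact formula."""
    t = np.asarray(t)
    n_pos = np.sum(t == 1)
    n_neg = np.sum(t == -1)
    return np.where(t == 1, 1.0 / n_pos, 1.0 / n_neg)

def hidden_output(X, A, b):
    """Sigmoid hidden-layer output matrix H of shape (N, L)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def beta_small_n(H, w, t, C):
    """Case N < L: solve an N x N system.
    beta = H^T (I/C + W H H^T)^{-1} W t, with W = diag(w)."""
    N = H.shape[0]
    M = np.eye(N) / C + (w[:, None] * H) @ H.T
    return H.T @ np.linalg.solve(M, w * t)

def beta_large_n(H, w, t, C):
    """Case N >= L: solve an L x L system.
    beta = (I/C + H^T W H)^{-1} H^T W t, with W = diag(w)."""
    L = H.shape[1]
    M = np.eye(L) / C + H.T @ (w[:, None] * H)
    return np.linalg.solve(M, H.T @ (w * t))

def fit_beta(H, w, t, C=1.0):
    """Pick the cheaper of the two equivalent forms based on N versus L."""
    N, L = H.shape
    if N < L:
        return beta_small_n(H, w, t, C)
    return beta_large_n(H, w, t, C)

def g_mean_f1(y_true, y_pred):
    """G-mean and F1 for labels in {+1, -1}, with +1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(np.sqrt(recall * specificity)), float(f1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic imbalanced data: 90 negative and 10 positive samples.
    X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
    t = np.concatenate([-np.ones(90), np.ones(10)])
    L = 20  # number of hidden nodes (illustrative choice)
    A, b = rng.normal(size=(2, L)), rng.normal(size=L)
    H = hidden_output(X, A, b)
    beta = fit_beta(H, class_ratio_weights(t), t, C=100.0)
    pred = np.where(H @ beta >= 0, 1, -1)
    print(g_mean_f1(t, pred))
```

The two forms give the same solution for any positive weights; the choice only affects cost, since the linear system has size N x N in the first case and L x L in the second.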

Key words: imbalanced data, Extreme Learning Machine(ELM), cost-sensitive learning, Mini-batch k-means clustering, constrained optimization theory

CLC number: