
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Research on Optimization of Classification Model for Imbalanced Data Set

WEN Xueyan 1, CHEN Jianan 1, JING Weipeng 1, XU Kesheng 2

  (1. College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; 2. Harbin Forestry Machinery Research Institute, State Forestry Administration, Harbin 150086, China)
  • Received: 2017-11-06 Online: 2018-04-15 Published: 2018-04-15
  • About the authors: WEN Xueyan (born 1971), male, associate professor, M.S.; main research interests: machine learning and data mining. CHEN Jianan, master's student. JING Weipeng, associate professor, Ph.D. XU Kesheng, research fellow.
  • Supported by:
    National Key Research and Development Program of China (2016YFD0702105).

Abstract: To improve classification on imbalanced data sets, this paper builds a classification model that is optimized in two respects: sample resampling and the classification algorithm. Minority-class samples on the decision boundary are cyclically oversampled to generate a new sample set, which is then merged with the minority-class samples synthesized outside the decision boundary, raising the importance of the samples. To address the hyperplane shift that occurs when the traditional ε-Support Vector Machine (ε-SVM) classifies imbalanced data sets, positive and negative penalty coefficients and a mixed kernel function are introduced, and the penalty coefficients are selected with the objective entropy method, improving the performance of the classification algorithm. Experimental results show that, compared with the standard SVM algorithm, the proposed model raises the F-measure on imbalanced data sets by 18.1% on average.

Key words: text categorization, imbalanced data set, data mining, sample resampling, entropy method
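
The abstract gives no implementation details, so the following is only a minimal sketch of the resampling idea and of the F-measure used for evaluation, not the authors' method. It assumes that boundary minority samples are found with the neighbour-counting rule of Borderline-SMOTE and that new samples are made by linear interpolation between minority samples; the function names, the parameter k and the toy data are all hypothetical. The penalty coefficients, mixed kernel and entropy method are not covered here.

```python
import math
import random


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def boundary_minority(points, labels, minority=1, k=3):
    """Minority indices with at least half of their k neighbours in the majority.

    Assumption: this rule is borrowed from Borderline-SMOTE; the abstract
    does not say how the decision boundary is located.
    """
    found = []
    for i, y in enumerate(labels):
        if y != minority:
            continue
        majority_count = sum(labels[j] != minority for j in k_nearest(points, i, k))
        if majority_count * 2 >= k:
            found.append(i)
    return found


def oversample(points, seeds, partners, n_new, rng):
    """Create n_new synthetic points on segments from a seed to a partner."""
    synthetic = []
    for _ in range(n_new):
        a = points[rng.choice(seeds)]
        b = points[rng.choice(partners)]
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: six majority points (label 0), four minority (label 1).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1),
          (2.5, 0.5), (5, 5), (5, 6), (6, 5)]
labels = [0] * 6 + [1] * 4

rng = random.Random(0)
boundary = boundary_minority(points, labels)
minority_idx = [i for i, y in enumerate(labels) if y == 1]
new_points = oversample(points, boundary, minority_idx, n_new=2, rng=rng)
```

On this toy data only the minority point (2.5, 0.5) is flagged as a boundary sample, and the two synthetic points bring the classes to six samples each. In practice a library implementation such as the Borderline-SMOTE variant in imbalanced-learn would be a more robust starting point than this hand-written loop.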

CLC number: