Research on Optimization of Classification Model for Imbalanced Data Set

doi:10.3969/j.issn.1000-3428.2018.04.043

Computer Engineering (计算机工程)

WEN Xueyan¹, CHEN Jianan¹, JING Weipeng¹, XU Kesheng²

(1. College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; 2. Harbin Forestry Machinery Research Institute, State Forestry Administration, Harbin 150086, China)

Received: 2017-11-06 Online: 2018-04-15 Published: 2018-04-15
About the authors: WEN Xueyan (born 1971), male, associate professor, M.S., whose main research interests are machine learning and data mining; CHEN Jianan, master's student; JING Weipeng, associate professor, Ph.D.; XU Kesheng, research fellow.
Funding: National Key Research and Development Program of China (2016YFD0702105).

Abstract

Abstract: To improve the classification efficiency on imbalanced data sets, this paper builds a classification model that is optimized in two respects: sample sampling and the classification algorithm. Minority-class samples on the decision boundary are cyclically oversampled to generate a new sample set, which is then merged with the minority-class sample set synthesized outside the decision boundary, increasing the importance of the samples. To address the hyperplane offset that the traditional ε-Support Vector Machine (ε-SVM) exhibits when classifying imbalanced data sets, positive and negative penalty coefficients and a mixed kernel function are introduced, and the penalty coefficients are selected with the objective entropy value method, improving the performance of the classification algorithm. Experimental results show that, compared with the standard SVM algorithm, the proposed model raises the F-measure on imbalanced data sets by 18.1% on average and achieves better classification results.

Key words: text categorization, imbalanced data set, data mining, sample resampling, entropy method
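The abstract names several building blocks without giving formulas: synthesizing minority-class samples, a mixed kernel function, and the F-measure. The following is a minimal Python sketch of commonly used forms of each, not the authors' implementation. The function names, the SMOTE-style linear interpolation, the convex combination of a polynomial and an RBF kernel, and all default parameter values (`lam`, `gamma`, `degree`, `coef0`, `beta`) are assumptions made for illustration. The selection of the positive and negative penalty coefficients by the entropy value method is not sketched, because the abstract does not describe how it is done.

```python
import math
import random


def smote_interpolate(x, neighbor, rng):
    """Synthesize one point on the segment between a minority-class sample
    and one of its minority-class neighbors (SMOTE-style interpolation;
    assumed here, the paper's exact sampling rule is not given)."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def mixed_kernel(x, z, lam=0.5, gamma=1.0, degree=2, coef0=1.0):
    """Convex combination of a polynomial kernel and an RBF kernel.
    lam in [0, 1] is a hypothetical mixing weight: lam=1 gives the pure
    polynomial kernel, lam=0 the pure RBF kernel."""
    dot = sum(a * b for a, b in zip(x, z))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    poly = (dot + coef0) ** degree
    rbf = math.exp(-gamma * sq_dist)
    return lam * poly + (1.0 - lam) * rbf


def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F-measure of the positive (minority) class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    rng = random.Random(0)
    print(smote_interpolate([0.0, 0.0], [2.0, 4.0], rng))
    print(mixed_kernel([1.0, 0.0], [1.0, 0.0]))
    print(f_measure([1, 1, 0, 0], [1, 0, 0, 0]))
```

The F-measure is computed on the minority class because plain accuracy can look high on an imbalanced data set even when most minority samples are misclassified.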

CLC number: TP311

WEN Xueyan, CHEN Jianan, JING Weipeng, XU Kesheng. Research on Optimization of Classification Model for Imbalanced Data Set[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2018/V44/I4/268
