Computer Engineering (计算机工程) ›› 2023, Vol. 49 ›› Issue (2): 81-89. doi: 10.19678/j.issn.1000-3428.0063871

• Artificial Intelligence and Pattern Recognition •

Classification Algorithm for Structured Imbalanced Data Based on Convolutional Neural Network

XU Hong1, JIAO Guie2,3, ZHANG Wenjun2, CHEN Yimin3

  1. College of Information Technology, Shanghai Ocean University, Shanghai 201306, China;
  2. Shanghai Film Academy, Shanghai University, Shanghai 200072, China;
  3. College of Information Technology, Shanghai Jian Qiao University, Shanghai 201306, China
  • Received: 2022-01-30 Revised: 2022-03-10 Published: 2022-07-05
  • About the authors: XU Hong (born 1994), female, master's student, whose main research interests are big data mining and data analysis; JIAO Guie (corresponding author), associate professor; ZHANG Wenjun and CHEN Yimin, professors, Ph.D., doctoral supervisors.
  • Funding:
    National Natural Science Foundation of China (61572434); Shanghai Science and Technology Innovation Action Plan projects (19511104502, 16511101200); Shanghai Science and Technology Commission fund (19DZ22048).

Abstract: Convolutional Neural Networks (CNNs) are widely used in image processing, object tracking, natural language processing, and other fields because of their efficient feature extraction and small number of parameters. To address the poor performance of traditional classification models on structured imbalanced data, this study proposes a CNN-based binary classification algorithm for structured imbalanced data. A structured data-processing algorithm called Data-Shuffle is designed to convert the original imbalanced one-dimensional structured data into multi-channel imbalanced data in the form of a three-dimensional array, which provides the CNN with more feature values. Convolution groups suited to imbalanced data are then built from an improved VGG network to extract different features. On this basis, an updated-weight weighted sampling algorithm, UWSCNN, is proposed: after each iteration, error-prone samples are reweighted according to the model's training results in order to optimize training. Experimental results on the adult, shoppers, and diabetes datasets show that, compared with traditional machine learning models such as logistic regression and random forest, the proposed Data-Shuffle algorithm improves F1 by 1%-19% and G-mean by 2%-24%. Compared with sampling algorithms such as SMOTECNN, BSMOTECNN, and SMOTECNN+CS, the proposed UWSCNN algorithm improves classification performance on imbalanced data by 1%-13%, showing that it effectively strengthens the model on imbalanced data.

Key words: imbalanced data, structured data, VGG network, deep learning, Convolutional Neural Network (CNN)
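The abstract states that UWSCNN reweights error-prone samples after each iteration but does not give the update rule. The following is only a minimal sketch of that general idea, not the authors' algorithm: it assumes a multiplicative boost for misclassified samples followed by renormalization, and it draws the next batch in proportion to the weights. The function names and the `boost` parameter are hypothetical.

```python
import numpy as np


def update_sample_weights(weights, y_true, y_pred, boost=2.0):
    """Scale up the weights of misclassified samples, then renormalize.

    `boost` is a hypothetical hyperparameter; the abstract does not
    specify how UWSCNN actually computes the new weights.
    """
    weights = np.asarray(weights, dtype=float)
    wrong = np.asarray(y_true) != np.asarray(y_pred)
    new_weights = np.where(wrong, weights * boost, weights)
    return new_weights / new_weights.sum()


def sample_batch(rng, weights, batch_size):
    """Draw sample indices with probability proportional to their weights."""
    return rng.choice(len(weights), size=batch_size, replace=True, p=weights)


# Toy round: three majority samples, one minority sample that was misclassified.
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])
weights = np.full(4, 0.25)

weights = update_sample_weights(weights, y_true, y_pred)
batch = sample_batch(np.random.default_rng(0), weights, batch_size=8)
```

In this toy round the misclassified minority sample ends up with twice the sampling probability of each correctly classified sample, so it is more likely to appear in the next batch.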

CLC number:
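The results in the abstract are reported as F1 and G-mean. As a reminder of what these measure, here is a short sketch using the standard definitions for binary classification: F1 is the harmonic mean of precision and recall, and G-mean is the geometric mean of recall (sensitivity) and specificity. It is assumed, not confirmed by the abstract, that the paper uses these standard forms; the helper names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Imbalanced toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

f1 = f1_score(y_true, y_pred)
gm = g_mean(y_true, y_pred)
```

On these toy labels plain accuracy is 8/10 even though only half of the minority class is found, which is why metrics such as F1 and G-mean are preferred for imbalanced data.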