
Computer Engineering, 2022, Vol. 48, Issue (6): 174-181. doi: 10.19678/j.issn.1000-3428.0062241

• Advanced Computing and Data Processing •

Integrated Enhancement Method for Multi-Center Data Based on Max-Min Distance

CAO Ruiyang, GUO Youmin, NIU Manyu

  1. Mechatronics T&R Institute, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Received: 2021-08-01 Revised: 2021-09-27 Published: 2021-10-15
  • About the authors: CAO Ruiyang (born 1995), male, master's student, whose main research interest is data analysis; GUO Youmin, professor; NIU Manyu, master's student.
  • Funding:
    National Natural Science Foundation of China (72061021).

Abstract: Data augmentation (data enhancement) is an effective way to address class imbalance in datasets. Existing augmentation methods, however, can generate samples that cross class boundaries, and the randomness of the generated samples is poor. This study therefore proposes MCA, a multi-center data augmentation method based on the max-min distance. The method computes a weighted density for every sample to reduce the influence of outliers on the final classification result. It then combines a sampling method with the max-min distance algorithm to select the best data and build a multi-center point set, which keeps generated samples from crossing class boundaries, broadens the diversity of the sample data, and lowers the time complexity. On this basis, a weight function is constructed from the similarity between samples, and new samples are generated as weighted averages, which resolves the imbalance of the original dataset. Experiments on the SwedishLeaf dataset and on a measured dataset show that, compared with SMOTE, Easy Ensemble, RR, and other methods, the proposed method improves precision and recall by more than 1.17% and the F1 score by more than 2%. It effectively improves generalization and classifies well when the imbalance ratio between minority-class and majority-class samples is high.

Key words: data enhancement, max-min distance, weighted density, sampling method, sample size, deep residual network
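
The two core steps named in the abstract, selecting multiple centers by max-min distance and generating new samples as similarity-weighted averages, can be illustrated with a minimal Python sketch. This is not the authors' MCA implementation: the abstract gives no formulas, so the starting point, the exponential similarity weight, the `bandwidth` parameter, and the random jitter on the weights are all assumptions made here for illustration, and the weighted-density outlier filter is omitted entirely. The function names `max_min_centers` and `synthesize` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k, start=0):
    """Pick up to k center indices by greedy max-min distance selection.

    Each new center is the point whose distance to its nearest already
    chosen center is largest. Starting from index `start` is an assumption.
    """
    centers = [start]
    # min_dist[i] = distance from point i to its nearest chosen center
    min_dist = [euclidean(p, points[start]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:  # all remaining points coincide with a center
            break
        centers.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], euclidean(p, points[nxt]))
    return centers


def synthesize(points, center_idx, rng, bandwidth=1.0):
    """Generate one synthetic sample as a weighted average of the centers.

    ASSUMPTION: the weight of a center is exp(-distance / bandwidth) to a
    randomly chosen reference center, scaled by a random factor so that
    repeated calls give different samples. The paper's actual weight
    function is not stated in the abstract.
    """
    ref = points[rng.choice(center_idx)]
    weights = [
        math.exp(-euclidean(points[i], ref) / bandwidth) * rng.uniform(0.5, 1.0)
        for i in center_idx
    ]
    total = sum(weights)
    # Positive weights normalized by their sum give a convex combination,
    # so the new sample stays inside the convex hull of the chosen centers.
    return [
        sum(w * points[i][d] for w, i in zip(weights, center_idx)) / total
        for d in range(len(ref))
    ]


if __name__ == "__main__":
    # Toy minority-class samples (made-up data, for illustration only).
    minority = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0),
                (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    rng = random.Random(0)
    centers = max_min_centers(minority, k=3)
    new_samples = [synthesize(minority, centers, rng) for _ in range(4)]
    print(centers)
    print(new_samples)
```

Because every synthetic point is a convex combination of points from the same class, it cannot fall outside the region spanned by that class's centers, which is one plausible reading of how a multi-center scheme avoids the boundary-crossing problem the abstract describes.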
