
Computer Engineering ›› 2025, Vol. 51 ›› Issue (5): 196-205. doi: 10.19678/j.issn.1000-3428.0068963

• Advanced Computing and Data Processing •

Synthesis Method for Minority Samples with Constraints of Coaxial Symmetric Parabolas

ZHU Chenmin1, YU Su2

  1. College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China;
  2. Informatization Office, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2023-12-06 Revised: 2024-02-22 Online: 2025-05-15 Published: 2024-05-15
  • Corresponding author: YU Su, E-mail: yusu@sues.edu.cn
  • Funding:
    National Key Technology R&D Program of China (2015BAF10B00); Scientific Research Program of the Shanghai Science and Technology Commission (17511110204).

Abstract: Linear interpolation is a common sample synthesis method in oversampling techniques. Its drawback is that a linear synthesis region tends to increase the overlap between samples of different classes and to reduce the randomness of the sampling results, which makes it difficult to improve classification performance on imbalanced sample sets. To address this, a sample synthesis method constrained by coaxial symmetric parabolas is proposed. First, an adaptive weighting strategy is established that assigns a weight to each minority class sample through a risk factor and a similarity factor; this weight determines the direction and range of sample synthesis during sampling. Then, a pair of coaxial symmetric parabolas is constructed from the minority class samples and their corresponding weights, and the closed region enclosed by the two intersecting parabolas is taken as the nonlinear sample synthesis region. Finally, when a new sample is introduced, the change in the Bhattacharyya coefficient within the neighborhood of that sample is used to judge whether the sampling step effectively avoids intruding into the distribution regions of other classes, thereby improving sampling quality. Experiments on six public sample sets from the UCI Machine Learning Repository show that, with C4.5 as the classifier, the integrated oversampling method improves precision by 7.85 percentage points, recall by 2.87 percentage points, and G-means by 2.00 percentage points compared with the original sampling method.

Key words: linear interpolation, parabolas, adaptive weighting, nonlinear synthesis region, Bhattacharyya coefficient
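
The abstract refers to two standard building blocks: linear interpolation between a minority sample and a neighbor, and the Bhattacharyya coefficient between two distributions. The Python sketch below illustrates only these textbook definitions. It does not reproduce the paper's adaptive weights or its parabola construction, which the abstract does not specify. The function names, the use of class-proportion distributions over a neighborhood, and the toy labels are all assumptions made for illustration.

```python
import math
import random


def linear_interpolate(x, neighbor, rng=random):
    """SMOTE-style synthesis: return a random point on the segment x -> neighbor."""
    lam = rng.random()  # interpolation coefficient drawn from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


def class_proportions(labels, classes):
    """Discrete class distribution of the labels found in a neighborhood."""
    n = len(labels)
    return [labels.count(c) / n for c in classes]


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Every linearly interpolated sample lies on one line segment,
    # which is the limited synthesis region the abstract criticizes.
    synthetic = linear_interpolate([0.0, 0.0], [2.0, 4.0], random.Random(0))
    print("synthetic sample:", synthetic)

    # Hypothetical neighborhood labels before and after adding a minority sample.
    classes = ["majority", "minority"]
    before = class_proportions(["majority", "majority", "minority", "minority"], classes)
    after = class_proportions(
        ["majority", "majority", "minority", "minority", "minority"], classes
    )
    print("Bhattacharyya coefficient:", bhattacharyya_coefficient(before, after))
```

The coefficient equals 1 for identical distributions and 0 for distributions with no shared support, so a shift in its value after a candidate sample is added gives one plausible signal of how much the local class mix has changed. How the paper thresholds or interprets that change is not stated in the abstract.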

CLC Number: