
Computer Engineering ›› 2025, Vol. 51 ›› Issue (11): 162-170. doi: 10.19678/j.issn.1000-3428.0069983

• Cyberspace Security •

Transferability Enhancement of Adversarial Sample Directed Targeted Attack Based on Feature Fusion

LING Hai*, LING Jie

  1. School of Computer, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2024-06-11  Revised: 2024-07-17  Online: 2025-11-15  Published: 2024-09-11
  • Contact: LING Hai
  • Supported by: Key-Area Research and Development Program of Guangzhou (202007010004)


Abstract:

Adversarial examples can be used to mount transferable attacks on black-box models via surrogate models, without knowledge of the black-box model's internal structure or parameters; however, the targeted attacks on black-box models reported in existing studies show relatively weak transferability. This study proposes a feature-fusion-based method for enhancing the transferability of targeted attacks on images. First, adversarial examples are generated via a model-ensemble attack. Then, taking the gradient direction of the existing adversarial examples as a baseline, clean features extracted from the original image are used as perturbations to fine-tune the existing adversarial examples, improving the transferability of the targeted attack. For model ensembling, a gradient-adaptive module is introduced that weights each model by its contribution to the overall adversarial objective. To reduce gradient discrepancies among models, a gradient filter is proposed to synchronously control the gradient direction. A feature-fusion module then mixes in clean features of the original image to fine-tune the gradient direction of the existing adversarial examples, mitigating over-reliance on specific features. Comparative experiments on the ImageNet-Compatible dataset show that, relative to the Clean Feature Mixup (CFM) method, the proposed method improves the average attack success rate by 7.7 percentage points on non-robustly trained models, and by 5.3 percentage points on robustly trained models and Transformer models, demonstrating the effectiveness of the method.
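The pipeline the abstract describes — a targeted ensemble attack whose per-step gradient is evaluated on inputs mixed with clean features of the original image, with models weighted adaptively by how far they remain from the target — can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the authors' implementation: the surrogates are toy linear classifiers, the "feature fusion" is a simple convex mix of the adversarial and clean inputs, and the adaptive weights are normalized per-model target losses.

```python
import numpy as np

# Hypothetical sketch of a targeted ensemble attack with clean-feature
# mixing. Toy linear surrogates stand in for real networks; the mixing
# rule and adaptive weighting are illustrative assumptions only.

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def make_linear_model(d, k):
    """A toy surrogate classifier: logits = W @ x."""
    return rng.normal(size=(k, d)) / np.sqrt(d)

def targeted_grad(W, x, target):
    """Gradient w.r.t. x of the cross-entropy loss toward `target`."""
    p = softmax(W @ x)
    onehot = np.zeros(len(p))
    onehot[target] = 1.0
    return W.T @ (p - onehot)

def ensemble_targeted_attack(models, x_clean, target,
                             eps=0.5, alpha=0.05, steps=40, mix=0.1):
    """Sign-gradient targeted attack over an ensemble. Each step, the
    gradient is computed on a convex combination of the current adversarial
    input and the clean input (a stand-in for the feature-fusion module),
    and per-model gradients are weighted by each model's remaining target
    loss (a stand-in for the gradient-adaptive module)."""
    x_adv = x_clean.copy()
    for _ in range(steps):
        x_mixed = (1 - mix) * x_adv + mix * x_clean   # clean-feature mixing
        grads = [targeted_grad(W, x_mixed, target) for W in models]
        losses = [-np.log(softmax(W @ x_mixed)[target] + 1e-12)
                  for W in models]
        w = np.array(losses)
        w /= w.sum()                                  # favour harder models
        g = sum(wi * gi for wi, gi in zip(w, grads))
        x_adv = x_adv - alpha * np.sign(g)            # descend toward target
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)
    return x_adv

d, k = 32, 5
models = [make_linear_model(d, k) for _ in range(3)]
x = rng.normal(size=d)
target = 2
x_adv = ensemble_targeted_attack(models, x, target, eps=0.5)
# Inspect how the surrogates now rank the target class
preds = [int(np.argmax(W @ x_adv)) for W in models]
```

In this sketch the mixing coefficient plays the role the abstract assigns to clean features: by evaluating gradients on a partially clean input, the update is discouraged from overfitting to features specific to one surrogate, which is the intuition behind the transferability gain.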

Key words: deep learning, adversarial attacks, adversarial examples, directed targeted attacks, transferability