
计算机工程 ›› 2023, Vol. 49 ›› Issue (9): 16-22. doi: 10.19678/j.issn.1000-3428.0066836

• Hot Topics and Reviews •

Textual Adversarial Training Method Based on Distributed Perturbation

Zhidong SHEN, Hengxian YUE

  1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430000, China
  • Received: 2023-01-30 Online: 2023-09-15 Published: 2023-09-14
  • About the authors:

    SHEN Zhidong (born 1975), male, associate professor, Ph.D., CCF member; main research interests: artificial intelligence, distributed systems, and network security

    YUE Hengxian, master's student

  • Funding:
    National Key Research and Development Program of China (2018YFC1604000); Key Research and Development Program of Hubei Province (2022BAA041)


Abstract:

Text adversarial defense aims to enhance the resilience of neural network models against different adversarial attacks. Current textual adversarial defense methods are usually effective only against certain specific attacks and have little effect on attacks based on different principles. To address this deficiency, this paper proposes the Textual Adversarial Distribution Training (TADT) method and formalizes it as a minimax optimization problem: the goal of the inner maximization is to learn the adversarial distribution of each input example, while the goal of the outer minimization is to reduce the number of adversarial examples by minimizing the expected loss. The paper mainly studies attack methods based on gradient descent and synonym replacement. Experimental results on two text classification datasets show that under three different adversarial attacks, namely Probability Weighted Word Saliency (PWWS), Genetic Attack (GA), and Unsupervised Adversarial Training (UAT), the accuracy of TADT is on average 2% higher than that of the latest Dirichlet Neighborhood Ensemble (DNE) method and more than 10% higher than that of other methods. Without affecting accuracy on clean samples, TADT significantly improves model robustness and maintains high accuracy under various adversarial attacks, showing good generalization performance.
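The minimax structure described in the abstract can be written in the standard adversarial distributional training form. This is a sketch rather than the paper's own notation, so the symbols $\theta$, $p_{\phi}$, $\mathcal{P}(x)$, and $\mathcal{L}$ are assumptions:

```latex
\min_{\theta}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}
\left[
  \max_{p_{\phi}\in\mathcal{P}(x)}\
  \mathbb{E}_{x'\sim p_{\phi}}\,
  \mathcal{L}\!\left(f_{\theta}(x'),\,y\right)
\right]
```

Here the inner maximization fits a distribution $p_{\phi}$ over perturbed inputs $x'$ in a neighborhood of $x$ (e.g., synonym substitutions), and the outer minimization updates the model parameters $\theta$ against the expected loss over that distribution rather than against a single worst-case example.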

Key words: textual adversarial distribution, Adversarial Training (AT), variational autoencoder, gradient descent, Monte Carlo sampling
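As a toy illustration of the outer minimization described in the abstract, the sketch below estimates the expected loss over sampled perturbations by Monte Carlo and performs gradient descent on that estimate. This is only a schematic analogue, not the paper's text-domain implementation: a Gaussian perturbation of a scalar input stands in for the learned adversarial distribution over texts, and a finite-difference gradient stands in for backpropagation; all function names are hypothetical.

```python
import random

def loss(theta, x, y):
    """Toy squared-error loss for a 1-D linear 'model' f(x) = theta * x."""
    return (theta * x - y) ** 2

def expected_adv_loss(theta, x, y, sigma, n_samples=100):
    """Monte Carlo estimate of E_{delta ~ N(0, sigma^2)}[loss(theta, x + delta, y)].

    A fixed seed gives common random numbers across calls, so the
    finite-difference gradient below is not swamped by sampling noise.
    """
    rng = random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        delta = rng.gauss(0.0, sigma)
        total += loss(theta, x + delta, y)
    return total / n_samples

def train(theta, data, sigma=0.1, lr=0.01, steps=200):
    """Outer minimization: descend on the expected adversarial loss.

    Uses a central finite difference in place of an analytic gradient.
    """
    eps = 1e-4
    for _ in range(steps):
        for x, y in data:
            g = (expected_adv_loss(theta + eps, x, y, sigma)
                 - expected_adv_loss(theta - eps, x, y, sigma)) / (2 * eps)
            theta -= lr * g
    return theta

# Toy data generated by y = 2x; the minimizer of the expected loss
# sits close to theta = 2, pulled slightly toward 0 by the perturbations.
data = [(1.0, 2.0), (2.0, 4.0)]
theta = train(0.0, data)
```

The design point mirrored here is that the model is trained against an expectation over a perturbation distribution rather than against one adversarial example, which is what gives distribution-based training its robustness to attacks it was not explicitly tuned for.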