[1] Bi Y, Jiang H, Hu Y, et al. See and learn more: Dense caption-aware representation for visual question answering[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1135-1146.
[2] Chen Q H, Xiang S X, Fang X, et al. A visual question answering method with cross-modal adaptive feature fusion[J/OL]. Journal of Harbin Institute of Technology, 1-13 [2025-03-25]. http://kns.cnki.net/kcms/detail/23.1235.T.20250314.1036.004.html. (in Chinese)
[3] Ge Y L, Sun H C, Yuan D Y. A visual question answering model integrating multimodal knowledge and supervised retrieval[J/OL]. Journal of Frontiers of Computer Science and Technology, 1-17 [2025-03-25]. (in Chinese)
[4] Ni Q, Liu S, Yu Y Z, et al. A video question answering cognitive model based on multi-angle fusion and joint memory network[J]. Journal of Shanghai Normal University (Natural Sciences), 2024, 53(05): 596-603. DOI: 10.20192/j.cnki.JSHNU(NS).2024.05.003. (in Chinese)
[5] Antol S, Agrawal A, Lu J, et al. VQA: Visual question
answering[C]//Proceedings of the IEEE international
conference on computer vision. 2015: 2425-2433.
[6] Zhang J, Liu X, Wang Z. Latent attention network with position perception for visual question answering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024: 1-11.
[7] Goyal Y, Khot T, Summers-Stay D, et al. Making the V in
VQA matter: Elevating the role of image understanding in
visual question answering[C]//Proceedings of the IEEE
conference on computer vision and pattern recognition.
2017: 6904-6913.
[8] Gao D, Wang R, Shan S, et al. CRIC: A VQA dataset for
compositional reasoning on vision and commonsense[J].
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022, 45(5): 5561-5578.
[9] Agrawal A, Batra D, Parikh D, et al. Don't just assume;
look and answer: Overcoming priors for visual question
answering[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018: 4971-4980.
[10] Han X, Wang S, Su C, et al. Greedy gradient ensemble for
robust visual question answering[C]//Proceedings of the
IEEE/CVF international conference on computer vision.
2021: 1584-1593.
[11] Liu J, Fan C F, Zhou F, et al. Be flexible! Learn to debias
by sampling and prompting for robust visual question
answering[J]. Information Processing & Management,
2023, 60(3): 103296.
[12] Cho J W, Kim D J, Ryu H, et al. Generative bias for robust
visual question answering[C]//Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2023: 11681-11690.
[13] Nam J, Cha H, Ahn S, et al. Learning from failure: Training debiased classifier from biased classifier[J]. arXiv preprint arXiv:2007.02561, 2020.
[14] Anderson P, He X, Buehler C, et al. Bottom-up and
top-down attention for image captioning and visual
question answering[C]//Proceedings of the IEEE
conference on computer vision and pattern recognition.
2018: 6077-6086.
[15] Tan H, Bansal M. LXMERT: Learning cross-modality
encoder representations from transformers[J]. arXiv
preprint arXiv:1908.07490, 2019.
[16] Cadene R, Dancette C, et al. RUBi: Reducing unimodal
biases for visual question answering[J]. Advances in
neural information processing systems, 2019, 32.
[17] Chen L, Yan X, Xiao J, et al. Counterfactual samples
synthesizing for robust visual question
answering[C]//Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition. 2020:
10800-10809.
[18] Si Q, et al. Towards robust visual question answering:
Making the most of biased samples via contrastive
learning[J]. arXiv preprint arXiv:2210.04563, 2022.
[19] Liu Y, Guo Y, Yin J, et al. Answer questions with right
image regions: A visual attention regularization
approach[J]. ACM Transactions on Multimedia Computing,
Communications, and Applications (TOMM), 2022, 18(4):
1-18.
[20] Clark C, Yatskar M, Zettlemoyer L. Don't take the easy
way out: Ensemble based methods for avoiding known
dataset biases[J]. arXiv preprint arXiv:1909.03683, 2019.
[21] Han X, Wang S, Su C, et al. General greedy de-bias
learning[J]. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2023, 45(8): 9789-9805.
[22] Kolling C, More M, Gavenski N, et al. Efficient
counterfactual debiasing for visual question
answering[C]//Proceedings of the IEEE/CVF winter
conference on applications of computer vision. 2022: 3001-3010.
[23] Chen L, Zheng Y, Xiao J. Rethinking data augmentation
for robust visual question answering[C]//European
conference on computer vision. Cham: Springer Nature
Switzerland, 2022: 95-112.
[24] Liang Z, Jiang W, Hu H, et al. Learning to contrast the
counterfactual samples for robust visual question
answering[C]//Proceedings of the 2020 conference on
empirical methods in natural language processing
(EMNLP). 2020: 3285-3292.
[25] Zhu X, Mao Z, Liu C, et al. Overcoming language priors
with self-supervised learning for visual question
answering[J]. arXiv preprint arXiv:2012.11528, 2020.
[26] Si Q, Lin Z, Zheng M, et al. Check it again: Progressive
visual question answering via visual entailment[J]. arXiv
preprint arXiv:2106.04605, 2021.
[27] Guo Y, Nie L, Cheng Z, et al. Loss re-scaling VQA:
Revisiting the language prior problem from a
class-imbalance view[J]. IEEE Transactions on Image
Processing, 2021, 31: 227-238.
[28] Basu A, Addepalli S, Babu R V. RMLVQA: A margin loss
approach for visual question answering with language
biases[C]//Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2023:
11671-11680.
[29] Han S, Pool J, Tran J, et al. Learning both weights and
connections for efficient neural network[J]. Advances in
neural information processing systems, 2015, 28.
[30] Molchanov P, Tyree S, Karras T, et al. Pruning
convolutional neural networks for resource efficient
inference[J]. arXiv preprint arXiv:1611.06440, 2016.
[31] Liu Z, Li J, Shen Z, et al. Learning efficient convolutional
networks through network slimming[C]//Proceedings of
the IEEE international conference on computer vision.
2017: 2736-2744.
[32] Zhu M, Gupta S. To prune, or not to prune: exploring the
efficacy of pruning for model compression[J]. arXiv
preprint arXiv:1710.01878, 2017.
[33] Frankle J, Carbin M. The lottery ticket hypothesis: Finding
sparse, trainable neural networks[J]. arXiv preprint
arXiv:1803.03635, 2018.
[34] Sanh V, Wolf T, Belinkov Y, et al. Learning from others'
mistakes: Avoiding dataset biases without modeling
them[J]. arXiv preprint arXiv:2012.01300, 2020.
[35] Jang E, Gu S, Poole B. Categorical reparameterization
with Gumbel-Softmax[J]. arXiv preprint arXiv:1611.01144,
2016.
[36] Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083,
2015.
[37] Pennington J, Socher R, Manning C D. GloVe: Global
vectors for word representation[C]//Proceedings of the
2014 conference on empirical methods in natural language
processing (EMNLP). 2014: 1532-1543.
[38] Cho K, Van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint arXiv:1406.1078, 2014.
[39] Yang Z, He X, Gao J, et al. Stacked attention networks for
image question answering[C]//Proceedings of the IEEE
conference on computer vision and pattern recognition.
2016: 21-29.
[40] Kim J H, Jun J, Zhang B T. Bilinear attention networks[J].
Advances in neural information processing systems, 2018,
31.
[41] Bi Y, Jiang H, Hu Y, et al. See and learn more: Dense caption-aware representation for visual question answering[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1135-1146.
[42] Agrawal A, Batra D, Parikh D, et al. Don't just assume;
look and answer: Overcoming priors for visual question
answering[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018: 4971-4980.
[43] Bi Y, Jiang H, Hu Y, et al. Fair attention network for robust visual question answering[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024: 1-12.
[44] Pan Y, Liu J, Jin L, et al. Unbiased visual question answering by leveraging instrumental variable[J]. IEEE Transactions on Multimedia, 2024.