
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 96-104. doi: 10.19678/j.issn.1000-3428.0062339

• Artificial Intelligence and Pattern Recognition •

  • About the authors: FU Pengcheng (born 1993), male, M.S. candidate; his research interests include machine learning and image processing. YANG Guan (corresponding author), associate professor, Ph.D.; LIU Xiaoming and LIU Yang, lecturers, Ph.D.; ZHANG Ziming and CHENG Xi, M.S. candidates.
  • Funding: National Natural Science Foundation of China (61772576, 61906141); Natural Science Foundation of Shaanxi Province (2020JQ-317); Science and Technology Research Program of Henan Province (182102210126).

Visual Question Answering Model Based on Spatial Relation and Frequency Feature

FU Pengcheng1,2, YANG Guan1,2, LIU Xiaoming1,2, LIU Yang3, ZHANG Ziming1,2, CHENG Xi1,2   

  1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China;
    2. Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China;
    3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
  • Received:2021-08-12 Revised:2021-10-04 Published:2021-10-19



Abstract: As an important task in multimodal data processing, Visual Question Answering (VQA) must associate and represent information from different modalities. However, existing VQA models cannot effectively distinguish similar target objects and cannot accurately express the spatial relationships between them, which degrades overall model performance. To fully exploit the fine-grained and spatial-relationship information in VQA images and questions, this study combines spatial-domain and frequency-domain features with the Bottom-Up and Top-Down attention (BUTD) model and the Modular Co-Attention Network (MCAN) model to construct a multi-dimensional enhanced attention model, BUDR, and a modular co-enhanced attention network model, MCDR. The BUDR and MCDR models use the Discrete Cosine Transform (DCT) to obtain frequency information, mitigating the loss of image detail, and a Relation Network (RN) to learn spatial structure and latent relational information, reducing the misalignment of image and question features and enhancing the models' reasoning capability. Experimental results on the VQA v2.0 dataset and the test-dev validation set show that the BUDR and MCDR models enhance fine-grained image recognition and improve the correlation between the image and the target objects of the question; compared with the BUTD and MCAN models, their prediction accuracy increases by 0.14 and 0.25 percentage points, respectively.
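The two ingredients named in the abstract, DCT-based frequency features and Relation-Network-style pairwise aggregation over region features, can be sketched in isolation. The snippet below is a minimal illustration, not the paper's implementation: the region features are random, the shapes are assumptions, and a fixed linear map stands in for the RN's learned MLP.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical region features: 36 detected regions, 64-dim each
# (the paper's actual dimensions and fusion details are not given here).
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 64))

def frequency_features(x, k=16):
    """Type-II DCT along the feature axis; keep the k lowest-frequency
    coefficients as a compact frequency-domain descriptor."""
    return dct(x, type=2, axis=-1, norm="ortho")[..., :k]

def relation_features(x):
    """Relation-Network-style aggregation: apply a shared function g to
    every ordered pair of regions and sum the results into one vector."""
    n, d = x.shape
    # Stand-in for g: a fixed random linear map + ReLU (an RN would learn an MLP).
    W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    pairs = np.concatenate(
        [np.repeat(x, n, axis=0),      # each region repeated n times
         np.tile(x, (n, 1))], axis=1)  # all regions cycled -> (n*n, 2d) pairs
    return np.maximum(pairs @ W, 0).reshape(n, n, d).sum(axis=(0, 1))

freq = frequency_features(regions)                 # (36, 16) per-region frequency codes
rel = relation_features(regions)                   # (64,) pooled relational vector
fused = np.concatenate([freq.mean(axis=0), rel])   # naive fusion sketch, (80,)
print(freq.shape, rel.shape, fused.shape)
```

In BUDR/MCDR these two streams would feed the BUTD or MCAN attention stack rather than a simple concatenation; the sketch only shows where frequency and relational information come from.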

Key words: Discrete Cosine Transform(DCT), fine-grained identification, Relation Network(RN), attention mechanism, feature fusion

CLC Number: