[1] ANTOL S, AGRAWAL A, LU J S, et al.VQA:visual question answering[C]//Proceedings of 2015 IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2015:2425-2433. [2] WU Q, WANG P, SHEN C H, et al.Ask me anything:free-form visual question answering based on knowledge from external sources[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:4622-4630. [3] LU J, YANG J, BATRA D, et al.Hierarchical question-image co-attention for visual question answering[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.New York, USA:ACM Press, 2016:289-297. [4] NOH H, SEO P H, HAN B.Image question answering using convolutional neural network with dynamic parameter prediction[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:30-38. [5] LI R, JIA J.Visual question answering with Question Representation Update(QRU)[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.New York, USA:ACM Press, 2016:4655-4663. [6] JAIN U, ZHANG Z Y, SCHWING A.Creativity:generating diverse questions using variational autoencoders[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:6485-6494. [7] TENEY D, LIU L Q, VAN DEN HENGEL A.Graph-structured representations for visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:1-9. [8] KLINGLER F, DRESSLER F, CAO J N, et al.MCB-a multi-channel beaconing protocol[J].Ad Hoc Networks, 2016, 36:258-269. [9] SOTO-VALERO C.Predicting win-loss outcomes in MLB regular season games-a comparative study using data mining methods[J].International Journal of Computer Science in Sport, 2016, 15(2):91-112. [10] YU Z, YU J, FAN J P, et al.Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:1821-1830. [11] YU Z, YU J, CUI Y H, et al.Deep modular co-attention networks for visual question answering[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6281-6290. [12] ANDERSON P, HE X D, BUEHLER C, et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:6077-6086. [13] SHRESTHA R, KAFLE K, KANAN C.Answer them all! toward universal visual question answering models[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:10472-10481. [14] GAO P, JIANG Z K, YOU H X, et al.Dynamic fusion with intra- and inter-modality attention flow for visual question answering[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6639-6648. [15] HE K M, ZHANG X Y, REN S Q, et al.Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:770-778. [16] 施政, 毛力, 孙俊.基于YOLO的多模态加权融合行人检测算法[J].计算机工程, 2021, 47(8):234-242. SHI Z, MAO L, SUN J.YOLO-based multi-modal weighted fusion pedestrian detection algorithm[J].Computer Engineering, 2021, 47(8):234-242.(in Chinese) [17] 顾砾, 季怡, 刘纯平.基于多模态特征融合的三维点云分类方法[J].计算机工程, 2021, 47(2):279-284. GU L, JI Y, LIU C P.Classification method of three-dimensional point cloud based on multiple modal feature fusion[J].Computer Engineering, 2021, 47(2):279-284.(in Chinese) [18] REN S Q, HE K M, GIRSHICK R, et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149. [19] SANTORO A, RAPOSO D, BARRETT D G T, et al.A simple neural network module for relational reasoning[EB/OL].[2021-07-11].https://arxiv.org/abs/1706.01427. [20] TENEY D, ANDERSON P, HE X D, et al.Tips and tricks for visual question answering:learnings from the 2017 challenge[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:4223-4232. [21] KIM J H, JUN J, ZHANG B T.Bilinear attention networks[EB/OL].[2021-07-11].https://arxiv.org/abs/1805.07932. [22] PENNINGTON J, SOCHER R, MANNING C.GloVe:global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.Stroudsburg, USA:Association for Computational Linguistics, 2014:1532-1543. [23] HOCHREITER S, SCHMIDHUBER J.Long short-term memory[J].Neural Computation, 1997, 9(8):1735-1780. [24] GOYAL Y, KHOT T, SUMMERS-STAY D, et al.Making the V in VQA matter:elevating the role of image understanding in visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:6904-6913. [25] KINGMA D P, BA J.Adam:a method for stochastic optimization[EB/OL].[2021-07-11].https://arxiv.org/abs/1412.6980. [26] YU Z, YU J, XIANG C C, et al.Beyond bilinear:generalized multimodal factorized high-order pooling for visual question answering[J].IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12):5947-5959. |