[1] KULKARNI G, PREMRAJ V, ORDONEZ V, et al. BabyTalk: understanding and generating simple image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
[2] LI S, KULKARNI G, BERG T, et al. Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the 15th Conference on Computational Natural Language Learning. Stroudsburg, USA: Association for Computational Linguistics, 2011: 220-228.
[3] HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899.
[4] MAO J H, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks[EB/OL]. [2022-03-12]. https://arxiv.org/abs/1410.1090.
[5] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2015: 3156-3164.
[6] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
[7] YANG Z L, YUAN Y, WU Y X, et al. Encode, review, and decode: reviewer module for caption generation[EB/OL]. [2022-03-12]. https://arxiv.org/abs/1605.07912.
[8] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York, USA: ACM Press, 2015: 2048-2057.
[9] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 3242-3250.
[10] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[EB/OL]. [2022-03-12]. https://arxiv.org/abs/1707.07998.
[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL]. [2022-03-12]. https://arxiv.org/abs/1706.03762.
[12] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2022-03-12]. https://arxiv.org/abs/2010.11929.
[13] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[EB/OL]. [2022-03-12]. https://arxiv.org/abs/2103.14030.
[14] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[15] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2014: 740-755.
[16] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: Association for Computational Linguistics, 2002: 311-318.
[17] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2015: 4566-4575.
[18] DENKOWSKI M, LAVIE A. Meteor Universal: language specific translation evaluation for any target language[C]//Proceedings of the Ninth Workshop on Statistical Machine Translation. Stroudsburg, USA: Association for Computational Linguistics, 2014: 376-380.
[19] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out. Barcelona, Spain: [s.n.], 2004: 74-81.
[20] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2009: 248-255.
[21] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2016: 770-778.
[22] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 1179-1195.
[23] YAO T, PAN Y W, LI Y H, et al. Boosting image captioning with attributes[C]//Proceedings of the IEEE International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2017: 4904-4912.
[24] 韦人予, 蒙祖强. 基于注意力特征自适应校正的图像描述模型[J]. 计算机应用, 2020, 40(S1): 45-50. WEI R Y, MENG Z Q. Image caption model based on attention feature adaptive recalibration[J]. Journal of Computer Applications, 2020, 40(S1): 45-50. (in Chinese)
[25] ZHONG X, NIE G Z, HUANG W X, et al. Attention-guided image captioning with adaptive global and local feature fusion[J]. Journal of Visual Communication and Image Representation, 2021, 78: 103138.
[26] YUN J, XU Z W, GAO G L. Gated object-attribute matching network for detailed image caption[J]. Mathematical Problems in Engineering, 2020, 2020: 1-11.