[1] HOSSAIN M D Z, SOHEL F, SHIRATUDDIN M F, et al. A comprehensive survey of deep learning for image captioning[J]. ACM Computing Surveys, 2019, 51(6): 1-36.
[2] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of European Conference on Computer Vision. Zurich, Switzerland: [s.n.], 2014: 740-755.
[3] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE Press, 2016: 770-778.
[4] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663.
[5] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[EB/OL]. [2020-04-01]. https://arxiv.org/abs/1502.03044.
[6] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[EB/OL]. [2020-04-01]. https://arxiv.org/abs/1612.01887.
[7] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE Press, 2018: 6077-6086.
[8] YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE Press, 2016: 4651-4659.
[9] DING S, QU S, XI Y, et al. Image caption generation with high-level image features[J]. Pattern Recognition Letters, 2019, 123: 89-95.
[10] PENG Y Q, LIU X, WANG W H, et al. Image caption model of double LSTM with scene factors[J]. Image and Vision Computing, 2019, 86(12): 38-44.
[11] HE X W, SHI B G, BAI X, et al. Image caption generation with part of speech guidance[J]. Pattern Recognition Letters, 2019, 119: 229-237.
[12] ZHANG W, NIE W B, LI X L, et al. Image caption generation with adaptive transformer[C]//Proceedings of the 34th Youth Academic Annual Conference of Chinese Association of Automation. Jinzhou, China: [s.n.], 2019: 521-526.
[13] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL Press, 2002: 311-318.
[14] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: ACL Press, 2005: 65-72.
[15] ANNE H L, VENUGOPALAN S, ROHRBACH M, et al. Deep compositional captioning: describing novel object categories without paired training data[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE Press, 2016: 1-10.
[16] YAO T, PAN Y W, LI Y H, et al. Incorporating copying mechanism in image captioning for learning novel objects[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE Press, 2017: 6580-6588.
[17] LI Y H, YAO T, PAN Y W, et al. Pointing novel objects in image captioning[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE Press, 2019: 12497-12506.
[18] LU J, YANG J, BATRA D, et al. Neural baby talk[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE Press, 2018: 7219-7228.
[19] BIZER C, LEHMANN J, KOBILAROV G, et al. DBpedia: a crystallization point for the Web of data[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2009, 7(3): 154-165.
[20] LIU H, SINGH P. ConceptNet: a practical commonsense reasoning tool-kit[J]. BT Technology Journal, 2004, 22(4): 211-226.
[21] ZHOU Y M, SUN Y W, HONAVAR V. Improving image captioning by leveraging knowledge graphs[EB/OL]. [2020-04-01]. https://arxiv.org/abs/1901.08942.
[22] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE Press, 2017: 7263-7271.
[23] LU D, WHITEHEAD S, HUANG L, et al. Entity-aware image caption generation[C]//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: ACL Press, 2018: 4013-4023.
[24] MOGADALA A, BISTA U, XIE L, et al. Knowledge guided attention and inference for describing images containing unseen objects[C]//Proceedings of European Semantic Web Conference. Heraklion, Greece: Springer, 2018: 415-429.
[25] XU S K, JI C C, NI C H, et al. Image description generation model integrating construction scenes and spatial relationship[J]. Computer Engineering, 2020, 46(6): 256-265. (in Chinese)
[26] BORDES A, USUNIER N, GARCIA-DURAN A, et al. Translating embeddings for modeling multi-relational data[C]//Proceedings of NIPS'13. Lake Tahoe, USA: MIT Press, 2013: 2787-2795.
[27] HSIEH T I, LO Y C, CHEN H T, et al. One-shot object detection with co-attention and co-excitation[C]//Proceedings of NIPS'19. [S.l.]: MIT Press, 2019: 2721-2730.
[28] KANG B Y, LIU Z, WANG X, et al. Few-shot object detection via feature reweighting[C]//Proceedings of IEEE International Conference on Computer Vision. Seoul, South Korea: IEEE Press, 2019: 8420-8429.
[29] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[30] RISTOSKI P, PAULHEIM H. RDF2Vec: RDF graph embeddings for data mining[C]//Proceedings of International Semantic Web Conference. Kuala Lumpur, Malaysia: Springer, 2016: 498-514.
[31] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL Press, 2014: 1532-1543.
[32] WILLIAMS R J, ZIPSER D. A learning algorithm for continually running fully recurrent neural networks[J]. Neural Computation, 1989, 1(2): 270-280.
[33] FAN Q, ZHUO W, TANG C K, et al. Few-shot object detection with attention-RPN and multi-relation detector[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE Press, 2020: 4013-4022.
[34] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE Press, 2009: 248-255.
[35] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[EB/OL]. [2020-04-01]. https://arxiv.org/abs/1912.01703.
[36] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of Workshop on Text Summarization Branches Out at ACL. Barcelona, Spain: ACL Press, 2004: 74-81.
[37] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[EB/OL]. [2020-04-01]. https://arxiv.org/abs/1607.08822.
[38] VENUGOPALAN S, ANNE H L, ROHRBACH M, et al. Captioning images with diverse objects[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE Press, 2017: 5753-5761.