[1] JOHNSON J, KRISHNA R, STARK M, et al.Image retrieval using scene graphs[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:3668-3678. [2] AGRAWAL A, LU J S, ANTOL S, et al.VQA:visual question answering[J].International Journal of Computer Vision, 2017, 123(1):4-31. [3] JOHNSON J, HARIHARAN B, VAN DER MAATEN L, et al.CLEVR:a diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:2901-2910. [4] YAO T, PAN Y W, LI Y H, et al.Exploring visual relationship for image captioning[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2018:684-699. [5] CHANG A, SAVVA M, MANNING C D.Learning spatial knowledge for text to 3D scene generation[C]//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing.Stroudsburg, USA:Association for Computational Linguistics, 2014:2028-2038. [6] REN S Q, HE K M, GIRSHICK R, et al.Faster RCNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149. [7] YAO B P, LI F F.Modeling mutual context of object and human pose in human-object interaction activities[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2010:17-24. [8] GUO G D, LAI A.A survey on still image based human action recognition[J].Pattern Recognition, 2014, 47(10):3343-3361. [9] LU C W, KRISHNA R J, BERNSTEIN M, et al.Visual relationship detection with language priors[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2016:852-869. [10] YANG J W, LU J S, LEE S, et al.Graph RCNN for scene graph generation[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2018:670-685. [11] XU D F, ZHU Y K, CHOY C B, et al.Scene graph generation by iterative message passing[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:5410-5419. [12] ZHANG H W, KYAW Z, CHANG S F, et al.Visual translation embedding network for visual relation detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:3107-3115. [13] ZELLERS R, YATSKAR M, THOMSON S, et al.Neural MOTIFS:scene graph parsing with global context[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:5831-5840. [14] KRISHNA R, ZHU Y K, GROTH O, et al.Visual Genome:connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision, 2017, 123(1):32-73. [15] HOCHREITER S, SCHMIDHUBER J.Long short-term memory[J].Neural Computation, 1997, 9(8):1735-1780. [16] CHEN T S, YU W H, CHEN R Q, et al.Knowledge-embedded routing network for scene graph generation[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6163-6171. [17] GIRSHICK R, DONAHUE J, DARRELL T, et al.Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2014:580-587. [18] REDMON J, DIVVALA S, GIRSHICK R, et al.You only look once:unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:779-788. [19] REDMON J, FARHADI A.YOLO9000:better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:7263-7271. [20] WANG X L, SHRIVASTAVA A, GUPTA A.A-Fast-RCNN:hard positive generation via adversary for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:2606-2615. [21] LIU W, ANGUELOV D, ERHAN D, et al.SSD:single shot multibox detector[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2016:21-37. [22] HE K M, GKIOXARI G, DOLLAR P, et al.Mask RCNN[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:2961-2969. [23] YU R C, LI A, MORARIU V I, et al.Visual relationship detection with internal and external linguistic knowledge distillation[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:1068-1076. [24] CUI Z, XU C Y, ZHENG W M, et al.Context-dependent diffusion network for visual relationship detection[C]//Proceedings of the 26th ACM International Conference on Multimedia.New York, USA:ACM Press, 2018:1475-1482. [25] PENNINGTON J, SOCHER R, MANNING C.GloVe:global vectors for word representation[C]//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing.Stroudsburg, USA:Association for Computational Linguistics, 2014:1532-1543. [26] VASWANI A, SHAZEER N, PARMAR N, et al.Attention is all you need[C]//Proceedings of Conference on Neural Information Processing Systems.Cambridge, UK:MIT Press, 2017:5998-6008. [27] LIN T Y, DOLLAR P, GIRSHICK R, et al.Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:2117-2125. [28] XIE S N, GIRSHICK R, DOLLAR P, et al.Aggregated residual transformations for deep neural networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:1492-1500. [29] ZHANG Y, HARE J, PRUGEL-BENNETT A.Learning to count objects in natural images for visual question answering[C]//Proceedings of International Conference on Learning Representations.New York, USA:ACM Press, 2018:3755. [30] TANG K H, ZHANG H W, WU B Y, et al.Learning to compose dynamic tree structures for visual contexts[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6619-6628. [31] NEWELL A, DENG J.Pixels to graphs by associative embedding[C]//Proceedings of Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:2172-2181. [32] LIN X, DING C X, ZENG J Q, et al.GPS-Net:graph property sensing network for scene graph generation[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:3746-3753. [33] HUNG Z S, MALLYA A, LAZEBNIK S.Union visual translation embedding for visual relationship detection and scene graph generation[EB/OL].[2021-07-04].https://arxiv.org/abs/1905.11624. |