Scene Graph Generation Model Combined with External Knowledge Base and Adaptive Reasoning

doi:10.19678/j.issn.1000-3428.0062268

Abstract

Abstract: To capture informative context in a Scene Graph Generation (SGG) network while reducing the impact of dataset bias on SGG performance, this study proposes an SGG model based on an external knowledge base and adaptive reasoning. First, an object detection module combined with an external knowledge base introduces linguistic prior knowledge, improving the accuracy of relationship-category detection for entity pairs. Second, a Transformer-based context extraction module processes the candidate boxes and the entity-pair relationship categories with two Transformer encoder layers and merges context in stages through self-attention, yielding important global context. Finally, because relationship frequencies follow a long-tailed distribution, an adaptive reasoning module with feature-specific fusion softens the distribution and classifies relationships adaptively according to the visual appearance of each entity pair, which alleviates the long-tail problem and improves the model's reasoning ability. Experimental results on the Visual Genome (VG) dataset show that, compared with the MOTIFS model, the proposed model increases Top-100 Recall (Recall@100, R@100) on the Predicate Classification (PredCls), Scene Graph Classification (SGCls), and Scene Graph Generation (SGGen) subtasks by 1.4, 4.3, and 7.1 percentage points, respectively. The model also generates better scene graphs for most relationship categories.

Key words: scene graph, visual relationship, external knowledge base, attention mechanism, adaptive reasoning
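
The abstract mentions "softening the distribution" to counter the long-tailed relationship frequencies but gives no formula. As a rough illustration only, the sketch below uses temperature scaling, one common way to flatten a skewed frequency distribution. It is not taken from the paper: the function name `soften`, the `temperature` parameter, and the predicate counts are all hypothetical.

```python
import math


def soften(counts, temperature=2.0):
    """Turn raw relationship counts into a temperature-softened distribution.

    temperature = 1 reproduces the empirical frequencies; larger values
    flatten the distribution so that rare (tail) predicates keep more mass.
    All counts must be strictly positive.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space: p_i is proportional to count_i ** (1 / temperature).
    logits = [math.log(c) / temperature for c in counts]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical long-tailed counts for three predicates (illustrative only).
counts = [9000, 900, 100]
empirical = soften(counts, temperature=1.0)
softened = soften(counts, temperature=2.0)
```

With these made-up counts, the softened distribution gives the rarest predicate a larger share than its empirical frequency and the most frequent predicate a smaller one. A classifier that uses such a prior would then have to rely more on the visual appearance of the entity pair to separate candidates.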

CLC number: TP391

WANG Yini, GAO Yongbin, WAN Weibing, YANG Shuqun, GUO Ruyan. Scene Graph Generation Model Combined with External Knowledge Base and Adaptive Reasoning[J]. Computer Engineering, 2022, 48(9): 230-238.

http://www.ecice06.com/CN/Y2022/V48/I9/230

Figures/Tables: 9

