Robust Scene Graph Generation Network for Traffic Scenes

doi:10.19678/j.issn.1000-3428.0069154

Abstract:

Traffic scene graphs provide a structured representation of traffic scenes and play an important role in intelligent transportation. Current scene graph generation methods predict the relationships between entity pairs to generate unbiased scene graphs. However, because of the long-tailed distribution of datasets and the ambiguous feature representation of entity relationships, the traffic scene graphs generated by existing methods fail to provide accurate and semantically rich traffic scene information for downstream tasks. To address these issues, this study proposes CSE-CFGB, a traffic scene graph generation network that combines Contextual Semantic Embedding (CSE) with Coarse-Fine-Grained Blending (CFGB). The CSE module establishes distinctive semantic representations of entities and predicates, and the CFGB network then predicts the relationship predicates between entities robustly. Within CFGB, the Main Branch (MB) uses the CSE representations to predict relationships between entities directly; the Coarse-grained Branch (CB) uses a reweighting mechanism to learn robust features of head predicates; and the Fine-grained Branch (FB) uses a Logit adjustment method to refine the learning of tail predicates. A branch weight table lets the two auxiliary branches cooperate to help the MB balance its predictions for head and tail predicates. In experiments on the Visual Genome dataset, the proposed network achieves Mean@50 and Mean@100 of 49.5% and 51.7%, respectively, on the PredCls task, indicating that it effectively addresses ambiguous entity relationship representation and the long-tailed dataset distribution during model training.

Key words: scene graph generation, long-tail distribution, feature representation, Contextual Semantic Embedding (CSE), Coarse-Fine-Grained Blending (CFGB)
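
The abstract names three ingredients for handling long-tailed predicates: reweighting, Logit adjustment, and a branch weight table that blends auxiliary branches into the main branch. The sketch below illustrates these ideas on toy data. It is not the authors' implementation: the inverse-frequency weights, the post-hoc form of logit adjustment, the additive blending rule, and every function name and number here are assumptions made for illustration only.

```python
import math


def class_priors(counts):
    """Empirical predicate frequencies from training counts."""
    total = sum(counts)
    return [c / total for c in counts]


def inverse_frequency_weights(counts):
    """Reweighting (assumed scheme): rarer predicates get larger loss weights.

    Weights are rescaled so that their mean is 1.
    """
    raw = [1.0 / c for c in counts]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[target] * math.log(softmax(logits)[target])


def logit_adjust(logits, priors, tau=1.0):
    """Post-hoc logit adjustment (assumed form): subtract tau * log(prior).

    Frequent (head) predicates are penalised, rare (tail) ones are boosted.
    """
    return [z - tau * math.log(p) for z, p in zip(logits, priors)]


def blend(main, coarse, fine, table):
    """Add auxiliary-branch logits to the main branch, per predicate.

    `table` holds one hypothetical (w_coarse, w_fine) pair per predicate.
    """
    return [
        m + w_cb * c + w_fb * f
        for m, c, f, (w_cb, w_fb) in zip(main, coarse, fine, table)
    ]


# Toy long-tailed counts for three predicates (made-up numbers).
counts = [1000, 100, 10]
priors = class_priors(counts)
weights = inverse_frequency_weights(counts)

raw_logits = [2.0, 1.5, 1.4]          # the head predicate wins before adjustment
adjusted = logit_adjust(raw_logits, priors)

print("loss weights:", weights)
print("adjusted logits:", adjusted)
```

With these toy numbers the raw logits favour the most frequent predicate, while the adjusted logits favour the rarest one, which is the intended effect of shifting predictions toward the tail.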

ZHOU Wei, MIN Weidong. Robust Scene Graph Generation Network for Traffic Scenes[J]. Computer Engineering, 2025, 51(9): 231-241.

https://www.ecice06.com/CN/Y2025/V51/I9/231

Figures/Tables: 7

Fig.1 Quantity distribution of relationship labels on the Visual Genome dataset

Fig.2 Overall structure of the coarse-fine-grained blending network based on contextual semantic embedding

Fig.3 Performance comparison between Motifs and CSE-CFGB on the Visual Genome dataset

Fig.4 Visualization results of Motifs and CSE-CFGB on the PredCls task
