Open-Set Traffic Object Detection Algorithm Based on Vision-Language Pre-training Model

doi:10.19678/j.issn.1000-3428.0069168

Abstract

Traffic object detection is a key component of intelligent transportation systems, but existing traffic object detection algorithms can detect only predefined object classes and cannot handle open-set scenarios. To address this, an open-set traffic object detection algorithm based on a vision-language pre-training (VLP) model is proposed. First, the prediction network of Faster R-CNN is modified to suit the localization of open-set objects, and the loss function is changed to an Intersection over Union (IoU) loss, which improves localization accuracy. Second, a new VLP-based Label Matching Network (VLP-LMN) is constructed to assign labels to the predicted boxes. The VLP model acts as a strong knowledge base that matches region images with label text, while the prompt engineering and fine-tuning network modules of VLP-LMN better exploit the VLP model and improve its label matching accuracy. Experimental results show that the algorithm achieves an average detection accuracy of 60.3% for new classes on the PASCAL VOC07+12 dataset, demonstrating good open-set object detection performance. On a traffic dataset, the average detection accuracy for new classes reaches 58.9%; obtained as zero-shot detection, this is only 14.5% lower than for the base classes, demonstrating that the algorithm generalizes well to traffic object detection.

Key words: vision-language pre-training (VLP) model, Faster R-CNN, open-set object detection, traffic object detection
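
The abstract says the loss function is changed to an IoU loss but does not give its exact form. As a minimal sketch, assuming the common negative-log form -ln(IoU) and boxes given as (x1, y1, x2, y2) corners, the idea can be illustrated as follows. The function names `iou` and `iou_loss` and the `eps` clamp are illustrative choices, not taken from the paper.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_box, gt_box, eps=1e-7):
    """Assumed form of the loss: -ln(IoU), clamped by eps to avoid log(0)."""
    return -math.log(max(iou(pred_box, gt_box), eps))


if __name__ == "__main__":
    gt = (0, 0, 2, 2)
    print(iou_loss((0, 0, 2, 2), gt))  # perfect overlap gives zero loss
    print(iou_loss((1, 1, 3, 3), gt))  # partial overlap gives a larger loss
```

Unlike a coordinate-wise regression loss, this quantity depends on the four box coordinates jointly through the overlap, which is the usual motivation for IoU-based losses.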

HUANG Qiqiang, AN Guocheng, XIONG Gang. Open-Set Traffic Object Detection Algorithm Based on Vision-Language Pre-training Model[J]. Computer Engineering, 2025, 51(6): 375-384.

https://www.ecice06.com/CN/Y2025/V51/I6/375

Figures/Tables 16

Fig.1 Framework of the open-set traffic object detection algorithm based on the VLP model

Fig.2 Network structure of Faster R-CNN

Fig.3 Original prediction head and OFR-CNN prediction head

Fig.4 Comparison of the two losses

Fig.5 Structure of VLP-LMN

Fig.6 Schematic diagram of the VLP model

Fig.7 Confusion matrix of VLP-LMN on the ground-truth boxes of the validation set

Fig.8 Results of the open-set traffic object detection algorithm
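
The label matching step described in the abstract, in which a VLP model matches region images against label text, can be illustrated with a minimal sketch. It assumes a CLIP-style setup: each label is wrapped in a text prompt, both the prompt and the cropped region are embedded, and the label with the highest cosine similarity wins. The hand-written vectors below stand in for real encoder outputs, and `build_prompts`, `match_label`, the prompt template, and the temperature value are all hypothetical; the paper's actual VLP-LMN modules are not specified here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_prompts(labels, template="a photo of a {}"):
    """Prompt engineering step (assumed template): wrap each label in text."""
    return [template.format(label) for label in labels]


def match_label(region_emb, text_embs, labels, temperature=0.01):
    """Return the best-matching label and a softmax over similarities."""
    sims = [cosine(region_emb, t) for t in text_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    labels = ["car", "bus", "bicycle"]
    prompts = build_prompts(labels)
    # Toy embeddings standing in for text-encoder outputs of the prompts.
    text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Toy embedding standing in for the image-encoder output of one region.
    region_emb = [0.9, 0.3, 0.1]
    label, probs = match_label(region_emb, text_embs, labels)
    print(prompts[labels.index(label)], probs)
```

Because the label set enters only through the text prompts, new classes can be added at inference time by extending `labels`, which is what makes this kind of matching usable for zero-shot detection.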


