Computer Engineering ›› 2025, Vol. 51 ›› Issue (6): 375-384. doi: 10.19678/j.issn.1000-3428.0069168

• Development Research and Engineering Application •

Open-Set Traffic Object Detection Algorithm Based on Vision-Language Pre-training Model

HUANG Qiqiang1, AN Guocheng2,*, XIONG Gang1

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  2. Artificial Intelligence Research Institute, Shanghai Huaxun Network System Co., Ltd., Chengdu 610074, Sichuan, China
  • Received: 2024-01-04 Online: 2025-06-15 Published: 2024-06-03
  • Contact: AN Guocheng

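  • Funding:
    National Key Research and Development Program of China during the 14th Five-Year Plan (2023YFC3006700); National Natural Science Foundation of China (62071293)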

Abstract:

Traffic object detection is a crucial component of intelligent transportation systems. However, existing traffic object detection algorithms can detect only predefined object classes and cannot handle open-set scenarios. To address this, an open-set traffic object detection algorithm based on a Visual-Language Pre-trained (VLP) model is proposed. First, building on Faster R-CNN, the prediction network is modified to handle the localization of open-set objects, and the loss function is replaced with an Intersection over Union (IoU) loss, which improves localization accuracy. Second, a new VLP-based Label Matching Network (VLP-LMN) is constructed to match labels to the predicted bounding boxes. The VLP model serves as a powerful knowledge base that matches region images with label text, while the prompt engineering and fine-tuning network modules of VLP-LMN better exploit the VLP model and improve its label-matching accuracy. Experimental results show that the algorithm achieves an average detection accuracy of 60.3% for new classes on the PASCAL VOC07+12 dataset, demonstrating good open-set object detection performance. On a traffic dataset, the average detection accuracy for new classes reaches 58.9%; although these classes are detected zero-shot, this is only 14.5% lower than for the base classes, which indicates that the algorithm generalizes well in traffic object detection.
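
The two ideas named in the abstract, an IoU-based localization loss and label matching between region images and label text, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact form of the IoU loss or the matching rule, so the sketch assumes the common `1 - IoU` loss and a cosine-similarity match. The function names (`iou_loss`, `match_label`), the corner box format, and the toy embedding vectors are all hypothetical; in practice the embeddings would come from the image and text encoders of a VLP model.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred: Box, target: Box) -> float:
    """Assumed loss form 1 - IoU; the paper may use a different variant."""
    return 1.0 - iou(pred, target)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_label(
    region_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the label whose text embedding is most similar to the region."""
    scores = {label: cosine(region_emb, emb) for label, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the paper.
    print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
    labels = {"car": [0.9, 0.1], "tricycle": [0.0, 1.0]}
    print(match_label([1.0, 0.0], labels))
```

Because the label set is passed in as text embeddings at matching time, a new class can be added by supplying one more entry in `text_embs`, which is the property that makes this kind of matching suitable for open-set detection.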

Key words: Visual-Language Pre-trained (VLP) model, Faster R-CNN, open-set object detection, traffic object detection
