Large Model-Based Semantic Enhancement for Cross-View Vehicle Re-identification

doi:10.19678/j.issn.1000-3428.0253482

Abstract

In traffic monitoring and public security scenarios, vehicle re-identification (ReID) from ground-view or aerial-view imagery alone often cannot meet the demands of wide-area, complex, multi-scene perception. Ground-view images carry rich visual detail but have a limited field of view and are frequently occluded, whereas aerial views cover wide areas but usually show vehicles at small sizes with little detail, which degrades recognition performance. Fusing ground and aerial viewpoints for cross-view vehicle ReID has therefore become a key research direction for improving large-scale traffic perception. The task faces several challenges: severe scale variation, large cross-view appearance discrepancies, intra-class distances that exceed inter-class distances, and limited cross-scene data. To address them, we propose a large model-based semantic enhancement method for cross-view vehicle re-identification. Built on the CLIP-ReID multimodal framework, the approach first uses the Qwen-VL-Plus multimodal large model to automatically generate fine-grained structured descriptions of vehicle images, and then uses the Qwen-Max large language model to fuse the semantic information from the ground and aerial viewpoints into a unified and stable cross-view semantic representation. This representation is explicitly injected into a two-stage image-text contrastive learning scheme to strengthen the model's domain generalization ability under cross-scene and cross-platform conditions. To support practical deployment and further research, we also construct a cross-view ground-aerial vehicle image dataset covering multiple flight altitudes, acquisition devices, and scene conditions, and design domain-generalization-oriented data splits and evaluation protocols that serve as a new benchmark.
Experimental results show that the proposed method significantly outperforms vision-only baselines on multiple metrics and, in cross-scene domain generalization tests in particular, surpasses state-of-the-art algorithms, validating the effectiveness of semantic enhancement for cross-view vehicle re-identification. The method shows strong application potential and engineering value in intelligent traffic surveillance, UAV-based patrol, and regional security.
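
The image-text contrastive objective mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic CLIP-style symmetric loss in which each image in a batch is paired with exactly one text embedding (for example, an embedding of the fused cross-view description) and every pair belongs to a different vehicle identity. The function names and the temperature value 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_feats, text_feats, temperature=0.07):
    """Similarity matrix: rows index images, columns index texts.

    The temperature value is an assumed default, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of the image-to-text and text-to-image losses."""
    logits = cosine_logits(image_feats, text_feats, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy batch of two vehicles with 2-D embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(images, matched_texts)
loss_swapped = symmetric_contrastive_loss(images, swapped_texts)
```

In this toy batch the correctly paired descriptions give a lower loss than the swapped ones, which is the signal that pulls an image toward the semantic description of its own identity. A real ReID batch usually contains several images per identity, so a practical loss would need to treat all same-identity texts as positives rather than only the diagonal entry.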

摘要： 在交通监测与公共安全场景中，仅依赖地面或空中单一视角的车辆重识别往往难以满足广域、复杂、多场景的识别需求。地面视角虽然图像细节丰富，但视野受限且易受遮挡；空中视角具备大范围监视优势，却常因目标尺寸小、细节不足而造成识别性能下降。因此，融合地空视角开展跨视角车辆重识别，已成为提升大规模交通感知能力的研究热点。然而，该任务同时面临尺度变化剧烈、跨视角外观差异大、类内距离显著大于类间距离以及跨场景数据有限等挑战。为此，本文提出一种面向跨视角车辆重识别的大模型语义增强方法。方法基于CLIP-ReID多模态框架，首先利用Qwen-VL-Plus多模态大模型生成车辆图像的细粒度结构化描述，并借助Qwen-Max语言大模型融合来自地面与空中不同视角的语义信息，形成统一、稳定的跨视角语义表示。随后，将这一语义表示显式注入到两阶段图文对比学习中，以增强模型在跨场景、跨平台条件下的域泛化能力。为推动该方向的工程落地与后续研究，本文还构建了覆盖多种飞行高度、采集设备与场景条件的跨视角地空车辆图像数据集，并设计跨场景域泛化的数据划分与评测方案，为研究者提供新的标准测试基准。实验结果显示，所提方法在多项指标上显著优于纯视觉基线模型，特别是在跨场景域泛化测试中的表现领先于现有先进算法，验证了语义增强在跨视角识别任务中的有效性。该方法在智能交通监控、无人机巡查、区域安防等场景具有良好的应用前景和工程价值。

TANG Zhi-Wen, HU Xing-Chen, HU Yi-Hui, GUO Tian-Xiang, LI Shuo-Hao, HUANG Jin-Cai. Large Model-Based Semantic Enhancement for Cross-View Vehicle Re-identification[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0253482.

唐智文, 胡星辰, 胡意晖, 郭天翔, 李硕豪, 黄金才. 面向跨视角车辆重识别的大模型语义增强方法[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0253482.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0253482

