Computer Engineering

   

Large Model-Based Semantic Enhancement for Cross-View Vehicle Re-identification

  

  • Published: 2026-05-26

Abstract: In traffic monitoring and public security scenarios, vehicle re-identification (ReID) from a single viewpoint, ground or aerial, often cannot meet the demands of wide-area, complex, multi-scene perception. Ground-view images are rich in visual detail but suffer from a limited field of view and frequent occlusion, whereas aerial views cover wide areas but usually show vehicles at small sizes with insufficient detail, which degrades recognition performance. Fusing ground and aerial viewpoints for cross-view vehicle ReID has therefore become a key research direction for improving large-scale traffic perception. The task nevertheless faces several challenges: severe scale variation, large cross-view appearance discrepancies, intra-class distances that exceed inter-class distances, and limited cross-scene data. To address these challenges, we propose a large model-based semantic enhancement method for cross-view vehicle re-identification. Built on the CLIP-ReID multimodal framework, the method first uses the Qwen-VL-Plus multimodal large model to automatically generate fine-grained structured descriptions of vehicle images, and then uses the Qwen-Max large language model to fuse the semantic information from ground and aerial viewpoints into a unified and stable cross-view semantic representation. This representation is then explicitly injected into two-stage image-text contrastive learning to strengthen the model's domain generalization ability under cross-scene and cross-platform conditions. To support practical deployment and follow-up research, we also construct a cross-view ground-aerial vehicle image dataset covering multiple flight altitudes, acquisition devices, and scene conditions, and design cross-scene domain generalization data splits and evaluation protocols that serve as a new benchmark. Experimental results show that the proposed method significantly outperforms purely visual baselines on multiple metrics and, in particular, surpasses state-of-the-art algorithms in cross-scene domain generalization tests, which validates the effectiveness of semantic enhancement for cross-view vehicle re-identification. The method shows strong application potential and engineering value in intelligent traffic surveillance, UAV-based patrol, and regional security.
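
For readers unfamiliar with image-text contrastive learning, the following is a minimal illustrative sketch of a generic symmetric contrastive (InfoNCE-style) objective of the kind used in CLIP-like frameworks. It is not the authors' loss or training code: the function names, the temperature value of 0.07, and the assumption that each image in a batch is paired with exactly one text embedding (for example, an embedding of a fused cross-view description) are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct column."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    Assumption (illustrative only): image_feats[k] and text_feats[k] describe
    the same vehicle identity, and every other pair in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity between image i and text j
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy usage: correctly paired embeddings should yield a lower loss than swapped ones.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = symmetric_contrastive_loss(toy_images, aligned_texts)
swapped_loss = symmetric_contrastive_loss(toy_images, swapped_texts)
```

In this toy setting the loss is small when each image is most similar to its own description and large when the pairing is swapped, which is the behavior a contrastive objective is designed to reward. A real ReID pipeline would typically compute this over encoder outputs in a deep learning framework and would need to handle batches in which several images share one identity.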