
Computer Engineering (计算机工程)



Two-Stage Person Re-identification Method with Collaboration between Large and Small Models

  • Published:2026-04-02


Abstract: Person Re-identification (Re-ID) is frequently challenged by complex factors such as variations in viewpoint, pose, and occlusion. Existing mainstream deep learning methods primarily rely on the statistical similarity of visual features for matching. While these methods perform well in general scenarios, they often lack high-level semantic understanding and logical reasoning mechanisms. Consequently, they struggle to capture fine-grained differences when distinguishing "hard samples" with similar appearances, leading to accuracy bottlenecks. To address these issues, this paper proposes a two-stage Re-ID method featuring a collaboration between small and large models, designed to integrate the efficiency of specialized small models with the robust discriminative power of general Multimodal Large Language Models (MLLMs). The first stage is a rapid recall phase, where a lightweight deep learning model is combined with the K-reciprocal nearest neighbor algorithm to retrieve candidates. This stage filters a small set of highly relevant candidates from the massive gallery, significantly reducing the data scale for subsequent processing while ensuring a high recall rate. The second stage is a precise refinement phase, where a pre-trained MLLM serves as a discriminator to accurately screen the candidate set by leveraging its powerful multimodal understanding capabilities. This collaborative two-stage approach effectively balances inference speed and recognition accuracy. Experimental results on the Market-1501 and DukeMTMC-reID datasets demonstrate that the proposed method achieves Rank-1 accuracies of 98.5% and 96.5%, respectively. These results represent significant improvements of 2.8% and 6.5% over the CLIP-ReID method, fully validating the effectiveness of the proposed approach.
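The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the k-reciprocal test here omits the Jaccard-distance re-ranking refinements of the full K-reciprocal algorithm, and `mllm_verify` is a hypothetical callback standing in for the pre-trained MLLM discriminator of Stage 2.

```python
import numpy as np

def k_reciprocal_candidates(query_feat, gallery_feats, k=5):
    """Stage 1 (rapid recall): keep a gallery item only if the query is
    also among that item's k nearest neighbors -- the k-reciprocal test.
    Distances are cosine distances over L2-normalized features."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    n = len(g)
    forward = np.argsort(1.0 - g @ q)[:k]  # query's k nearest gallery items
    candidates = []
    for idx in forward:
        # Neighbors of gallery item idx within {gallery} ∪ {query};
        # the query is appended last, so its index is n.
        d_back = np.append(1.0 - g @ g[idx], 1.0 - q @ g[idx])
        back_knn = np.argsort(d_back)[:k + 1]  # +1 since idx is its own nearest
        if n in back_knn:                      # reciprocal: query is a neighbor too
            candidates.append(int(idx))
    return candidates

def rerank_with_mllm(query_img, candidate_ids, mllm_verify):
    """Stage 2 (precise refinement): ask an MLLM-style verifier whether each
    recalled candidate depicts the same person as the query image."""
    return [cid for cid in candidate_ids if mllm_verify(query_img, cid)]
```

Because Stage 1 shrinks the gallery to a handful of candidates, the expensive MLLM verifier in Stage 2 runs only a few times per query, which is how the method keeps inference fast while recovering accuracy on hard samples.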