
Computer Engineering (计算机工程)



Two-Stage Person Re-identification Method with Collaboration between Large and Small Models

  • Published:2026-04-02


Abstract: Person Re-identification (Re-ID) is frequently challenged by complex factors such as variations in viewpoint, pose, and occlusion. Existing mainstream deep learning methods primarily rely on the statistical similarity of visual features for matching. While these methods perform well in general scenarios, they often lack high-level semantic understanding and logical reasoning mechanisms. Consequently, they struggle to capture fine-grained differences when distinguishing "hard samples" with similar appearances, leading to accuracy bottlenecks. To address these issues, this paper proposes a two-stage Re-ID method featuring a collaboration between small and large models, designed to integrate the efficiency of specialized small models with the robust discriminative power of general Multimodal Large Language Models (MLLMs). The first stage is a rapid recall phase, where a lightweight deep learning model is combined with the K-reciprocal nearest neighbor algorithm to retrieve candidates. This stage filters a small set of highly relevant candidates from the massive gallery, significantly reducing the data scale for subsequent processing while ensuring a high recall rate. The second stage is a precise refinement phase, where a pre-trained MLLM serves as a discriminator to accurately screen the candidate set by leveraging its powerful multimodal understanding capabilities. This collaborative two-stage approach effectively balances inference speed and recognition accuracy. Experimental results on the Market-1501 and DukeMTMC-reID datasets demonstrate that the proposed method achieves Rank-1 accuracies of 98.5% and 96.5%, respectively. These results represent significant improvements of 2.8% and 6.5% over the CLIP-ReID method, fully validating the effectiveness of the proposed approach.
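The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the k-reciprocal test here omits the Jaccard-distance re-ranking refinements of the full K-reciprocal algorithm, and `mllm_verify` is a hypothetical callback standing in for the pre-trained MLLM discriminator of Stage 2.

```python
import numpy as np

def k_reciprocal_candidates(query_feat, gallery_feats, k=5):
    """Stage 1 (rapid recall): keep a gallery item only if the query is
    also among that item's k nearest neighbors -- the k-reciprocal test.
    Distances are cosine distances over L2-normalized features."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    n = len(g)
    forward = np.argsort(1.0 - g @ q)[:k]  # query's k nearest gallery items
    candidates = []
    for idx in forward:
        # Neighbors of gallery item idx within {gallery} ∪ {query};
        # the query is appended last, so its index is n.
        d_back = np.append(1.0 - g @ g[idx], 1.0 - q @ g[idx])
        back_knn = np.argsort(d_back)[:k + 1]  # +1 since idx is its own nearest
        if n in back_knn:                      # reciprocal: query is a neighbor too
            candidates.append(int(idx))
    return candidates

def rerank_with_mllm(query_img, candidate_ids, mllm_verify):
    """Stage 2 (precise refinement): ask an MLLM-style verifier whether each
    recalled candidate depicts the same person as the query image."""
    return [cid for cid in candidate_ids if mllm_verify(query_img, cid)]
```

Because Stage 1 shrinks the gallery to a handful of candidates, the expensive MLLM verifier in Stage 2 runs only a few times per query, which is how the method keeps inference fast while recovering accuracy on hard samples.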