
Computer Engineering

   

Prompt-based Information Transfer framework for Text-based Person Search

  

  • Published: 2025-06-03


Abstract: In text-based person search, initializing models with parameters from image-text pre-trained models has become the mainstream paradigm, effectively alleviating the feature-alignment bottleneck that single-modal models suffer from the lack of cross-modal information. Existing methods focus on mining semantic features at different scales in the image-text joint embedding space for optimization. However, introducing a new alignment paradigm during fine-tuning easily traps the pre-trained model in a local optimum. To address this issue, this paper proposes a Prompt-based Information Transfer (PIT) framework. By inserting cross-modal prompt tokens into the original forward passes of the single-modal encoders and the cross-modal image-text encoder, PIT promotes early feature fusion and implicitly guides the model to focus on modality-invariant information. PIT comprises a prompt-based contrastive loss and a prompt training strategy. The prompt-based contrastive loss constrains the similarity between image and text features in order to construct a shared embedding space with both intra-modal discrimination and inter-modal semantic consistency. The prompt training strategy can be regarded as a form of self-distillation: pseudo-targets generated from non-prompt features and the ground truth are treated as another view of each image-text pair, supervising the training of the prompt features so that the learned embeddings carry richer multi-modal information. With only 0.61M parameters added on top of full fine-tuning, PIT achieves Rank-1 improvements of 1.48%, 1.50%, and 1.55% on three public datasets, respectively.
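The abstract does not spell out how the prompt tokens and the prompt-based contrastive loss are computed. The sketch below illustrates one plausible reading, assuming learnable prompt tokens prepended to the encoder's input sequence and a standard symmetric InfoNCE-style image-text contrastive objective on the pooled prompt-conditioned features; all function names (`prepend_prompts`, `prompt_contrastive_loss`) and the temperature value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prepend_prompts(tokens, prompts):
    """Prepend shared learnable prompt tokens to a batch of token sequences.

    tokens:  (B, L, D) patch/word embeddings fed to the encoder
    prompts: (P, D) cross-modal prompt tokens (learnable parameters;
             hypothetical placement, assumed from the abstract)
    returns: (B, P + L, D)
    """
    batch = np.broadcast_to(prompts, (tokens.shape[0],) + prompts.shape)
    return np.concatenate([batch, tokens], axis=1)

def cross_entropy(logits, labels):
    # numerically stable log-softmax followed by NLL of the true class
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def prompt_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style image-text contrastive loss on pooled
    prompt-conditioned features: matched pairs lie on the diagonal."""
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(logits.shape[0])
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Pulling matched image-text pairs together while pushing apart all in-batch negatives is what yields intra-modal discrimination and inter-modal consistency in the shared space; the exact loss weighting in PIT may differ.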

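The prompt training strategy is described only at a high level (self-distillation with pseudo-targets built from non-prompt features and the ground truth). As a rough illustration, the sketch below follows a common self-distillation recipe: soft targets are a mixture of the ground-truth one-hot labels and the similarity distribution produced by the non-prompt features, and the prompt branch is trained against them with a soft cross-entropy. The mixing weight `alpha`, the temperature, and all names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_targets(nonprompt_img, nonprompt_txt, labels, alpha=0.4,
                   temperature=0.07):
    """Soft pseudo-targets: mix ground-truth one-hot labels with the
    image-to-text similarity distribution of the non-prompt features.
    Features are assumed L2-normalized; alpha is a hypothetical weight."""
    sim = softmax(nonprompt_img @ nonprompt_txt.T / temperature)  # (B, B)
    onehot = np.eye(sim.shape[1])[labels]
    return alpha * sim + (1.0 - alpha) * onehot

def distill_loss(prompt_logits, targets):
    """Soft cross-entropy between the prompt branch's matching logits
    and the pseudo-targets, supervising the prompt features."""
    shifted = prompt_logits - prompt_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

Because the targets stay soft, the prompt branch inherits the ranking structure of the non-prompt view rather than only the hard labels, which is one way the learned embeddings can end up carrying richer multi-modal information than the non-prompt features alone.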