
Computer Engineering

   

Prompt-based Information Transfer framework for Text-based Person Search

  

  • Published: 2025-06-03


Abstract: In text-based person search, initializing models with parameters from image-text pre-trained models has become the mainstream paradigm, effectively alleviating the feature-alignment bottleneck that single-modal models suffer from the lack of cross-modal information. Existing methods focus on mining semantic features at different scales in the image-text joint embedding space for optimization. However, introducing a new alignment paradigm during fine-tuning easily traps the pre-trained model in a local optimum. To address this issue, this paper proposes a Prompt-based Information Transfer (PIT) framework. By inserting cross-modal prompt tokens into the original forward passes of the single-modal encoders and the cross-modal image-text encoder, PIT promotes early feature fusion and implicitly guides the model to focus on modality-invariant information. PIT comprises a prompt-based contrastive loss and a prompt training strategy. The prompt-based contrastive loss constrains the similarity between image and text features in order to construct a shared embedding space with both intra-modal discrimination and inter-modal semantic consistency. The prompt training strategy can be regarded as a form of self-distillation: pseudo-targets generated from non-prompt features and the ground truth are treated as another view of each image-text pair, supervising the training of the prompt features so that the learned embeddings carry richer multi-modal information. With only 0.61M parameters added on top of full fine-tuning, PIT achieves Rank-1 improvements of 1.48%, 1.50%, and 1.55% on three public datasets, respectively.
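The abstract does not spell out how the prompt tokens and the prompt-based contrastive loss are computed. The sketch below illustrates one plausible reading, assuming learnable prompt tokens prepended to the encoder's input sequence and a standard symmetric InfoNCE-style image-text contrastive objective on the pooled prompt-conditioned features; all function names (`prepend_prompts`, `prompt_contrastive_loss`) and the temperature value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prepend_prompts(tokens, prompts):
    """Prepend shared learnable prompt tokens to a batch of token sequences.

    tokens:  (B, L, D) patch/word embeddings fed to the encoder
    prompts: (P, D) cross-modal prompt tokens (learnable parameters;
             hypothetical placement, assumed from the abstract)
    returns: (B, P + L, D)
    """
    batch = np.broadcast_to(prompts, (tokens.shape[0],) + prompts.shape)
    return np.concatenate([batch, tokens], axis=1)

def cross_entropy(logits, labels):
    # numerically stable log-softmax followed by NLL of the true class
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def prompt_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style image-text contrastive loss on pooled
    prompt-conditioned features: matched pairs lie on the diagonal."""
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(logits.shape[0])
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Pulling matched image-text pairs together while pushing apart all in-batch negatives is what yields intra-modal discrimination and inter-modal consistency in the shared space; the exact loss weighting in PIT may differ.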

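The prompt training strategy is described only at a high level (self-distillation with pseudo-targets built from non-prompt features and the ground truth). As a rough illustration, the sketch below follows a common self-distillation recipe: soft targets are a mixture of the ground-truth one-hot labels and the similarity distribution produced by the non-prompt features, and the prompt branch is trained against them with a soft cross-entropy. The mixing weight `alpha`, the temperature, and all names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_targets(nonprompt_img, nonprompt_txt, labels, alpha=0.4,
                   temperature=0.07):
    """Soft pseudo-targets: mix ground-truth one-hot labels with the
    image-to-text similarity distribution of the non-prompt features.
    Features are assumed L2-normalized; alpha is a hypothetical weight."""
    sim = softmax(nonprompt_img @ nonprompt_txt.T / temperature)  # (B, B)
    onehot = np.eye(sim.shape[1])[labels]
    return alpha * sim + (1.0 - alpha) * onehot

def distill_loss(prompt_logits, targets):
    """Soft cross-entropy between the prompt branch's matching logits
    and the pseudo-targets, supervising the prompt features."""
    shifted = prompt_logits - prompt_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

Because the targets stay soft, the prompt branch inherits the ranking structure of the non-prompt view rather than only the hard labels, which is one way the learned embeddings can end up carrying richer multi-modal information than the non-prompt features alone.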