| 1 | LI S, XIAO T, LI H S, et al. Person search with natural language description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2017: 1970-1979. |
| 2 | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2201.12086. |
| 3 | GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144. doi: 10.1145/3422622 |
| 4 |  |
| 5 | JOSHI V, PETERS M, HOPKINS M. Extending a parser to distant domains using a few dozen partially annotated examples[EB/OL]. [2023-10-07]. http://arxiv.org/abs/1805.06556. |
| 6 | CHEN D P, LI H S, LIU X H, et al. Improving deep visual representation for person re-identification by global and local image-language association. Berlin, Germany: Springer International Publishing, 2018. |
| 7 | DING Z F, DING C X, SHAO Z Y, et al. Semantically self-aligned network for text-to-image part-aware person re-identification[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2107.12666. |
| 8 |  |
| 9 | GAO C Y, CAI G Y, JIANG X Y, et al. Contextual non-local alignment over full-scale representation for text-based person search[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2101.03036. |
| 10 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[EB/OL]. [2023-10-07]. http://arxiv.org/abs/1810.04805. |
| 11 | LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. doi: 10.1109/5.726791 |
| 12 | LI S P, CAO M, ZHANG M. Learning semantic-aligned feature representation for text-based person search[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D.C., USA: IEEE Press, 2022: 2724-2728. |
| 13 |  |
| 14 | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2103.00020. |
| 15 |  |
| 16 | SHARMA P, DING N, GOODMAN S, et al. Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, USA: Association for Computational Linguistics, 2018: 1-10. |
| 17 | ZHU F D, ZHU Y, CHANG X J, et al. Vision-language navigation with self-supervised auxiliary reasoning tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 10009-10019. |
| 18 | MAJUMDAR A, SHRIVASTAVA A, LEE S, et al. Improving vision-and-language navigation with image-text pairs from the Web. Berlin, Germany: Springer International Publishing, 2020. |
| 19 | LECUN Y, BOSER B, DENKER J S, et al. Backpropagation applied to handwritten zip code recognition[J]. Neural Computation, 1989, 1(4): 541-551. doi: 10.1162/neco.1989.1.4.541 |
| 20 |  |
| 21 |  |
| 22 |  |
| 23 | PODELL D, ENGLISH Z, LACEY K, et al. SDXL: improving latent diffusion models for high-resolution image synthesis[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2307.01952. |
| 24 |  |
| 25 |  |
| 26 | PALATUCCI M, POMERLEAU D, HINTON G, et al. Zero-shot learning with semantic output codes[C]//Proceedings of the 22nd International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2009: 1410-1418. |
| 27 |  |
| 28 | KIRSTAIN Y, POLYAK A, SINGER U, et al. Pick-a-pic: an open dataset of user preferences for text-to-image generation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2305.01569. |
| 29 | LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2301.12597. |
| 30 | ZHU A C, WANG Z J, LI Y F, et al. DSSL: deep surroundings-person separation learning for text-based person retrieval[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 209-217. |
| 31 | WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEIT pretraining for all vision and vision-language tasks[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2208.10442. |
| 32 | LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: vision and language representation learning with momentum distillation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2107.07651. |
| 33 | JIANG D, YE M. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2303.12501. |
| 34 | ZHANG L M, RAO A Y, AGRAWALA M. Adding conditional control to text-to-image diffusion models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 3813-3824. |
| 35 |  |