1 |
LI S, XIAO T, LI H S, et al. Person search with natural language description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2017: 1970-1979.
|
2 |
LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2201.12086.
|
3 |
GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks. Communications of the ACM, 2020, 63(11): 139-144.
doi: 10.1145/3422622
|
4 |
|
5 |
JOSHI V, PETERS M, HOPKINS M. Extending a parser to distant domains using a few dozen partially annotated examples[EB/OL]. [2023-10-07]. http://arxiv.org/abs/1805.06556.
|
6 |
CHEN D P, LI H S, LIU X H, et al. Improving deep visual representation for person re-identification by global and local image-language association. Berlin, Germany: Springer International Publishing, 2018.
|
7 |
DING Z F, DING C X, SHAO Z Y, et al. Semantically self-aligned network for text-to-image part-aware person re-identification[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2107.12666.
|
8 |
|
9 |
GAO C Y, CAI G Y, JIANG X Y, et al. Contextual non-local alignment over full-scale representation for text-based person search[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2101.03036.
|
10 |
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[EB/OL]. [2023-10-07]. http://arxiv.org/abs/1810.04805.
|
11 |
LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
doi: 10.1109/5.726791
|
12 |
LI S P, CAO M, ZHANG M. Learning semantic-aligned feature representation for text-based person search[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D.C., USA: IEEE Press, 2022: 2724-2728.
|
13 |
|
14 |
RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2103.00020.
|
15 |
|
16 |
SHARMA P, DING N, GOODMAN S, et al. Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, USA: Association for Computational Linguistics, 2018: 1-10.
|
17 |
ZHU F D, ZHU Y, CHANG X J, et al. Vision-language navigation with self-supervised auxiliary reasoning tasks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 10009-10019.
|
18 |
MAJUMDAR A, SHRIVASTAVA A, LEE S, et al. Improving vision-and-language navigation with image-text pairs from the Web. Berlin, Germany: Springer International Publishing, 2020.
|
19 |
LECUN Y, BOSER B, DENKER J S, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551.
doi: 10.1162/neco.1989.1.4.541
|
20 |
|
21 |
|
22 |
|
23 |
PODELL D, ENGLISH Z, LACEY K, et al. SDXL: improving latent diffusion models for high-resolution image synthesis[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2307.01952.
|
24 |
|
25 |
|
26 |
PALATUCCI M, POMERLEAU D, HINTON G, et al. Zero-shot learning with semantic output codes[C]//Proceedings of the 22nd International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2009: 1410-1418.
|
27 |
|
28 |
KIRSTAIN Y, POLYAK A, SINGER U, et al. Pick-a-pic: an open dataset of user preferences for text-to-image generation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2305.01569.
|
29 |
LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2301.12597.
|
30 |
ZHU A C, WANG Z J, LI Y F, et al. DSSL: deep surroundings-person separation learning for text-based person retrieval[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 209-217.
|
31 |
WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2208.10442.
|
32 |
LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: vision and language representation learning with momentum distillation[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2107.07651.
|
33 |
JIANG D, YE M. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval[EB/OL]. [2023-10-07]. http://arxiv.org/abs/2303.12501.
|
34 |
ZHANG L M, RAO A Y, AGRAWALA M. Adding conditional control to text-to-image diffusion models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 3813-3824.
|
35 |
|