1 |
NGIAM J, KHOSLA A, KIM M, et al. Multimodal deep learning[C]//Proceedings of the 28th International Conference on Machine Learning. Washington D. C., USA: IEEE Press, 2011: 689-696.
|
2 |
SRIVASTAVA N, SALAKHUTDINOV R. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 2014, 15: 2949-2980.
|
3 |
XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. Washington D. C., USA: IEEE Press, 2015: 2048-2057.
|
4 |
LIU Z, LI J G, SHEN Z Q, et al. Learning efficient convolutional networks through network slimming[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2017: 2736-2744.
|
5 |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of NIPS'17. Cambridge, USA: MIT Press, 2017: 5998-6008.
|
6 |
XU P, ZHU X T, CLIFTON D A. Multimodal learning with transformers: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (10): 12113-12132.
doi: 10.1109/TPAMI.2023.3275156
|
7 |
吴志强, 解庆, 李琳, 等. 基于多模态融合的图神经网络推荐算法. 计算机工程, 2024, 50 (1): 91-100.
doi: 10.19678/j.issn.1000-3428.0066929
|
|
WU Z Q, XIE Q, LI L, et al. Graph neural network recommendation algorithm based on multimodal fusion. Computer Engineering, 2024, 50 (1): 91-100.
doi: 10.19678/j.issn.1000-3428.0066929
|
8 |
LI X J, YIN X, LI C Y, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 121-137.
|
9 |
CHEN Y C, LI L J, YU L C, et al. UNITER: universal image-text representation learning[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 104-120.
|
10 |
KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[C]//Proceedings of International Conference on Machine Learning. Washington D. C., USA: IEEE Press, 2021: 5583-5594.
|
11 |
|
12 |
TSAI Y H H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C/OL]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. [2024-02-01]. https://arxiv.org/abs/1906.00295.
|
13 |
LU J S, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[EB/OL]. [2024-02-01]. https://arxiv.org/abs/1908.02265.
|
14 |
LI R L, YANG S, ROSS D A, et al. AI choreographer: music conditioned 3D dance generation with AIST++[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2021: 13401-13412.
|
15 |
|
16 |
|
17 |
NAGRANI A, YANG S, ARNAB A, et al. Attention bottlenecks for multimodal fusion[C]//Proceedings of NIPS'21. Cambridge, USA: MIT Press, 2021: 14200-14213.
|
18 |
WANG Y K, CHEN X H, CAO L L, et al. Multimodal token fusion for vision transformers[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE Press, 2022: 12186-12195.
|
19 |
LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//Proceedings of the 39th International Conference on Machine Learning. Washington D. C., USA: IEEE Press, 2022: 12888-12900.
|
20 |
|
21 |
|
22 |
DESAI K R, JOHNSON J. VirTex: learning visual representations from textual annotations[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2021: 11162-11173.
|
23 |
ZHANG Y, JIANG H, MIURA Y, et al. Contrastive learning of medical visual representations from paired images and text[EB/OL]. [2024-02-01]. http://arxiv.org/abs/2010.00747.
|
24 |
LI J N, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: vision and language representation learning with momentum distillation[C]//Proceedings of NIPS'21. Cambridge, USA: MIT Press, 2021: 9694-9705.
|
25 |
RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of International Conference on Machine Learning. Washington D. C., USA: IEEE Press, 2021: 8748-8763.
|
26 |
GAO P, GENG S J, ZHANG R R, et al. CLIP-adapter: better vision-language models with feature adapters. International Journal of Computer Vision, 2024, 132 (2): 581-595.
doi: 10.1007/s11263-023-01891-x
|
27 |
|
28 |
刘萌, 齐孟津, 詹圳宇, 等. 基于深度学习的图像-文本匹配研究综述. 计算机学报, 2023, 46 (11): 2370-2399.
doi: 10.11897/SP.J.1016.2023.02370
|
|
LIU M, QI M J, ZHAN Z Y, et al. A survey on deep learning based image-text matching. Chinese Journal of Computers, 2023, 46 (11): 2370-2399.
doi: 10.11897/SP.J.1016.2023.02370
|
29 |
|
30 |
WANG J Y, CHAN K C K, LOY C C. Exploring CLIP for assessing the look and feel of images[C]//Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2023: 2555-2563.
|
31 |
|
32 |
|
33 |
GAO Z, LIU J, CHEN S, et al. CLIP2TV: an empirical study on transformer-based methods for video-text retrieval[EB/OL]. [2024-02-01]. http://arxiv.org/abs/2111.05610.
|
34 |
TANG M K, WANG Z Y, LIU Z H, et al. CLIP4Caption: CLIP for video caption[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 4858-4862.
|
35 |
赵宏, 陈志文, 郭岚, 等. 基于ViT与语义引导的视频内容描述生成. 计算机工程, 2023, 49 (5): 247-254.
doi: 10.19678/j.issn.1000-3428.0064409
|
|
ZHAO H, CHEN Z W, GUO L, et al. Video content caption generation based on ViT and semantic guidance. Computer Engineering, 2023, 49 (5): 247-254.
doi: 10.19678/j.issn.1000-3428.0064409
|
36 |
张颖, 张冰冰, 董微, 等. 基于语言-视觉对比学习的多模态视频行为识别方法. 自动化学报, 2024, 50 (2): 417-430.
|
|
ZHANG Y, ZHANG B B, DONG W, et al. Multi-modal video action recognition method based on language-visual contrastive learning. Acta Automatica Sinica, 2024, 50 (2): 417-430.
|
37 |
LI Y Z, ZHANG D, MU Y D. Visual-semantic matching by exploring high-order attention and distraction[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE Press, 2020: 12783-12792.
|