[1] Liu D, Qu X, Hu W. Reducing the vision and language bias for temporal sentence grounding[C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon, Portugal: ACM, 2022: 4092-4101.
[2] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, Louisiana, USA: IEEE / Computer Vision Foundation, 2022: 10684-10695.
[3] Wang W, Lv Q, Yu W, et al. CogVLM: Visual expert for pretrained language models[J]. Advances in Neural Information Processing Systems, 2024, 37: 121475-121499.
[4] 杜鹏飞, 李小勇, 高雅丽. 多模态视觉语言表征学习研究综述[J]. 软件学报, 2021, 32(2): 327-348.
Du Pengfei, Li Xiaoyong, Gao Yali. A survey on multimodal vision-language representation learning[J]. Journal of Software, 2021, 32(2): 327-348.
[5] Xu P, Shao W, Zhang K, et al. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[6] Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[7] 陈光, 郭军. 大语言模型时代的人工智能: 技术内涵、行业应用与挑战[J]. 北京邮电大学学报, 2024, 47(4): 20.
Chen Guang, Guo Jun. Artificial intelligence in the era of large language models: Technical connotations, industry applications, and challenges[J]. Journal of Beijing University of Posts and Telecommunications, 2024, 47(4): 20.
[8] Li Z, Wu Y, Chen Y, et al. Membership inference attacks against large vision-language models[J]. Advances in Neural Information Processing Systems, 2024, 37: 98645-98674.
[9] Li L, Guan H, Qiu J, et al. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, Washington, USA: IEEE / Computer Vision Foundation, 2024: 24408-24419.
[10] 付志远, 陈思宇, 陈骏帆, 等. 大语言模型安全的挑战与机遇[J]. 信息安全学报, 2024, 9(5): 26-55.
Fu Zhiyuan, Chen Siyu, Chen Junfan, et al. Challenges and opportunities of large language model security[J]. Journal of Cyber Security, 2024, 9(5): 26-55.
[11] Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv preprint arXiv:1312.6199, 2013.
[12] Zhao Y, Pang T, Du C, et al. On evaluating adversarial robustness of large vision-language models[J]. Advances in Neural Information Processing Systems, 2023, 36: 54111-54138.
[13] Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint arXiv:1412.6572, 2014.
[14] Kurakin A, Goodfellow I, Bengio S. Adversarial machine learning at scale[J]. arXiv preprint arXiv:1611.01236, 2016.
[15] Carlini N, Wagner D. Towards evaluating the robustness of neural networks[C]//2017 IEEE Symposium on Security and Privacy (SP). San Jose, California, USA: IEEE, 2017: 39-57.
[16] Madry A, Makelov A, Schmidt L, et al. Towards deep learning models resistant to adversarial attacks[J]. arXiv preprint arXiv:1706.06083, 2017.
[17] Chen P Y, Zhang H, Sharma Y, et al. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models[C]//Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. Dallas, Texas, USA: ACM, 2017: 15-26.
[18] Ilyas A, Engstrom L, Athalye A, et al. Black-box adversarial attacks with limited queries and information[C]//International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018: 2137-2146.
[19] Papernot N, McDaniel P, Goodfellow I, et al. Practical black-box attacks against machine learning[C]//Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. Abu Dhabi, United Arab Emirates: ACM, 2017: 506-519.
[20] Xiong Y, Lin J, Zhang M, et al. Stochastic variance reduced ensemble adversarial attack for boosting the adversarial transferability[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, Louisiana, USA: IEEE / Computer Vision Foundation, 2022: 14983-14992.
[21] Chen B, Yin J, Chen S, et al. An adaptive model ensemble adversarial attack for boosting adversarial transferability[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE / Computer Vision Foundation, 2023: 4489-4498.
[22] Zhang J, Huang J, Jin S, et al. Vision-language models for vision tasks: A survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[23] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Learning. Honolulu, Hawaii, USA: PMLR, 2023: 19730-19742.
[24] Zhu D, Chen J, Shen X, et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models[J]. arXiv preprint arXiv:2304.10592, 2023.
[25] Liu H, Li C, Li Y, et al. Improved baselines with visual instruction tuning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, Washington, USA: IEEE / Computer Vision Foundation, 2024: 26296-26306.
[26] Liu H, Li C, Wu Q, et al. Visual instruction tuning[J]. Advances in Neural Information Processing Systems, 2023, 36: 34892-34916.
[27] Dong Y, Chen H, Chen J, et al. How robust is Google's Bard to adversarial image attacks?[J]. arXiv preprint arXiv:2309.11751, 2023.
[28] Carlini N, Athalye A, Papernot N, et al. On evaluating adversarial robustness[J]. arXiv preprint arXiv:1902.06705, 2019.
[29] Wu W, Su Y, Chen X, et al. Boosting the transferability of adversarial samples via attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, Washington, USA: IEEE / Computer Vision Foundation, 2020: 1161-1170.
[30] Bolya D, Hoffman J. Token merging for fast Stable Diffusion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Vancouver, Canada: IEEE / Computer Vision Foundation, 2023: 4599-4603.
[31] Tu H, Cui C, Wang Z, et al. How many unicorns are in this image? A safety evaluation benchmark for vision LLMs[J]. arXiv preprint arXiv:2311.16101, 2023.
[32] Xie P, Bie Y, Mao J, et al. Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, Tennessee, USA: IEEE / Computer Vision Foundation, 2025: 14679-14689.
[33] Zhang J, Ye J, Ma X, et al. AnyAttack: Towards large-scale self-supervised adversarial attacks on vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, Tennessee, USA: IEEE / Computer Vision Foundation, 2025: 19900-19909.