
Computer Engineering ›› 2024, Vol. 50 ›› Issue (10): 1-15. doi: 10.19678/j.issn.1000-3428.0070036

• Hot Topics and Reviews •

Survey of Zero-Shot Transfer Learning Methods Based on Vision-Language Pre-Trained Models

SUN Renke1,2, XU Jinghao1, HUANGFU Zhiyu1, LI Zhongnian1,2, XU Xinzheng1,2,*

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
    2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou 221116, Jiangsu, China
  • Received: 2024-06-25 Online: 2024-10-15 Published: 2024-10-24
  • Contact: XU Xinzheng
  • Supported by: National Natural Science Foundation of China (61976217, 62306320); Natural Science Foundation of Jiangsu Province (BK20231063)

Abstract:

In recent years, remarkable advancements in Artificial Intelligence (AI) across unimodal domains, such as computer vision and Natural Language Processing (NLP), have highlighted the growing importance and necessity of multimodal learning. Among the emerging techniques, the Zero-Shot Transfer (ZST) method, based on vision-language pre-trained models, has garnered widespread attention from researchers worldwide. Owing to the robust generalization capabilities of pre-trained models, leveraging vision-language pre-trained models not only enhances the accuracy of zero-shot recognition tasks but also addresses certain zero-shot downstream tasks that are beyond the scope of conventional approaches. This review provides an overview of ZST methods based on vision-language pre-trained models. First, it introduces conventional approaches to Zero-Shot Learning (ZSL) and summarizes their main forms. It then discusses the distinctions between ZST based on vision-language pre-trained models and conventional ZSL, highlighting the new tasks that ZST can address. Subsequently, it explores the application of ZST methods in various downstream tasks, including recognition, object detection, semantic segmentation, and cross-modal generation. Finally, it analyzes the challenges facing current ZST methods based on vision-language pre-trained models and outlines potential future research directions.
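To make the mechanism the abstract describes concrete, the sketch below shows zero-shot transfer with a vision-language pre-trained model: an image is classified against class names the model was never fine-tuned on, purely via image-text similarity. This is a minimal illustration using the publicly available CLIP checkpoint through the Hugging Face transformers library; the checkpoint name, candidate labels, and image path are illustrative assumptions, not artifacts of the surveyed paper.

```python
# Minimal sketch of zero-shot transfer with a vision-language pre-trained
# model (CLIP via Hugging Face transformers). Checkpoint, labels, and image
# path are illustrative assumptions, not taken from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Unseen class names become text prompts; no task-specific training occurs.
labels = ["cat", "dog", "airplane"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

The same prompt-then-score pattern underlies the zero-shot detection, segmentation, and cross-modal generation applications the review covers, with the text prompts matched against region, pixel, or generation targets instead of whole images.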

Key words: Zero-Shot Learning (ZSL), vision-language pre-trained model, Zero-Shot Transfer (ZST), multimodal, computer vision