
Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 1-14. doi: 10.19678/j.issn.1000-3428.0067514

• Research Hotspots and Reviews •

Survey of Text-based Visual Question Answering

Guide ZHU, Hai HUANG*

  1. School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, Zhejiang, China
  • Received: 2023-04-26 Online: 2024-02-15 Published: 2024-02-21
  • Contact: Hai HUANG
  • Supported by: National Natural Science Foundation of China General Program (No. 62272416)

Abstract:

Traditional Visual Question Answering (VQA) focuses only on the visual object information in an image and ignores the text information that the image contains. Text-based Visual Question Answering (TextVQA) attends to the text information in the image in addition to the visual information, allowing questions to be answered more accurately and efficiently. In recent years, TextVQA has become a research focal point in the multimodal field, and it has important application prospects in text-rich scenarios such as autonomous driving and scene understanding. This paper describes the concept of TextVQA along with its open problems and challenges, and systematically analyzes the TextVQA task in terms of methods, datasets, and future research directions. The analysis centers on existing TextVQA methods, which are summarized into three stages: feature extraction, feature fusion, and answer prediction. According to the method used in the fusion stage, TextVQA approaches are described from three perspectives: simple attention methods, Transformer-based methods, and pre-training-based methods. The advantages and disadvantages of the different methods are summarized, and their performance on public datasets is analyzed and compared. Four common public datasets are introduced, and their characteristics and evaluation metrics are analyzed. Finally, this paper discusses the problems and challenges facing the TextVQA task and outlines promising directions for future research.
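
To make the three-stage pipeline concrete, the sketch below shows one minimal way the stages could fit together in PyTorch. It is illustrative only: the module names, feature dimensions, and the plain classifier head are assumptions, not the design of any surveyed model. Real systems obtain the stage-one features from pretrained object detectors, OCR systems, and word embeddings, and the prediction stage of modern TextVQA models typically augments the fixed answer vocabulary with a dynamic pointer that can copy OCR tokens into the answer.

import torch
import torch.nn as nn

class TextVQASketch(nn.Module):
    """Hypothetical three-stage TextVQA model: extraction -> fusion -> prediction."""

    def __init__(self, vocab_size=5000, d_model=256, num_answers=3000):
        super().__init__()
        # Stage 1: feature extraction (stand-ins for pretrained encoders).
        self.question_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(2048, d_model)  # detector region features
        self.ocr_proj = nn.Linear(300, d_model)      # OCR token features (e.g. word vectors)
        # Stage 2: feature fusion via a Transformer encoder over the
        # concatenated question / visual / OCR token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Stage 3: answer prediction. Only a fixed-vocabulary classifier is
        # sketched; real models add a pointer over OCR tokens so that scene
        # text can appear in the answer.
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, question_ids, visual_feats, ocr_feats):
        q = self.question_embed(question_ids)  # (B, Lq, d)
        v = self.visual_proj(visual_feats)     # (B, Lv, d)
        o = self.ocr_proj(ocr_feats)           # (B, Lo, d)
        tokens = torch.cat([q, v, o], dim=1)   # joint multimodal sequence
        fused = self.fusion(tokens)
        # Pool the fused sequence and score the answer vocabulary.
        return self.classifier(fused.mean(dim=1))

# Usage with random stand-in inputs:
model = TextVQASketch()
logits = model(
    torch.randint(0, 5000, (2, 12)),  # tokenized question
    torch.randn(2, 36, 2048),         # 36 detected visual regions
    torch.randn(2, 20, 300),          # 20 OCR tokens
)
print(logits.shape)  # torch.Size([2, 3000])

The simple attention, Transformer-based, and pre-training families surveyed here differ mainly in how the fusion stage above is realized, while the overall extraction-fusion-prediction structure is shared.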

Key words: Text-based Visual Question Answering (TextVQA), text information, natural language processing, computer vision, multimodal fusion
