
Computer Engineering ›› 2026, Vol. 52 ›› Issue (4): 1-21. doi: 10.19678/j.issn.1000-3428.0260043

• Frontier Perspectives and Reviews •

Review of Document Q&A Driven by Multimodal Retrieval-Augmented Generation (Invited)

LI Zeming, WANG Shuliang, SHANG Zihe, SHENG Ming   

  1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received:2026-01-09 Revised:2026-02-12 Published:2026-04-08

  • About the authors: LI Zeming, male, Ph.D. candidate, research focus: data intelligence; WANG Shuliang (corresponding author), professor, E-mail: slwang2011@bit.edu.cn; SHANG Zihe, undergraduate student; SHENG Ming, Ph.D. candidate.
  • Funding:
    National Natural Science Foundation of China (42371480, 62306033).

Abstract: Traditional Retrieval-Augmented Generation (RAG) methods focus predominantly on pure-text scenarios, where their retrieval and generation mechanisms struggle to model the visual elements, spatial layouts, and structural semantics common in multimodal documents. This limitation restricts their performance on text-image hybrid, long-document, and cross-document reasoning tasks. To address this, Multimodal Retrieval-Augmented Generation (MRAG), which jointly models text, images, and layout structure and incorporates multimodal evidence retrieval and scheduling into the generation process, has developed into a core technical paradigm for Question & Answer (Q&A) and reasoning over visually rich documents. This paper systematically reviews research progress in applying MRAG to document Q&A tasks. First, based on the practical requirements of multimodal document understanding, we analyze the key challenges in implementing MRAG, including multimodal alignment, long-context modeling, evidence traceability, and system robustness. Second, from the perspective of how MRAG systems support the generation process, we compare representative methods along four dimensions: embedding paradigms, document retrieval scope, layout-aware mechanisms, and multimodal retrieval strategies, focusing on how design choices influence generation stability, reasoning accuracy, and system complexity. Third, we summarize the characteristics and limitations of existing multimodal document Q&A datasets and evaluation frameworks, and analyze current constraints on evidence granularity and reasoning explainability.
Finally, we point out that MRAG is evolving from static similarity-matching retrieval mechanisms toward dynamic evidence-planning paradigms centered on generation and reasoning needs, and should continue to improve the reliability and explainability of complex document Q&A systems through collaborative multimodal, multi-granularity modeling.
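The static similarity-matching retrieval that the abstract contrasts with dynamic evidence planning can be illustrated with a minimal sketch. The example below is not from the paper: it substitutes a deterministic bag-of-words hash for a real multimodal encoder (a production MRAG system would embed page text and page images into a shared space with a CLIP-style model), and the page records, field names, and `retrieve` helper are hypothetical.

```python
import numpy as np

def _h(tok: str) -> int:
    # Deterministic toy token hash (stand-in for a learned tokenizer/encoder).
    return sum(ord(c) for c in tok)

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical embedder: hashed bag-of-words, L2-normalized. A real MRAG
    # system would use a multimodal encoder over page text AND page images.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[_h(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical page records: each carries text plus a layout "kind" tag that
# a layout-aware retriever could additionally exploit.
pages = [
    {"id": 0, "text": "table of quarterly revenue figures", "kind": "table"},
    {"id": 1, "text": "bar chart of revenue by region", "kind": "figure"},
    {"id": 2, "text": "introduction and company history", "kind": "text"},
]

def retrieve(query: str, pages: list, k: int = 2) -> list:
    # Static similarity-matching retrieval: score every page by cosine
    # similarity to the query and return the top-k as candidate evidence.
    q = embed(query)
    return sorted(pages, key=lambda p: -float(q @ embed(p["text"])))[:k]

top = retrieve("what was the revenue last quarter?", pages)
print([p["id"] for p in top])  # → [0, 1]: the two revenue-related pages
```

The retrieved top-k pages would then be passed as context to the generator; the dynamic evidence-planning paradigms surveyed in the paper instead let generation and reasoning needs steer which evidence is fetched, at what granularity, and when.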

Key words: multimodal document, Multimodal Retrieval-Augmented Generation (MRAG), document Question & Answer (Q&A), generation-driven retrieval, layout-aware modeling, multimodal reasoning

