
Computer Engineering ›› 2026, Vol. 52 ›› Issue (4): 1-21. doi: 10.19678/j.issn.1000-3428.0260043

• Frontier Perspectives and Reviews •

Review of Document Q&A Driven by Multimodal Retrieval-Augmented Generation (Invited)

LI Zeming, WANG Shuliang, SHANG Zihe, SHENG Ming   

  1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received:2026-01-09 Revised:2026-02-12 Published:2026-04-08

  • About the authors: LI Zeming, male, Ph.D. candidate, research focus: data intelligence; WANG Shuliang (corresponding author), professor, E-mail: slwang2011@bit.edu.cn; SHANG Zihe, undergraduate student; SHENG Ming, Ph.D. candidate.
  • Funding:
    National Natural Science Foundation of China (42371480, 62306033).

Abstract: Traditional Retrieval-Augmented Generation (RAG) methods focus predominantly on pure-text scenarios, where their retrieval and generation mechanisms struggle to model the visual elements, spatial layouts, and structural semantics common in multimodal documents. This limitation restricts their performance on text-image hybrid, long-document, and cross-document reasoning tasks. To address this, Multimodal Retrieval-Augmented Generation (MRAG), which jointly models text, images, and layout structure and incorporates multimodal evidence retrieval and scheduling into the generation process, has developed into a core technical paradigm for Question & Answer (Q&A) and reasoning over visually rich documents. This paper systematically reviews research progress in applying MRAG to document Q&A tasks. First, based on the practical requirements of multimodal document understanding, we analyze the key challenges in implementing MRAG, including multimodal alignment, long-context modeling, evidence traceability, and system robustness. Second, from the perspective of how MRAG systems support the generation process, we compare representative methods along four dimensions: embedding paradigms, document retrieval scope, layout-aware mechanisms, and multimodal retrieval strategies, focusing on how design choices influence generation stability, reasoning accuracy, and system complexity. Third, we summarize the characteristics and limitations of existing multimodal document Q&A datasets and evaluation frameworks, and analyze current constraints on evidence granularity and reasoning explainability.
Finally, we point out that MRAG is evolving from static similarity-matching retrieval mechanisms toward dynamic evidence-planning paradigms centered on generation and reasoning needs, and should continue to improve the reliability and explainability of complex document Q&A systems through collaborative multimodal, multi-granularity modeling.
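The static similarity-matching retrieval that the abstract contrasts with dynamic evidence planning can be illustrated with a minimal sketch. The example below is not from the paper: it substitutes a deterministic bag-of-words hash for a real multimodal encoder (a production MRAG system would embed page text and page images into a shared space with a CLIP-style model), and the page records, field names, and `retrieve` helper are hypothetical.

```python
import numpy as np

def _h(tok: str) -> int:
    # Deterministic toy token hash (stand-in for a learned tokenizer/encoder).
    return sum(ord(c) for c in tok)

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical embedder: hashed bag-of-words, L2-normalized. A real MRAG
    # system would use a multimodal encoder over page text AND page images.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[_h(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical page records: each carries text plus a layout "kind" tag that
# a layout-aware retriever could additionally exploit.
pages = [
    {"id": 0, "text": "table of quarterly revenue figures", "kind": "table"},
    {"id": 1, "text": "bar chart of revenue by region", "kind": "figure"},
    {"id": 2, "text": "introduction and company history", "kind": "text"},
]

def retrieve(query: str, pages: list, k: int = 2) -> list:
    # Static similarity-matching retrieval: score every page by cosine
    # similarity to the query and return the top-k as candidate evidence.
    q = embed(query)
    return sorted(pages, key=lambda p: -float(q @ embed(p["text"])))[:k]

top = retrieve("what was the revenue last quarter?", pages)
print([p["id"] for p in top])  # → [0, 1]: the two revenue-related pages
```

The retrieved top-k pages would then be passed as context to the generator; the dynamic evidence-planning paradigms surveyed in the paper instead let generation and reasoning needs steer which evidence is fetched, at what granularity, and when.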

Key words: multimodal document, Multimodal Retrieval-Augmented Generation (MRAG), document Question & Answer (Q&A), generation-driven retrieval, layout-aware modeling, multimodal reasoning

