[1] LEWIS P, PEREZ E, PIKTUS A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks[J]. Advances in Neural Information Processing Systems, 2020, 33: 9459-9474.
[2] MEI L, MO S Y, YANG Z H, et al. A survey of multimodal retrieval-augmented generation[EB/OL].[2025-12-27]. https://arxiv.org/abs/2504.08748.
[3] MATHEW M, KARATZAS D, JAWAHAR C V. DocVQA: a dataset for VQA on document images[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). Washington D.C., USA: IEEE Press, 2021: 2199-2208.
[4] XU Y H, LI M H, CUI L, et al. LayoutLM: pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM Press, 2020: 1192-1200.
[5] XU Y, XU Y H, LÜ T C, et al. LayoutLMv2: multi-modal pre-training for visually-rich document understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2021: 2579-2591.
[6] KIM G, HONG T, YIM M, et al. OCR-free document understanding Transformer[EB/OL].[2025-12-27]. https://arxiv.org/abs/2111.15664.
[7] APPALARAJU S, JASANI B, KOTA B U, et al. DocFormer: end-to-end Transformer for document understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 1-12.
[8] 李子骏, 肖辉, 李雪峰. 面向知识密集型任务的检索增强生成技术综述[J]. 微电子学与计算机, 2025, 42(10): 48-65. LI Z J, XIAO H, LI X F. Survey on retrieval-augmented generation techniques for knowledge-intensive tasks[J]. Microelectronics & Computer, 2025, 42(10): 48-65. (in Chinese)
[9] WANG Q C, DING R X, CHEN Z H, et al. ViDoRAG: visual document retrieval-augmented generation via dynamic iterative reasoning agents[EB/OL].[2025-12-27]. https://arxiv.org/abs/2502.18017.
[10] HU W B, GU J C, DOU Z Y, et al. MRAG-Bench: vision-centric evaluation for retrieval-augmented multimodal models[EB/OL].[2025-12-27]. https://arxiv.org/abs/2410.08182.
[11] LIU Z H, ZHU X S, ZHOU T S, et al. Benchmarking retrieval-augmented generation in multi-modal contexts[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 4817-4826.
[12] ABOOTORABI M M, ZOBEIRI A, DEHGHANI M, et al. Ask in any modality: a comprehensive survey on multimodal retrieval-augmented generation[EB/OL].[2025-12-27]. https://arxiv.org/abs/2502.08826.
[13] BELTAGY I, PETERS M E, COHAN A. Longformer: the long-document Transformer[EB/OL].[2025-12-27]. https://arxiv.org/abs/2004.05150.
[14] HUANG Y P, LÜ T C, CUI L, et al. LayoutLMv3: pre-training for document AI with unified text and image masking[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 4083-4091.
[15] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text Transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67.
[16] JAUME G, EKENEL H K, THIRAN J P. FUNSD: a dataset for form understanding in noisy scanned documents[C]//Proceedings of the International Conference on Document Analysis and Recognition Workshops (ICDARW). Washington D.C., USA: IEEE Press, 2019: 1-6.
[17] ZHANG X K, SONG D J, CHEN Y X, et al. Topology-aware embedding memory for continual learning on expanding networks[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, USA: ACM Press, 2024: 4326-4337.
[18] LI H P, WEI G C, XU H C, et al. DocPointer: a parameter-efficient pointer network for key information extraction[C]//Proceedings of the 6th ACM International Conference on Multimedia in Asia. New York, USA: ACM Press, 2024: 1-7.
[19] YU Q H, XIAO Z Y, LI B H, et al. MRAMG-Bench: a comprehensive benchmark for advancing multimodal retrieval-augmented multimodal generation[C]//Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2025: 3616-3626.
[20] CHEN Z L, ZHANG P, XU M Y, et al. LocatingGPT: a multi-modal document retrieval method based on retrieval-augmented generation[C]//Proceedings of the IEEE 9th International Conference on Data Science in Cyberspace (DSC). Washington D.C., USA: IEEE Press, 2025: 232-239.
[21] MARTIN R, WALDEN W, KRIZ R, et al. Seeing through the MiRAGE: evaluating multimodal retrieval augmented generation[EB/OL].[2025-12-27]. https://arxiv.org/abs/2510.24870.
[22] SHEN Z X, YU J F, WANG W Y, et al. Global question-aware multimodal retrieval-augmented generation for multimedia multi-hop question answering[C]//Proceedings of the 7th ACM International Conference on Multimedia in Asia. New York, USA: ACM Press, 2025: 1-8.
[23] WANG J H, ASHRAF T, HAN Z Y, et al. MIRA: a novel framework for fusing modalities in medical RAG[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 6307-6315.
[24] YILMAZ R E, TAYSI M A, ÖZMEN A İ, et al. Grounded answer generation over multimodal financial records via semantic indexing[C]//Proceedings of the 10th International Conference on Computer Science and Engineering. Washington D.C., USA: IEEE Press, 2025: 160-165.
[25] MA X G, LIN S C, LI M H, et al. Unifying multimodal retrieval via document screenshot embedding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2406.11251.
[26] YU S, TANG C Y, XU B K, et al. VisRAG: vision-based retrieval-augmented generation on multi-modality documents[EB/OL].[2025-12-27]. https://arxiv.org/abs/2410.10594.
[27] TANAKA R, IKI T, HASEGAWA T, et al. VDocRAG: retrieval-augmented generation over visually-rich documents[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 24827-24837.
[28] TANG Z N, YANG Z Y, WANG G X, et al. Unifying vision, text, and layout for universal document processing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 19254-19264.
[29] YAMANO M, FUKUOKA K, MIYAMORI H. Two-stage approach using pretrained language models for question answering on Japanese document images[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 13791-13796.
[30] MA Y B, LI J S, ZANG Y H, et al. Towards storage-efficient visual document retrieval: an empirical study on reducing patch-level embeddings[EB/OL].[2025-12-27]. https://arxiv.org/abs/2506.04997.
[31] CHEN J, ZHANG R Y, ZHOU Y F, et al. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2411.01106.
[32] CHO J, MAHATA D, IRSOY O, et al. M3DocRAG: multi-modal retrieval is what you need for multi-page multi-document understanding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2411.04952.
[33] JAIN C, WU Y R, ZENG Y F, et al. SimpleDoc: multi-modal document understanding with dual-cue page retrieval and iterative refinement[EB/OL].[2025-12-27]. https://arxiv.org/abs/2506.14035.
[34] ZHAO D F. FRAG: toward federated vector database management for collaborative and secure retrieval-augmented generation[EB/OL].[2025-12-27]. https://arxiv.org/abs/2410.13272.
[35] CHEN C, PETTERSON S, PHILLIPS R L, et al. Toward Graduate Medical Education (GME) accountability: measuring the outcomes of GME institutions[J]. Academic Medicine, 2013, 88(9): 1267-1280.
[36] LIU P, LIU X, YAO R Y, et al. HM-RAG: hierarchical multi-agent multimodal retrieval augmented generation[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 2781-2790.
[37] TIAN Y, LIU F, ZHANG J Y, et al. CoRe-MMRAG: cross-source knowledge reconciliation for multimodal RAG[EB/OL].[2025-12-27]. https://arxiv.org/abs/2506.02544.
[38] SURI M, MATHUR P, DERNONCOURT F, et al. VisDoM: multi-document QA with visually rich elements using multimodal retrieval-augmented generation[C]//Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2025: 6088-6109.
[39] XU M J, WANG Z H, CAI H X, et al. A multi-granularity retrieval framework for visually-rich documents[EB/OL].[2025-12-27]. https://arxiv.org/abs/2505.01457.
[40] WANG Q C, DING R X, ZENG Y, et al. VRAG-RL: empower vision-perception-based RAG for visually rich information understanding via iterative reasoning with reinforcement learning[EB/OL].[2025-12-27]. https://arxiv.org/abs/2505.22019.
[41] YOHANNES H M, MAHMOUD Y, NAZEERUDDIN M, et al. Multimodal Retrieval and Fusion Framework (MRaFF)[C]//Proceedings of the 8th International Conference on Information and Computer Technologies (ICICT). Washington D.C., USA: IEEE Press, 2025: 186-191.
[42] 王合庆, 魏杰, 景红雨, 等. Meta-RAG: 基于元数据驱动的电力领域检索增强生成框架[J]. 计算机工程, 2026, 52(2): 383-392. WANG H Q, WEI J, JING H Y, et al. Meta-RAG: a metadata-driven retrieval-augmented generation framework for the power industry[J]. Computer Engineering, 2026, 52(2): 383-392. (in Chinese)
[43] GAO Y F, XIONG Y, GAO X Y, et al. Retrieval-augmented generation for large language models: a survey[EB/OL].[2025-12-27]. https://arxiv.org/abs/2312.10997.
[44] BENCHAREF R, RAHICHE A, CHERIET M. DIVE-Doc: downscaling foundational image visual encoder into hierarchical architecture for DocVQA[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Washington D.C., USA: IEEE Press, 2026: 7597-7606.
[45] YU W H, CHEN W, QI G Q, et al. BBox DocVQA: a large scale bounding box grounded dataset for enhancing reasoning in document visual question answer[EB/OL].[2025-12-27]. https://arxiv.org/abs/2511.15090.
[46] KARPUKHIN V, OGUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, USA: Association for Computational Linguistics, 2020: 6769-6781.
[47] GAO S S, ZHAO S S, JIANG X, et al. Scaling beyond context: a survey of multimodal retrieval-augmented generation for document understanding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2510.15253.
[48] ZHAO S Y, YANG Y Q, WANG Z L, et al. Retrieval Augmented Generation (RAG) and beyond: a comprehensive survey on how to make your LLMs use external data more wisely[EB/OL].[2025-12-27]. https://arxiv.org/abs/2409.14924.
[49] SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 14974-14983.
[50] CUI L, XU Y, LÜ T, et al. Document AI: benchmarks, models and applications[EB/OL].[2025-12-27]. https://arxiv.org/abs/2111.08609.
[51] WANG D S, RAMAN N, SIBUE M, et al. DocLLM: a layout-aware generative language model for multimodal document understanding[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2024: 8529-8548.
[52] LEE C Y, LI C L, DOZAT T, et al. FormNet: structural encoding beyond sequential modeling in form document information extraction[EB/OL].[2025-12-27]. https://arxiv.org/abs/2203.08411.
[53] FAYSSE M, SIBILLE H, WU T, et al. ColPali: efficient document retrieval with vision language models[EB/OL].[2025-12-27]. https://arxiv.org/abs/2407.01449.
[54] WANG P, BAI S, TAN S N, et al. Qwen2-VL: enhancing vision-language model's perception of the world at any resolution[EB/OL].[2025-12-27]. https://arxiv.org/abs/2409.12191.
[55] BEYER L, STEINER A, PINTO A S, et al. PaliGemma: a versatile 3B VLM for transfer[EB/OL].[2025-12-27]. https://arxiv.org/abs/2407.07726.
[56] WU Q Y, BANSAL G, ZHANG J Y, et al. AutoGen: enabling next-gen LLM applications via multi-agent conversation[EB/OL].[2025-12-27]. https://arxiv.org/abs/2308.08155.
[57] LOCKARD C, SHIRALKAR P, DONG X L, et al. ZeroShotCeres: zero-shot relation extraction from semi-structured webpages[EB/OL].[2025-12-27]. https://arxiv.org/abs/2005.07105.
[58] WANG J P, JIN L W, DING K. LiLT: a simple yet effective language-independent layout Transformer for structured document understanding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2202.13669.
[59] EDGE D, TRINH H, CHENG N, et al. From local to global: a graph RAG approach to query-focused summarization[EB/OL].[2025-12-27]. https://arxiv.org/abs/2404.16130.
[60] NGUYEN T, CHIN P, TAI Y W. MA-RAG: multi-agent retrieval-augmented generation via collaborative chain-of-thought reasoning[EB/OL].[2025-12-27]. https://arxiv.org/abs/2505.20096.
[61] PINEDA V, DAYAN N, DE LARA E. Poster: leveraging geo-spatiality in geo-distributed vector databases[C]//Proceedings of the 10th ACM/IEEE Symposium on Edge Computing. New York, USA: ACM Press, 2025: 1-3.
[62] WAGLE S, MUNIKOTI S, MEYUR R, et al. Leveraging multimodal AI for efficient data discovery in wind energy research[C]//Proceedings of Practice and Experience in Advanced Research Computing 2025: the Power of Collaboration. New York, USA: ACM Press, 2025: 1-3.
[63] MOON J, HONG C. Multimodal clinical decision support for melanoma diagnosis using retrieval-augmented generation and vision-language models[C]//Proceedings of the IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS). Washington D.C., USA: IEEE Press, 2025: 1-6.
[64] KEITA M, HAMIDOUCHE W, EUTAMENE H B, et al. REVEAL: a retrieval-augmented generation approach for contextual identification of synthetic visual content[C]//Proceedings of the 1st Workshop on Deepfake Forensics: Detection, Attribution, Recognition, and Adversarial Challenges in the Era of AI-Generated Media. New York, USA: ACM Press, 2025: 12-20.
[65] ZENG Q X. Retrieval augmented 3D garment generation from single image[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 9648-9656.
[66] MAO J B, ZHENG C F, LIU W L, et al. MGRAG: multimodal grid-aware retrieval augmentation generation framework for power grid work tickets[J]. Pattern Recognition, 2026, 169: 111845.
[67] GONG Z Y, MAI C C, HUANG Y H. MHier-RAG: multi-modal RAG for visual-rich document question-answering via hierarchical and multi-granularity reasoning[EB/OL].[2025-12-27]. https://arxiv.org/abs/2508.00579.
[68] YUAN X, NING L B, FAN W Q, et al. mKG-RAG: multimodal knowledge graph-enhanced RAG for visual question answering[EB/OL].[2025-12-27]. https://arxiv.org/abs/2508.05318.
[69] YU S, TANG C Y, XU B K, et al. VisRAG: vision-based retrieval-augmented generation on multi-modality documents[EB/OL].[2025-12-27]. https://arxiv.org/abs/2410.10594.
[70] YU B H, WU G W, YAO Z Y, et al. Beyond relevance: utility-driven retrieval for visual document question answering[C]//Proceedings of the International Conference on Intelligent Computing. Singapore: Springer, 2025: 382-393.
[71] CHOI Y, PARK J, YOON J, et al. Zero-shot multimodal document retrieval via cross-modal question generation[C]//Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2025: 26068-26083.
[72] KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2020: 39-48.
[73] WANG S N, ZHAO Y J, XIE Y L, et al. Towards reliable vector database management systems: a software testing roadmap for 2030[EB/OL].[2025-12-27]. https://arxiv.org/abs/2502.20812.
[74] XU M J, DONG J H, HOU J, et al. MM-R5: MultiModal reasoning-enhanced ReRanker via reinforcement learning for document retrieval[EB/OL].[2025-12-27]. https://arxiv.org/abs/2506.12364.
[75] ASAI A, WU Z Q, WANG Y Z, et al. Self-RAG: learning to retrieve, generate, and critique through self-reflection[EB/OL].[2025-12-27]. https://arxiv.org/abs/2310.11511.
[76] CHANG Y S, CAO G H, NARANG M, et al. WebQA: multihop and multimodal QA[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 16474-16483.
[77] LERNER P, FERRET O, GUINAUDEAU C, et al. ViQuAE, a dataset for knowledge-based visual question answering about named entities[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2022: 3108-3120.
[78] SINGH H, NASERY A, MEHTA D, et al. MIMOQA: multimodal input multimodal output question answering[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Philadelphia, USA: Association for Computational Linguistics, 2021: 5317-5332.
[79] DU Y X, SONG J R, ZHOU Y F, et al. G2-Reader: dual evolving graphs for multimodal document QA[EB/OL].[2025-12-27]. https://arxiv.org/abs/2601.22055.
[80] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 3190-3199.
[81] SCHWENK D, KHANDELWAL A, CLARK C, et al. A-OKVQA: a benchmark for visual question answering using world knowledge[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2022: 146-162.
[82] SHAH S, MISHRA A, YADATI N, et al. KVQA: knowledge-aware visual question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2019: 8876-8884.
[83] WANG P, WU Q, SHEN C H, et al. FVQA: fact-based visual question answering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(10): 2413-2427.
[84] JAIN A, KOTHYARI M, KUMAR V, et al. Select, substitute, search: a new benchmark for knowledge-augmented visual question answering[C]//Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2021: 2491-2498.
[85] MATHEW M, BAGAL V, TITO R, et al. InfographicVQA[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Washington D.C., USA: IEEE Press, 2022: 2582-2591.
[86] TANAKA R, NISHIDA K, NISHIDA K, et al. SlideVQA: a dataset for document visual question answering on multiple images[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2023: 13636-13645.
[87] MASRY A, LONG D X, TAN J Q, et al. ChartQA: a benchmark for question answering about charts with visual and logical reasoning[C]//Findings of the Association for Computational Linguistics: ACL 2022. Philadelphia, USA: Association for Computational Linguistics, 2022: 2263-2279.
[88] ZHU F B, LEI W Q, FENG F L, et al. Towards complex document understanding by discrete reasoning[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 4857-4866.
[89] VAN LANDEGHEM J, POWALSKI R, TITO R, et al. Document Understanding Dataset and Evaluation (DUDE)[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2024: 19471-19483.
[90] PRAMANICK S, CHELLAPPA R, VENUGOPALAN S. SPIQA: a dataset for multimodal question answering on scientific papers[J]. Advances in Neural Information Processing Systems, 2024, 37: 118807-118833.
[91] TITO R, KARATZAS D, VALVENY E. Hierarchical multimodal Transformers for multipage DocVQA[J]. Pattern Recognition, 2023, 144: 109834.
[92] SINGH A, NATARAJAN V, SHAH M, et al. Towards VQA models that can read[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 8309-8318.
[93] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). Washington D.C., USA: IEEE Press, 2020: 947-952.
[94] CHEN X Y, ZHAO Z H, CHEN L, et al. WebSRC: a dataset for Web-based structural reading comprehension[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2021: 4173-4185.
[95] LI B H, GE Y Y, CHEN Y, et al. SEED-Bench-2-Plus: benchmarking multimodal large language models with text-rich visual comprehension[EB/OL].[2025-12-27]. https://arxiv.org/abs/2404.16790.
[96] GOYAL Y, KHOT T, AGRAWAL A, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[J]. International Journal of Computer Vision, 2019, 127(4): 398-414.
[97] ZELLERS R, BISK Y, FARHADI A, et al. From recognition to cognition: visual commonsense reasoning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 6713-6724.
[98] LIU Y, DUAN H D, ZHANG Y H, et al. MMBench: is your multi-modal model an all-around player?[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2025: 216-233.
[99] LU P, MISHRA S, XIA T, et al. Learn to explain: multimodal reasoning via thought chains for science question answering[EB/OL].[2025-12-27]. https://arxiv.org/abs/2209.09513.
[100] LU P, BANSAL H, XIA T, et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts[EB/OL].[2025-12-27]. https://arxiv.org/abs/2310.02255.
[101] KAHOU S E, MICHALSKI V, ATKINSON A, et al. FigureQA: an annotated figure dataset for visual reasoning[EB/OL].[2025-12-27]. https://arxiv.org/abs/1710.07300.
[102] FU C Y, DAI Y H, LUO Y D, et al. Video-MME: the first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 24108-24118.
[103] YU Z, XU D J, YU J, et al. ActivityNet-QA: a dataset for understanding complex Web videos via question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2019: 9127-9134.
[104] MANGALAM K, AKSHULAKOV R, MALIK J. EgoSchema: a diagnostic benchmark for very long-form video language understanding[EB/OL].[2025-12-27]. https://arxiv.org/abs/2308.09126.
[105] YI D Y, ZHU G B, DING C L, et al. MME-Industry: a cross-industry multimodal evaluation benchmark[EB/OL].[2025-12-27]. https://arxiv.org/abs/2501.16688.
[106] LIU Y L, LI Z, HUANG M X, et al. OCRBench: on the hidden mystery of OCR in large multimodal models[J]. Science China Information Sciences, 2024, 67(12): 220102.
[107] HAN X, LI Z, CAO H, et al. Multimodal spatio-temporal data visualization technologies for contemporary urban landscape architecture: a review and prospect in the context of smart cities[J]. Land, 2025, 14(5): 1069.
[108] RHAIEM M A B, SELMI M, FARAH I R, et al. Leveraging volunteered geographical information and spatio-temporal big data in disaster management: opportunity and challenges[J]. International Journal of Data Science and Analytics, 2025, 21(1): 25.
[109] KUMAR R, BHANU M, MENDES-MOREIRA J, et al. Spatio-temporal predictive modeling techniques for different domains: a survey[J]. ACM Computing Surveys, 2025, 57(2): 1-42.
[110] CAO Y, STEFFEY S, HE J B, et al. Medical image retrieval: a multimodal approach[J]. Cancer Informatics, 2014, 13(3): 125-136.
[111] SHAIK T, TAO X H, LI L, et al. A survey of multimodal information fusion for smart healthcare: mapping the journey from data to wisdom[J]. Information Fusion, 2024, 102: 102040.
[112] CHEN Y, GE X K, YANG S L, et al. A survey on multimodal knowledge graphs: construction, completion and applications[J]. Mathematics, 2023, 11(8): 1815.
[113] WEN J, ZHANG X, RUSH E, et al. Multimodal representation learning for predicting molecule-disease relations[J]. Bioinformatics, 2023, 39(2): btad085.
[114] KREUTZ C K, SCHENKEL R. Scientific paper recommendation systems: a literature review of recent publications[J]. International Journal on Digital Libraries, 2022, 23(4): 335-369.