[1] LEWIS P, PEREZ E, PIKTUS A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks[J]. Advances in Neural Information Processing Systems, 2020, 33: 9459-9474.
[2]
[3] MATHEW M, KARATZAS D, JAWAHAR C V. DocVQA: a dataset for VQA on document images[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). Washington D.C., USA: IEEE Press, 2021: 2199-2208.
[4] XU Y H, LI M H, CUI L, et al. LayoutLM: pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM Press, 2020: 1192-1200.
[5] XU Y, XU Y H, LÜ T C, et al. LayoutLMv2: multi-modal pre-training for visually-rich document understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2021: 2579-2591.
[6]
[7] APPALARAJU S, JASANI B, KOTA B U, et al. DocFormer: end-to-end Transformer for document understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 1-12.
[8] LI Z J, XIAO H, LI X F. Survey on retrieval-augmented generation techniques for knowledge-intensive tasks[J]. Microelectronics & Computer, 2025, 42(10): 48-65. (in Chinese)
[9] WANG Q C, DING R X, CHEN Z H, et al. ViDoRAG: visual document retrieval-augmented generation via dynamic iterative reasoning agents[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2502.18017.
[10] HU W B, GU J C, DOU Z Y, et al. MRAG-Bench: vision-centric evaluation for retrieval-augmented multimodal models[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2410.08182.
[11] LIU Z H, ZHU X S, ZHOU T S, et al. Benchmarking retrieval-augmented generation in multi-modal contexts[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 4817-4826.
[12] ABOOTORABI M M, ZOBEIRI A, DEHGHANI M, et al. Ask in any modality: a comprehensive survey on multimodal retrieval-augmented generation[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2502.08826.
[13]
[14] HUANG Y P, LÜ T C, CUI L, et al. LayoutLMv3: pre-training for document AI with unified text and image masking[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 4083-4091.
[15] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text Transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67.
[16] JAUME G, EKENEL H K, THIRAN J P. FUNSD: a dataset for form understanding in noisy scanned documents[C]//Proceedings of the International Conference on Document Analysis and Recognition Workshops (ICDARW). Washington D.C., USA: IEEE Press, 2019: 1-6.
[17] ZHANG X K, SONG D J, CHEN Y X, et al. Topology-aware embedding memory for continual learning on expanding networks[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, USA: ACM Press, 2024: 4326-4337.
[18] LI H P, WEI G C, XU H C, et al. DocPointer: a parameter-efficient pointer network for key information extraction[C]//Proceedings of the 6th ACM International Conference on Multimedia in Asia. New York, USA: ACM Press, 2024: 1-7.
[19] YU Q H, XIAO Z Y, LI B H, et al. MRAMG-Bench: a comprehensive benchmark for advancing multimodal retrieval-augmented multimodal generation[C]//Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2025: 3616-3626.
[20] CHEN Z L, ZHANG P, XU M Y, et al. LocatingGPT: a multi-modal document retrieval method based on retrieval-augmented generation[C]//Proceedings of the IEEE 9th International Conference on Data Science in Cyberspace (DSC). Washington D.C., USA: IEEE Press, 2025: 232-239.
[21] MARTIN R, WALDEN W, KRIZ R, et al. Seeing through the MiRAGE: evaluating multimodal retrieval augmented generation[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2510.24870.
[22] SHEN Z X, YU J F, WANG W Y, et al. Global question-aware multimodal retrieval-augmented generation for multimedia multi-hop question answering[C]//Proceedings of the 7th ACM International Conference on Multimedia in Asia. New York, USA: ACM Press, 2025: 1-8.
[23] WANG J H, ASHRAF T, HAN Z Y, et al. MIRA: a novel framework for fusing modalities in medical RAG[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 6307-6315.
[24] YILMAZ R E, TAYSI M A, ÖZMEN A İ, et al. Grounded answer generation over multimodal financial records via semantic indexing[C]//Proceedings of the 10th International Conference on Computer Science and Engineering. Washington D.C., USA: IEEE Press, 2025: 160-165.
[25]
[26] YU S, TANG C Y, XU B K, et al. VisRAG: vision-based retrieval-augmented generation on multi-modality documents[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2410.10594.
[27] TANAKA R, IKI T, HASEGAWA T, et al. VDocRAG: retrieval-augmented generation over visually-rich documents[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 24827-24837.
[28] TANG Z N, YANG Z Y, WANG G X, et al. Unifying vision, text, and layout for universal document processing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 19254-19264.
[29] YAMANO M, FUKUOKA K, MIYAMORI H. Two-stage approach using pretrained language models for question answering on Japanese document images[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 13791-13796.
[30] MA Y B, LI J S, ZANG Y H, et al. Towards storage-efficient visual document retrieval: an empirical study on reducing patch-level embeddings[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2506.04997.
[31] CHEN J, ZHANG R Y, ZHOU Y F, et al. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2411.01106.
[32] CHO J, MAHATA D, IRSOY O, et al. M3DocRAG: multi-modal retrieval is what you need for multi-page multi-document understanding[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2411.04952.
[33] JAIN C, WU Y R, ZENG Y F, et al. SimpleDoc: multi-modal document understanding with dual-cue page retrieval and iterative refinement[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2506.14035.
[34] ZHAO D F. FRAG: toward federated vector database management for collaborative and secure retrieval-augmented generation[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2410.13272.
[35] CHEN C, PETTERSON S, PHILLIPS R L, et al. Toward Graduate Medical Education (GME) accountability: measuring the outcomes of GME institutions[J]. Academic Medicine, 2013, 88(9): 1267-1280.
doi: 10.1097/ACM.0b013e31829a3ce9
[36] LIU P, LIU X, YAO R Y, et al. HM-RAG: hierarchical multi-agent multimodal retrieval augmented generation[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 2781-2790.
[37]
[38] SURI M, MATHUR P, DERNONCOURT F, et al. VisDoM: multi-document QA with visually rich elements using multimodal retrieval-augmented generation[C]//Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2025: 6088-6109.
[39]
[40] WANG Q C, DING R X, ZENG Y, et al. VRAG-RL: empower vision-perception-based RAG for visually rich information understanding via iterative reasoning with reinforcement learning[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2505.22019.
[41] YOHANNES H M, MAHMOUD Y, NAZEERUDDIN M, et al. Multimodal Retrieval and Fusion Framework (MRaFF)[C]//Proceedings of the 8th International Conference on Information and Computer Technologies (ICICT). Washington D.C., USA: IEEE Press, 2025: 186-191.
[42] WANG H Q, WEI J, JING H Y, et al. Meta-RAG: a metadata-driven retrieval-augmented generation framework for the power industry[J]. Computer Engineering, 2026, 52(2): 383-392. (in Chinese)
doi: 10.19678/j.issn.1000-3428.0070415
[43]
[44] BENCHAREF R, RAHICHE A, CHERIET M. DIVE-Doc: downscaling foundational image visual encoder into hierarchical architecture for DocVQA[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Washington D.C., USA: IEEE Press, 2026: 7597-7606.
[45] YU W H, CHEN W, QI G Q, et al. BBox DocVQA: a large scale bounding box grounded dataset for enhancing reasoning in document visual question answer[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2511.15090.
[46] KARPUKHIN V, OGUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, USA: Association for Computational Linguistics, 2020: 6769-6781.
[47] GAO S S, ZHAO S S, JIANG X, et al. Scaling beyond context: a survey of multimodal retrieval-augmented generation for document understanding[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2510.15253.
[48] ZHAO S Y, YANG Y Q, WANG Z L, et al. Retrieval Augmented Generation (RAG) and beyond: a comprehensive survey on how to make your LLMs use external data more wisely[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2409.14924.
[49] SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 14974-14983.
[50]
[51] WANG D S, RAMAN N, SIBUE M, et al. DocLLM: a layout-aware generative language model for multimodal document understanding[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2024: 8529-8548.
[52] LEE C Y, LI C L, DOZAT T, et al. FormNet: structural encoding beyond sequential modeling in form document information extraction[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2203.08411.
[53]
[54] WANG P, BAI S, TAN S N, et al. Qwen2-VL: enhancing vision-language model's perception of the world at any resolution[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2409.12191.
[55]
[56] WU Q Y, BANSAL G, ZHANG J Y, et al. AutoGen: enabling next-gen LLM applications via multi-agent conversation[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2308.08155.
[57] LOCKARD C, SHIRALKAR P, DONG X L, et al. ZeroShotCeres: zero-shot relation extraction from semi-structured webpages[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2005.07105.
[58] WANG J P, JIN L W, DING K. LiLT: a simple yet effective language-independent layout Transformer for structured document understanding[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2202.13669.
[59] EDGE D, TRINH H, CHENG N, et al. From local to global: a graph RAG approach to query-focused summarization[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2404.16130.
[60] NGUYEN T, CHIN P, TAI Y W. MA-RAG: multi-agent retrieval-augmented generation via collaborative chain-of-thought reasoning[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2505.20096.
[61] PINEDA V, DAYAN N, DE LARA E. Poster: leveraging geo-spatiality in geo-distributed vector databases[C]//Proceedings of the 10th ACM/IEEE Symposium on Edge Computing. New York, USA: ACM Press, 2025: 1-3.
[62] WAGLE S, MUNIKOTI S, MEYUR R, et al. Leveraging multimodal AI for efficient data discovery in wind energy research[C]//Proceedings of Practice and Experience in Advanced Research Computing 2025: the Power of Collaboration. New York, USA: ACM Press, 2025: 1-3.
[63] MOON J, HONG C. Multimodal clinical decision support for melanoma diagnosis using retrieval-augmented generation and vision-language models[C]//Proceedings of the IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS). Washington D.C., USA: IEEE Press, 2025: 1-6.
[64] KEITA M, HAMIDOUCHE W, EUTAMENE H B, et al. REVEAL: a retrieval-augmented generation approach for contextual identification of synthetic visual content[C]//Proceedings of the 1st Deepfake Forensics Workshop: Detection, Attribution, Recognition, and Adversarial Challenges in the Era of AI-Generated Media. New York, USA: ACM Press, 2025: 12-20.
[65] ZENG Q X. Retrieval augmented 3D garment generation from single image[C]//Proceedings of the 33rd ACM International Conference on Multimedia. New York, USA: ACM Press, 2025: 9648-9656.
[66] MAO J B, ZHENG C F, LIU W L, et al. MGRAG: multimodal grid-aware retrieval augmentation generation framework for power grid work tickets[J]. Pattern Recognition, 2026, 169: 111845.
doi: 10.1016/j.patcog.2025.111845
[67] GONG Z Y, MAI C C, HUANG Y H. MHier-RAG: multi-modal RAG for visual-rich document question-answering via hierarchical and multi-granularity reasoning[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2508.00579.
[68] YUAN X, NING L B, FAN W Q, et al. mKG-RAG: multimodal knowledge graph-enhanced RAG for visual question answering[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2508.05318.
[69]
[70] YU B H, WU G W, YAO Z Y, et al. Beyond relevance: utility-driven retrieval for visual document question answering[C]//Proceedings of International Conference on Intelligent Computing. Singapore: Springer, 2025: 382-393.
[71] CHOI Y, PARK J, YOON J, et al. Zero-shot multimodal document retrieval via cross-modal question generation[C]//Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2025: 26068-26083.
[72] KHATTAB O, ZAHARIA M. ColBERT: efficient and effective passage search via contextualized late interaction over BERT[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2020: 39-48.
[73] WANG S N, ZHAO Y J, XIE Y L, et al. Towards reliable vector database management systems: a software testing roadmap for 2030[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2502.20812.
[74] XU M J, DONG J H, HOU J, et al. MM-R5: MultiModal reasoning-enhanced ReRanker via reinforcement learning for document retrieval[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2506.12364.
[75] ASAI A, WU Z Q, WANG Y Z, et al. Self-RAG: learning to retrieve, generate, and critique through self-reflection[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2310.11511.
[76] CHANG Y S, CAO G H, NARANG M, et al. WebQA: multihop and multimodal QA[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 16474-16483.
[77] LERNER P, FERRET O, GUINAUDEAU C, et al. ViQuAE, a dataset for knowledge-based visual question answering about named entities[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2022: 3108-3120.
[78] SINGH H, NASERY A, MEHTA D, et al. MIMOQA: multimodal input multimodal output question answering[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Philadelphia, USA: Association for Computational Linguistics, 2021: 5317-5332.
[79]
[80] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 3190-3199.
[81] SCHWENK D, KHANDELWAL A, CLARK C, et al. A-OKVQA: a benchmark for visual question answering using world knowledge[C]//Proceedings of the European Conference on Computer Vision (ECCV). Berlin, Germany: Springer, 2022: 146-162.
[82] SHAH S, MISHRA A, YADATI N, et al. KVQA: knowledge-aware visual question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2019: 8876-8884.
[83] WANG P, WU Q, SHEN C H, et al. FVQA: fact-based visual question answering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(10): 2413-2427.
doi: 10.1109/TPAMI.2017.2754246
[84] JAIN A, KOTHYARI M, KUMAR V, et al. Select, substitute, search: a new benchmark for knowledge-augmented visual question answering[C]//Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2021: 2491-2498.
[85] MATHEW M, BAGAL V, TITO R, et al. InfographicVQA[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Washington D.C., USA: IEEE Press, 2022: 2582-2591.
[86] TANAKA R, NISHIDA K, NISHIDA K, et al. SlideVQA: a dataset for document visual question answering on multiple images[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2023: 13636-13645.
[87] MASRY A, LONG D X, TAN J Q, et al. ChartQA: a benchmark for question answering about charts with visual and logical reasoning[C]//Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022. Philadelphia, USA: Association for Computational Linguistics, 2022: 2263-2279.
[88] ZHU F B, LEI W Q, FENG F L, et al. Towards complex document understanding by discrete reasoning[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 4857-4866.
[89] VAN LANDEGHEM J, POWALSKI R, TITO R, et al. Document Understanding Dataset and Evaluation (DUDE)[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 19471-19483.
[90] CHELLAPPA R, PRAMANICK S, VENUGOPALAN S. SPIQA: a dataset for multimodal question answering on scientific papers[C]//Advances in Neural Information Processing Systems. Vancouver, Canada: Neural Information Processing Systems Foundation, 2024: 118807-118833.
[91] TITO R, KARATZAS D, VALVENY E. Hierarchical multimodal Transformers for multipage DocVQA[J]. Pattern Recognition, 2023, 144: 109834.
doi: 10.1016/j.patcog.2023.109834
[92] SINGH A, NATARAJAN V, SHAH M, et al. Towards VQA models that can read[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 8309-8318.
[93] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). Washington D.C., USA: IEEE Press, 2020: 947-952.
[94] CHEN X Y, ZHAO Z H, CHEN L, et al. WebSRC: a dataset for Web-based structural reading comprehension[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2021: 4173-4185.
[95] LI B H, GE Y Y, CHEN Y, et al. SEED-Bench-2-Plus: benchmarking multimodal large language models with text-rich visual comprehension[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2404.16790.
[96] GOYAL Y, KHOT T, AGRAWAL A, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[J]. International Journal of Computer Vision, 2019, 127(4): 398-414.
doi: 10.1007/s11263-018-1116-0
[97] ZELLERS R, BISK Y, FARHADI A, et al. From recognition to cognition: visual commonsense reasoning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 6713-6724.
[98] LIU Y, DUAN H D, ZHANG Y H, et al. MMBench: is your multi-modal model an all-around player?[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2025: 216-233.
[99] LU P, MISHRA S, XIA T, et al. Learn to explain: multimodal reasoning via thought chains for science question answering[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2209.09513.
[100] LU P, BANSAL H, XIA T, et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2310.02255.
[101]
[102] FU C Y, DAI Y H, LUO Y D, et al. Video-MME: the first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 24108-24118.
[103] YU Z, XU D J, YU J, et al. ActivityNet-QA: a dataset for understanding complex Web videos via question answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2019: 9127-9134.
[104] MANGALAM K, AKSHULAKOV R, MALIK J. EgoSchema: a diagnostic benchmark for very long-form video language understanding[EB/OL]. [2025-12-27]. https://arxiv.org/abs/2308.09126.
[105]
[106] LIU Y L, LI Z, HUANG M X, et al. OCRBench: on the hidden mystery of OCR in large multimodal models[J]. Science China Information Sciences, 2024, 67(12): 220102.
doi: 10.1007/s11432-024-4235-6
[107] HAN X, LI Z, CAO H, et al. Multimodal spatio-temporal data visualization technologies for contemporary urban landscape architecture: a review and prospect in the context of smart cities[J]. Land, 2025, 14(5): 1069.
doi: 10.3390/land14051069
[108] RHAIEM M A B, SELMI M, FARAH I R, et al. Leveraging volunteered geographical information and spatio-temporal big data in disaster management: opportunity and challenges[J]. International Journal of Data Science and Analytics, 2025, 21(1): 25.
[109] KUMAR R, BHANU M, MENDES-MOREIRA J, et al. Spatio-temporal predictive modeling techniques for different domains: a survey[J]. ACM Computing Surveys, 2025, 57(2): 1-42.
[110] CAO Y, STEFFEY S, HE J B, et al. Medical image retrieval: a multimodal approach[J]. Cancer Informatics, 2014, 13(3): 125-136.
[111] SHAIK T, TAO X H, LI L, et al. A survey of multimodal information fusion for smart healthcare: mapping the journey from data to wisdom[J]. Information Fusion, 2024, 102: 102040.
doi: 10.1016/j.inffus.2023.102040
[112] CHEN Y, GE X K, YANG S L, et al. A survey on multimodal knowledge graphs: construction, completion and applications[J]. Mathematics, 2023, 11(8): 1815.
doi: 10.3390/math11081815
[113] WEN J, ZHANG X, RUSH E, et al. Multimodal representation learning for predicting molecule-disease relations[J]. Bioinformatics, 2023, 39(2): btad085.
doi: 10.1093/bioinformatics/btad085
[114] KREUTZ C K, SCHENKEL R. Scientific paper recommendation systems: a literature review of recent publications[J]. International Journal on Digital Libraries, 2022, 23(4): 335-369.
doi: 10.1007/s00799-022-00339-w