Review of Multimodal False Information Detection

doi:10.19678/j.issn.1000-3428.0253388

Abstract

In the digital era, complex interactions among text, images, audio, and other modalities have given rise to multimodal misinformation, which spreads faster and is harder to spot than traditional unimodal misinformation and poses serious challenges to information security and social governance. Research on this topic in China remains scarce and has yet to form a complete framework. This paper therefore systematically reviews the state of research and the development of multimodal misinformation detection. After clarifying the core concepts and task taxonomy of the field, it summarizes the characteristics of datasets and evaluation metrics; analyzes the applicable scenarios and detection performance of multimodal methods and models including SAFE, CAFE, CFFN, SSA-MFND, PSCC-Net, DGM4, CCN, SNIFFER, and KGAlign; groups detection methods into three core categories (cross-modal consistency, anomalous feature identification, and external-fact-driven detection); and discusses the interpretability, generalization, and robustness of detection. As large vision-language models (LVLMs) are increasingly applied to multimodal misinformation detection, the paper also reviews their application scenarios, strengths, and limitations in this domain. Finally, it outlines future research directions, with the aim of providing a reference for further work in the field.

HAO Guanyi, SUN Jingchao. Review of Multimodal False Information Detection[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0253388.

