
Computer Engineering


Deepfake Video Detection Method Based on Multi-Scale Spatiotemporal Feature Fusion


  • Published: 2026-03-17


Abstract: The rapid development of deepfake technology in recent years has brought new opportunities in fields such as entertainment and education, but it has also raised serious cybersecurity and privacy concerns. Current deepfake video detection methods face two main challenges. First, encoding artifacts and noise in low-quality, highly compressed videos can mask subtle forgery traces. Second, existing approaches struggle to model temporal inconsistencies between video frames and lack deep fusion of spatiotemporal features. To address these problems, this paper proposes MSST, a detection model based on multi-scale spatiotemporal feature fusion. The method builds a complete framework comprising multi-scale spatial feature extraction, frequency-domain feature enhancement, and multi-scale temporal feature extraction. First, a multi-scale Transformer encoder extracts spatial features at different levels, and a learnable frequency-domain filter enhances the detection of high-frequency forgery traces. In parallel, a multi-scale temporal Transformer models temporal inconsistencies between frames to capture both short- and long-range dynamic anomalies. On this basis, a gated cross-attention module fuses the spatiotemporal features, enabling dynamic cross-modal interaction and producing more discriminative fused representations. Experiments on the FF++ (LQ), Celeb-DF, and DFDC datasets show that MSST achieves ACC scores of 92.73%, 96.61%, and 95.15%, and AUC scores of 0.965, 0.981, and 0.976, respectively. Compared with current mainstream methods, the proposed approach delivers better accuracy and generalization.
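The abstract does not give the equations of the gated cross-attention fusion module, so the following NumPy sketch illustrates one common form of such a module: spatial tokens act as queries over temporal tokens, and a learned sigmoid gate blends the attended temporal information back into the spatial features. All weight shapes, the single-head formulation, and the gating formula `g * S + (1 - g) * A` are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention_fusion(spatial, temporal, Wq, Wk, Wv, Wg):
    """Single-head gated cross-attention (illustrative sketch).

    spatial:  (n, d) spatial-branch token features
    temporal: (m, d) temporal-branch token features
    Wq/Wk/Wv: (d, d) projection weights; Wg: (2d, d) gate weights
    """
    d = Wq.shape[1]
    q = spatial @ Wq                      # queries from the spatial branch
    k = temporal @ Wk                     # keys from the temporal branch
    v = temporal @ Wv                     # values from the temporal branch
    attn = softmax(q @ k.T / np.sqrt(d))  # (n, m) cross-modal attention map
    attended = attn @ v                   # temporal info aligned to spatial tokens
    # Sigmoid gate decides, per feature, how much of each branch to keep.
    gate_in = np.concatenate([spatial, attended], axis=-1) @ Wg
    g = 1.0 / (1.0 + np.exp(-gate_in))
    return g * spatial + (1.0 - g) * attended

# Toy usage with random weights (d = 8, 4 spatial and 6 temporal tokens).
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=(4, d))
T = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Wg = rng.normal(size=(2 * d, d)) * 0.1
fused = gated_cross_attention_fusion(S, T, Wq, Wk, Wv, Wg)
print(fused.shape)  # fused features have the spatial branch's shape: (4, 8)
```

The gate makes the fusion dynamic: where temporal evidence is weak the module can fall back on spatial features alone, which matches the paper's stated goal of adaptive cross-modal interaction.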
