
Computer Engineering


Deepfake Video Detection Method Based on Multi-Scale Spatiotemporal Feature Fusion


  • Published: 2026-03-17


Abstract: The rapid development of deepfake technology in recent years has brought new opportunities in fields such as entertainment and education, but it has also raised serious cybersecurity and privacy concerns. Current deepfake video detection methods face two main challenges. First, encoding artifacts and noise in low-quality, highly compressed videos can mask subtle forgery traces. Second, existing approaches struggle to model temporal inconsistencies between video frames and lack deep fusion of spatiotemporal features. To address these problems, this paper proposes MSST, a detection model based on multi-scale spatiotemporal feature fusion. The method builds a complete framework comprising multi-scale spatial feature extraction, frequency-domain feature enhancement, and multi-scale temporal feature extraction. First, a multi-scale Transformer encoder extracts spatial features at different levels, and a learnable frequency-domain filter enhances the detection of high-frequency forgery traces. In parallel, a multi-scale temporal Transformer models temporal inconsistencies between frames to capture both short- and long-range dynamic anomalies. On this basis, a gated cross-attention module fuses the spatiotemporal features, enabling dynamic cross-modal interaction and producing more discriminative fused representations. Experiments on the FF++ (LQ), Celeb-DF, and DFDC datasets show that MSST achieves ACC scores of 92.73%, 96.61%, and 95.15%, and AUC scores of 0.965, 0.981, and 0.976, respectively. Compared with current mainstream methods, the proposed approach delivers better accuracy and generalization.
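The abstract does not give the equations of the gated cross-attention fusion module, so the following NumPy sketch illustrates one common form of such a module: spatial tokens act as queries over temporal tokens, and a learned sigmoid gate blends the attended temporal information back into the spatial features. All weight shapes, the single-head formulation, and the gating formula `g * S + (1 - g) * A` are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention_fusion(spatial, temporal, Wq, Wk, Wv, Wg):
    """Single-head gated cross-attention (illustrative sketch).

    spatial:  (n, d) spatial-branch token features
    temporal: (m, d) temporal-branch token features
    Wq/Wk/Wv: (d, d) projection weights; Wg: (2d, d) gate weights
    """
    d = Wq.shape[1]
    q = spatial @ Wq                      # queries from the spatial branch
    k = temporal @ Wk                     # keys from the temporal branch
    v = temporal @ Wv                     # values from the temporal branch
    attn = softmax(q @ k.T / np.sqrt(d))  # (n, m) cross-modal attention map
    attended = attn @ v                   # temporal info aligned to spatial tokens
    # Sigmoid gate decides, per feature, how much of each branch to keep.
    gate_in = np.concatenate([spatial, attended], axis=-1) @ Wg
    g = 1.0 / (1.0 + np.exp(-gate_in))
    return g * spatial + (1.0 - g) * attended

# Toy usage with random weights (d = 8, 4 spatial and 6 temporal tokens).
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=(4, d))
T = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Wg = rng.normal(size=(2 * d, d)) * 0.1
fused = gated_cross_attention_fusion(S, T, Wq, Wk, Wv, Wg)
print(fused.shape)  # fused features have the spatial branch's shape: (4, 8)
```

The gate makes the fusion dynamic: where temporal evidence is weak the module can fall back on spatial features alone, which matches the paper's stated goal of adaptive cross-modal interaction.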
