
Computer Engineering, 2024, Vol. 50, Issue (7): 240-250. doi: 10.19678/j.issn.1000-3428.0067789

• Graphics and Image Processing •

Multi-Scale Deepfake Detection Method with Fusion of Spatial Features

Yiwen ZHANG, Manchun CAI*, Yonghao CHEN, Yi ZHU, Lifeng YAO

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2023-06-05  Online: 2024-07-15  Published: 2023-12-19
  • Contact: Manchun CAI
  • Supported by: Special Project of Double First-Class Innovation Research in Cyberspace Security Law Enforcement Technology, People's Public Security University of China (2023SYL07)

Abstract:

With the rapid advancement of deep learning, deepfake technology has quickly emerged as a form of image manipulation based on deep generative models. The proliferation of deepfake videos and images has a detrimental impact on national and social security, making deepfake detection techniques increasingly important. However, existing deepfake detection methods based on Convolutional Neural Networks (CNN) or Vision Transformers (ViT) commonly suffer from large parameter counts, slow training, susceptibility to overfitting, and limited robustness against video compression and noise. To address these problems, a multi-scale deepfake detection method that fuses spatial features is proposed. First, an Automatic White Balance (AWB) algorithm adjusts the contrast of the input images to improve the robustness of the model. Next, a Multi-scale ViT (MViT) and a CNN extract multi-scale global features and local features of the input images, respectively. An improved sparse cross-attention mechanism then fuses the global features from the MViT with the local features from the CNN to improve recognition performance. Finally, the fused features are classified by a Multi-Layer Perceptron (MLP). Experimental results show that the proposed method achieves frame-level Area Under the Curve (AUC) scores of 0.986, 0.984, and 0.988 on the Deepfakes, FaceSwap, and Celeb-DF (v2) datasets, respectively, and exhibits strong robustness in cross-compression-rate experiments. Comparisons of the model before and after the improvements further confirm the contribution of each proposed module to the detection results.
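
The abstract does not specify which AWB variant the authors use. As illustration only, the sketch below shows a common gray-world white-balance pass in Python/NumPy; the function name gray_world_awb and the [0, 1] float input convention are assumptions, not from the paper.

```python
import numpy as np

def gray_world_awb(image: np.ndarray) -> np.ndarray:
    """Gray-world automatic white balance (illustrative sketch).

    Scales each color channel so its mean matches the global mean,
    which normalizes color cast and evens out channel contrast.
    `image` is an H x W x 3 float array with values in [0, 1].
    """
    channel_means = image.reshape(-1, 3).mean(axis=0)  # per-channel mean
    gray_mean = channel_means.mean()                   # target gray level
    gains = gray_mean / (channel_means + 1e-8)         # per-channel gain
    return np.clip(image * gains, 0.0, 1.0)
```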

Key words: deepfake, Convolutional Neural Networks (CNN), feature fusion, cross attention, data augmentation
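
For concreteness, the following is a minimal PyTorch sketch of a two-branch pipeline in the spirit of the abstract: a small CNN stands in for the local-feature branch, a plain transformer encoder stands in for the MViT global branch, and dense nn.MultiheadAttention stands in for the paper's improved sparse cross-attention. All class names (DeepfakeDetector, CrossAttentionFusion), layer sizes, and depths are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses global (transformer) tokens with local (CNN) tokens.

    Stand-in for the paper's improved sparse cross-attention:
    global tokens attend to local tokens via standard (dense)
    multi-head attention, followed by a residual connection.
    """
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_tokens, local_tokens):
        fused, _ = self.attn(query=global_tokens,
                             key=local_tokens,
                             value=local_tokens)
        return self.norm(global_tokens + fused)  # residual + norm

class DeepfakeDetector(nn.Module):
    """Two-branch sketch: CNN for local features, a small transformer
    encoder standing in for MViT global features, cross-attention
    fusion, then an MLP head for real/fake classification."""
    def __init__(self, dim: int = 128):
        super().__init__()
        # Local branch: a tiny CNN producing a downsampled feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Global branch: patch embedding + transformer encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.fusion = CrossAttentionFusion(dim)
        # MLP classifier head over the pooled fused tokens.
        self.head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 2),
        )

    def forward(self, x):                            # x: (B, 3, H, W)
        local = self.cnn(x)                          # (B, D, H/4, W/4)
        local = local.flatten(2).transpose(1, 2)     # (B, N_loc, D)
        glob = self.patch_embed(x).flatten(2).transpose(1, 2)
        glob = self.encoder(glob)                    # (B, N_glob, D)
        fused = self.fusion(glob, local)             # (B, N_glob, D)
        return self.head(fused.mean(dim=1))          # (B, 2) logits

logits = DeepfakeDetector()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])
```

Using the global tokens as queries keeps the fused sequence as short as the global branch's token count, so the fusion cost scales with the (few) global tokens rather than the (many) CNN feature-map positions; the paper's sparse mechanism presumably reduces this cost further.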