Multi-Scale Deepfake Detection Method with Fusion of Spatial Features

doi:10.19678/j.issn.1000-3428.0067789

Abstract

With the rapid advancement of deep learning, deepfake technology, a form of image manipulation based on generative models, has spread quickly. The proliferation of deepfake videos and images has a negative impact on national and social security, which makes deepfake detection increasingly important. Existing detection methods based on Convolutional Neural Networks (CNN) or Vision Transformers (ViT) commonly suffer from large parameter counts, slow training, a tendency to overfit, and poor robustness to video compression and noise. To address these problems, this paper proposes a multi-scale deepfake detection method that fuses spatial features. First, an Automatic White Balance (AWB) algorithm adjusts the contrast of the input images to improve the robustness of the model. Next, a Multi-scale ViT (MViT) and a CNN extract multi-scale global features and local features, respectively, from the input images. An improved sparse cross-attention mechanism then fuses the global and local features to improve recognition performance. Finally, a Multi-Layer Perceptron (MLP) classifies the fused features. Experiments show that the proposed method achieves frame-level Area Under the Curve (AUC) scores of 0.986, 0.984, and 0.988 on the Deepfakes, FaceSwap, and Celeb-DF (v2) datasets, respectively, and remains robust in cross-compression experiments. Comparative experiments before and after each model improvement confirm that every proposed module contributes to the detection results.

Key words: deepfake, Convolutional Neural Networks (CNN), feature fusion, cross attention, data augmentation
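
The abstract reports frame-level AUC, meaning every frame is scored and ranked on its own rather than averaging scores per video. The sketch below illustrates how such a figure can be computed from per-frame scores using the rank interpretation of AUC. It is not the authors' evaluation code: the function name `frame_level_auc`, the label convention (1 = fake, 0 = real), and the sample scores are all assumptions made for illustration.

```python
from typing import Sequence


def frame_level_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random fake frame outscores a random real one.

    labels: 1 for a fake frame, 0 for a real frame (assumed convention).
    scores: the detector's per-frame "fake" score; higher means more likely fake.
    A tie between a fake score and a real score counts as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one fake and one real frame")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Hypothetical scores for six frames pooled from several videos.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = frame_level_auc(labels, scores)
print(f"frame-level AUC = {auc:.3f}")  # prints "frame-level AUC = 0.889"
```

The pairwise loop is quadratic in the number of frames and is written for readability. On a full dataset, a sort-based rank computation or a library routine would be the practical choice.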

摘要：

随着深度学习的快速发展, 深度伪造技术作为一种基于深度学习生成模型的图像篡改技术迅速兴起。深度伪造视频图像的泛滥给国家和社会安全带来了负面影响, 使得深度伪造检测技术的重要性日益凸显。然而, 现有基于卷积神经网络(CNN)或ViT的深度伪造检测技术普遍存在模型参数量大、训练速度慢、容易过拟合、应对视频压缩或噪声的鲁棒性差等问题。为此, 提出一种融合空间特征的多尺度深度伪造检测方法。首先采用自动白平衡(AWB)算法对输入图像进行对比度调整, 以增强模型的鲁棒性; 然后利用MViT和CNN分别提取输入图像的多尺度全局和局部特征; 接着提出一种改进的稀疏交叉注意力机制, 对用MViT提取的全局特征和用CNN提取的局部特征进行融合, 提升模型的识别效果; 最后针对融合后的特征, 通过多层感知机(MLP)进行分类。实验结果表明, 该方法在Deepfakes、FaceSwap和Celeb-DF(v2)数据集上的帧水平AUC分别达到0.986、0.984和0.988, 且在跨压缩率实验中表现出了较强的鲁棒性, 模型改进前后的对比也验证了所提各模块对检测结果的提升作用。

关键词: 深度伪造, 卷积神经网络, 特征融合, 交叉注意力, 数据增强

Yiwen ZHANG, Manchun CAI, Yonghao CHEN, Yi ZHU, Lifeng YAO. Multi-Scale Deepfake Detection Method with Fusion of Spatial Features[J]. Computer Engineering, 2024, 50(7): 240-250.

张溢文, 蔡满春, 陈咏豪, 朱懿, 姚利峰. 融合空间特征的多尺度深度伪造检测方法[J]. 计算机工程, 2024, 50(7): 240-250.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067789

https://www.ecice06.com/EN/Y2024/V50/I7/240

Figures/Tables 13

Fig.1 Overall structure of the model

Fig.2 Example of image enhancement

Fig.3 Multi-head pooling attention

Fig.4 CNN Block and downsampling module

Fig.5 Sparse cross-attention

Fig.6 Comparison of feature fusion methods

Fig.7 ROC curves of different methods on Deepfakes

Fig.8 ROC curves of different methods on FaceSwap

Fig.9 ROC curves of different methods on Celeb-DF(v2)
