| 1 |
PRAJWAL K R, MUKHOPADHYAY R, NAMBOODIRI V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, USA: ACM Press, 2020: 484-492.
|
| 2 |
PRAJWAL K R, MUKHOPADHYAY R, PHILIP J, et al. Towards automatic face-to-face translation[C]//Proceedings of the 27th ACM International Conference on Multimedia. New York, USA: ACM Press, 2019: 1428-1436.
|
| 3 |
MITTAL G, WANG B Y. Animating face using disentangled audio representations[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV). Washington D.C., USA: IEEE Press, 2020: 3290-3298.
|
| 4 |
|
| 5 |
AFOURAS T, CHUNG J S, SENIOR A, et al. Deep audio-visual speech recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 8717-8727.
doi: 10.1109/TPAMI.2018.2889052
|
| 6 |
KARRAS T, LAINE S, AILA T M. A style-based generator architecture for generative adversarial networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2019: 4401-4410.
|
| 7 |
GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
doi: 10.1145/3422622
|
| 8 |
ARJOVSKY M, CHINTALA S, BOTTOU L. Wasserstein generative adversarial networks[C]//Proceedings of the International Conference on Machine Learning. [S. l.]: PMLR, 2017: 214-223.
|
| 9 |
GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2017: 5769-5779.
|
| 10 |
|
| 11 |
张慧妍, 梁勇, 兰景宏, 等. 基于记忆模块与过滤式生成对抗网络的入侵检测方法[J]. 计算机工程, 2024, 50(6): 197-207.
doi: 10.19678/j.issn.1000-3428.0068157
|
|
ZHANG H Y, LIANG Y, LAN J H, et al. Intrusion detection method based on memory module and filtered generative adversarial network[J]. Computer Engineering, 2024, 50(6): 197-207.
doi: 10.19678/j.issn.1000-3428.0068157
|
| 12 |
KARRAS T, LAINE S, AITTALA M, et al. Analyzing and improving the image quality of StyleGAN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 8110-8119.
|
| 13 |
PATASHNIK O, WU Z Z, SHECHTMAN E, et al. StyleCLIP: text-driven manipulation of StyleGAN imagery[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 2065-2074.
|
| 14 |
RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning. [S. l.]: PMLR, 2021: 8748-8763.
|
| 15 |
SUWAJANAKORN S, SEITZ S M, KEMELMACHER-SHLIZERMAN I. Synthesizing Obama: learning lip sync from audio[J]. ACM Transactions on Graphics, 2017, 36(4): 1-13.
|
| 16 |
GUO Y D, CHEN K Y, LIANG S, et al. AD-NeRF: audio driven neural radiance fields for talking head synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 5764-5774.
|
| 17 |
YE Z, JIANG Z, REN Y, et al. GeneFace: generalized and high-fidelity audio-driven 3D talking face synthesis[EB/OL]. [2024-05-11]. https://arxiv.org/abs/2301.13430.
|
| 18 |
LAHIRI A, KWATRA V, FRUEH C, et al. LipSync3D: data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 2755-2764.
|
| 19 |
FRIED O, TEWARI A, ZOLLHÖFER M, et al. Text-based editing of talking-head video[J]. ACM Transactions on Graphics, 2019, 38(4): 1-14.
|
| 20 |
THIES J, ELGHARIB M, TEWARI A, et al. Neural voice puppetry: audio-driven facial reenactment[C]//Proceedings of the 16th European Conference on Computer Vision. Berlin, Germany: Springer International Publishing, 2020: 716-731.
|
| 21 |
|
| 22 |
ZHOU H, LIU Y, LIU Z W, et al. Talking face generation by adversarially disentangled audio-visual representation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2019: 9299-9306.
|
| 23 |
ZHOU H, SUN Y S, WU W, et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 4176-4186.
|
| 24 |
WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision (ECCV). Berlin, Germany: Springer International Publishing, 2018: 3-19.
|
| 25 |
CHUNG J S, ZISSERMAN A. Out of time: automated lip sync in the wild[C]//Proceedings of the Asian Conference on Computer Vision (ACCV). Berlin, Germany: Springer International Publishing, 2016: 251-263.
|
| 26 |
|
| 27 |
WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
doi: 10.1109/TIP.2003.819861
|
| 28 |
PARK S J, KIM M, HONG J, et al. SyncTalkFace: talking face generation with precise lip-syncing via audio-lip memory[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2022: 2062-2070.
|
| 29 |
CHEN L L, MADDOX R K, DUAN Z Y, et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2019: 7824-7833.
|
| 30 |
ZHANG Z M, HU Z P, DENG W J, et al. DINet: deformation inpainting network for realistic face visually dubbing on high resolution video[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2023: 3543-3551.
|