
Computer Engineering, 2026, Vol. 52, Issue (2): 393-403. doi: 10.19678/j.issn.1000-3428.0069992

• Large Models and Generative Artificial Intelligence •

Adversarial Generation of Voice-Controlled Speaking Face Videos Based on Modal Affine Fusion

CHEN Shihang, SUN Yubao

  1. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
  • Received: 2024-06-12  Revised: 2024-09-22  Published: 2024-12-10
  • About the authors: CHEN Shihang (CCF student member), male, M.S. candidate; his main research interest is image generation. SUN Yubao (corresponding author), professor, Ph.D. E-mail: sunyb@nuist.edu.cn
  • Funding:
    National Natural Science Foundation of China (U2001211, 62276139).

Abstract: Generating speaking face videos from speech, which involves processing both the audio and visual modalities, is a current research hotspot. A key challenge is achieving precise alignment between the lip movements in the video and the input audio. To address this problem, this study proposes an end-to-end adversarial model for speech-controlled speaking face video generation, which mainly consists of a modal affine fusion-based generator, a visual quality discriminator, and a lip synchronization discriminator. The generator injects audio information during face feature decoding through the Modal Affine Fusion Block (MAFBlock), effectively fusing the audio and face information so that the audio can better control the generated speaking face video. Spatial and channel attention mechanisms are incorporated to enhance the model's focus on local facial regions. A dual-discriminator strategy improves both visual quality and lip synchronization accuracy: the lip synchronization discriminator constrains lip movements by evaluating the similarity between the audio and the generated lip shapes, providing finer control over lip motion without altering the overall facial contour and details, while the visual quality discriminator assesses the realism of the generated frames to improve image quality. Comparative experiments with several existing representative models are conducted on two audiovisual datasets. On the LRS2 validation set, the proposed model achieves an LSE-C score of 8.128 and an LSE-D score of 6.112, improvements of 4.3% and 4.4% over the baseline, respectively. On the LRS3 validation set, it achieves LSE-C and LSE-D scores of 7.963 and 6.259, representing improvements of 6.2% and 6.9% over the baseline, respectively.
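As a concrete illustration of the fusion mechanism described in the abstract, the following PyTorch sketch shows one way an audio-conditioned affine fusion block with channel and spatial attention could be implemented: the audio embedding predicts per-channel scale and shift parameters that modulate the decoder's visual feature maps. Only the block name MAFBlock comes from the abstract; the layer layout, dimensions, and all other identifiers are illustrative assumptions, not the authors' released code.

# Minimal sketch of an audio-conditioned affine fusion block: the audio
# embedding predicts per-channel scale (gamma) and shift (beta) parameters
# that modulate the visual feature maps during face decoding, followed by
# channel and spatial attention. Only the name MAFBlock is taken from the
# paper's abstract; everything else here is an illustrative assumption.
import torch
import torch.nn as nn


class MAFBlock(nn.Module):
    def __init__(self, vis_channels: int, audio_dim: int, reduction: int = 8):
        super().__init__()
        # Audio embedding -> per-channel affine parameters (gamma, beta).
        self.affine = nn.Linear(audio_dim, 2 * vis_channels)
        # Channel attention (squeeze-and-excitation style re-weighting).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(vis_channels, vis_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(vis_channels // reduction, vis_channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention over pooled channel statistics (CBAM style).
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, vis_feat: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) decoder feature map; audio_emb: (B, audio_dim).
        gamma, beta = self.affine(audio_emb).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        x = (1.0 + gamma) * vis_feat + beta          # audio-conditioned affine modulation
        x = x * self.channel_att(x)                  # re-weight channels
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        x = x * self.spatial_att(torch.cat([avg_map, max_map], dim=1))  # focus on local regions
        return x


if __name__ == "__main__":
    block = MAFBlock(vis_channels=256, audio_dim=512)
    fused = block(torch.randn(2, 256, 24, 24), torch.randn(2, 512))
    print(fused.shape)  # torch.Size([2, 256, 24, 24])

The lip synchronization discriminator referenced in the abstract is typically a SyncNet-style network that embeds an audio window and the corresponding lip crops into a shared space and scores their cosine similarity; the LSE-C and LSE-D metrics reported above are the confidence and distance values computed with such a pretrained network.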

Key words: speaking face generation, video generation, lip synchronization, audio-driven generation, spatial attention, channel attention

CLC Number: