Transformer-based Diffusion for Talking Face Generation

doi:10.19678/j.issn.1000-3428.0260205

Abstract

To address the shortcomings of existing talking face generation models in identity consistency and audio consistency, a Transformer-based diffusion method for talking face generation is proposed. First, to improve identity consistency, a global-local collaborative identity alignment module is designed. The module uses attention pooling to aggregate a global identity representation and introduces a learnable positional encoding matrix to capture local facial geometry, which strengthens the preservation of identity information. Second, to improve audio consistency, a multi-level interleaved feature fusion method based on a diffusion Transformer is proposed: audio and identity features are fused in every Transformer layer, and a multi-stage training strategy is used to make the generated lip movements more natural. Experimental results on the public datasets LRS3 and HDTF show that the proposed model achieves better Sync-C and CSIM scores than existing methods.
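The global-local idea in the abstract can be illustrated with a minimal sketch. It is not the authors' module: the names `attention_pool`, `query`, and `pos_embed` are hypothetical, the "learnable" parameters are plain lists rather than trained weights, and a single-head dot-product attention is assumed. The sketch adds a positional encoding to each local patch token, then pools the tokens into one global identity vector using softmax attention weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query, pos_embed):
    """Pool N local tokens of dimension D into one global D-dim vector.

    tokens:    N x D local facial patch features (assumed input layout)
    query:     D-dim pooling query (hypothetical learnable parameter)
    pos_embed: N x D positional encoding matrix (hypothetical learnable parameter)
    Returns (global_vector, attention_weights, local_tokens).
    """
    dim = len(query)
    # Local branch: each token is tagged with its position in the face grid.
    local_tokens = [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(tokens, pos_embed)
    ]
    # Global branch: scaled dot-product scores between the query and each token.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
        for tok in local_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of tokens gives the aggregated identity representation.
    global_vector = [
        sum(w * tok[d] for w, tok in zip(weights, local_tokens))
        for d in range(dim)
    ]
    return global_vector, weights, local_tokens


# Toy usage: three 2-dim patch tokens, zero positional encoding.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_pos = [[0.0, 0.0] for _ in patches]
identity_vec, attn, local = attention_pool(patches, [0.5, -0.5], zero_pos)
```

In this sketch the pooled vector plays the role of the global identity representation, while the position-tagged tokens stand in for the local geometric features. How the paper actually combines the two is not specified in the abstract.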

SHEN Yixiang, SUN Yongqi, ZHAO Sicong, HU Conggang. Transformer-based Diffusion for Talking Face Generation[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260205.

