
Computer Engineering (计算机工程)


Transformer-based Diffusion for Talking Face Generation

  • Published: 2026-05-21

Abstract: To address the shortcomings of existing talking face generation models in identity consistency and audio consistency, a Transformer-based diffusion method for talking face generation is proposed. First, to improve identity consistency, a global-local collaborative identity alignment module is designed. The module uses attention pooling to aggregate a global identity representation and introduces a learnable positional encoding matrix to capture local facial geometry precisely, which strengthens the preservation of identity information. Second, to improve audio consistency, a multi-level interleaved feature fusion method based on a diffusion Transformer is proposed. Audio and identity features are fused in every Transformer layer, and this fusion is combined with a multi-stage training strategy so that the generated lip movements look more natural. Experimental results on the public datasets LRS3 and HDTF show that, compared with existing methods, the proposed model achieves better results on the Sync-C and CSIM metrics.
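The abstract does not give implementation details, so the following is only a minimal NumPy sketch of what a global-local identity representation of this kind could look like. It assumes a single pooling query vector and an additive positional matrix. The function names (`attention_pool`, `identity_tokens`), the shapes, and the random inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query):
    """Pool N patch tokens (N, d) into one vector (d,) using a query vector.

    Assumption: one query vector (learned in a real model) scores every token.
    """
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)   # (N,) scaled dot-product scores
    weights = softmax(scores)              # (N,) attention weights, sum to 1
    return weights @ tokens, weights       # weighted average of the tokens

def identity_tokens(patch_feats, query, pos_embed):
    """Build a hypothetical identity sequence: 1 global token + N local tokens."""
    global_tok, _ = attention_pool(patch_feats, query)
    # Assumption: the positional matrix (learned in a real model) is added
    # to the patch features to mark each patch's location on the face.
    local_toks = patch_feats + pos_embed
    return np.concatenate([global_tok[None, :], local_toks], axis=0)

rng = np.random.default_rng(0)
N, d = 16, 8                               # illustrative sizes only
patch_feats = rng.normal(size=(N, d))      # stand-in for reference-face features
query = rng.normal(size=d)                 # would be a trained parameter
pos_embed = np.zeros((N, d))               # would be a trained parameter
ident = identity_tokens(patch_feats, query, pos_embed)
print(ident.shape)  # (17, 8)
```

In a diffusion Transformer, a sequence like `ident` would typically serve as conditioning, for example as keys and values in cross-attention inside each layer. How the paper interleaves it with audio features is not specified in the abstract.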