
Computer Engineering

   

Voice-Face Matching Method With Multi-Scale Channel Attention

  

  • Published: 2026-06-24

Voice-Face Matching Method Integrating a Multi-Scale Channel Attention Mechanism

Abstract: Existing voice-face cross-modal matching methods often suffer from limited channel-wise feature discrimination and poor separation of hard identities. To address these issues, this paper proposes an improved voice-face matching framework that incorporates a multi-scale channel attention mechanism and enhanced contrastive learning. Building upon an adaptive identity-weighted center baseline, we design a channel attention module with parallel main, fine, and coarse branches and explicit-implicit statistical fusion, which strengthens the activation of informative channels in both mel-spectrograms and facial feature maps. Furthermore, a bidirectional InfoNCE contrastive loss is introduced and jointly optimized with the original cross-entropy and cross-modal N-pair losses under the guidance of adaptive identity weighting, which further widens the separation of hard identities. Extensive experiments on the dataset of identities shared by VoxCeleb and VGGFace demonstrate that the proposed method consistently outperforms state-of-the-art approaches such as SVHF and DIMNet in cross-modal verification, matching, and retrieval tasks. Compared with the baseline, it achieves AUC gains of 2.1% in voice-to-face verification and 2.4% in face-to-voice verification. In addition, ablation studies confirm the effectiveness and complementarity of the multi-scale channel attention and contrastive learning components.
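The bidirectional InfoNCE term mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following assumes the common symmetric form: cosine similarity between L2-normalized embeddings, a temperature of 0.07, and an equal-weight average of the voice-to-face and face-to-voice directions. The function names are hypothetical, and the adaptive identity weighting described in the abstract is not modeled here.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(queries, keys, temperature=0.07):
    # One direction: query i should match key i; all other keys
    # in the batch act as negatives. temperature is an assumed value.
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        loss -= log_softmax(logits)[i]
    return loss / len(q)

def bidirectional_info_nce(voice, face, temperature=0.07):
    # Assumed symmetric form: average of voice->face and face->voice.
    return 0.5 * (info_nce(voice, face, temperature)
                  + info_nce(face, voice, temperature))

# Toy batch: row i of each list belongs to the same identity.
voice = [[1.0, 0.0], [0.0, 1.0]]
face_aligned = [[0.9, 0.1], [0.1, 0.9]]
face_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = bidirectional_info_nce(voice, face_aligned)
loss_swapped = bidirectional_info_nce(voice, face_swapped)
```

In this toy batch the loss is lower when the face embeddings line up with their matching voice embeddings than when the pairing is swapped, which is the behavior the contrastive term is meant to reward.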

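Abstract (translated from the Chinese): To address the insufficient modeling of feature-channel importance and the weak discrimination of hard samples in existing voice-face cross-modal matching methods, this paper proposes a voice-face matching method that combines a multi-scale channel attention mechanism with a contrastive-learning enhancement strategy. Building on the adaptive identity-weighted center framework, a channel attention module with three parallel scales (main, fine, and coarse) is constructed and combined with explicit and implicit statistical fusion to strengthen the key channel responses of mel-spectrograms and facial feature maps. In addition, a bidirectional InfoNCE contrastive loss is added to the cross-entropy loss and the cross-modal N-pair loss and optimized jointly with adaptive identity weighting, which improves the inter-class separation of hard samples. Extensive experiments on the identity-overlapping VoxCeleb and VGGFace dataset show that the method significantly outperforms mainstream methods such as SVHF and DIMNet in cross-modal verification, matching, and retrieval tasks; relative to the baseline model, AUC improves by 2.1% for voice-to-face verification and by 2.4% for face-to-voice verification. Ablation experiments further confirm the necessity and complementarity of the multi-scale channel attention mechanism and the contrastive learning loss.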
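The general idea of a multi-branch channel attention gate can be sketched as follows. The abstract does not specify the internals of the main, fine, and coarse branches or of the explicit-implicit statistical fusion, so this sketch swaps in a simpler stand-in: per-channel mean and standard deviation are summed into one descriptor, three fixed 1D kernels of different widths are applied across the channel axis, and the averaged result is passed through a sigmoid to gate each channel. The kernel values, the fusion rule, and all function names are assumptions for illustration; in a trained model the kernels would be learned.

```python
import math

def channel_stats(fmap):
    # Per-channel mean and standard deviation of a C x H x W nested list.
    means, stds = [], []
    for ch in fmap:
        vals = [v for row in ch for v in row]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        means.append(mu)
        stds.append(math.sqrt(var))
    return means, stds

def conv1d_same(x, kernel):
    # 1D cross-correlation along the channel axis with zero padding
    # (odd-length kernels only, output length equals input length).
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def multi_scale_channel_attention(fmap, kernels):
    means, stds = channel_stats(fmap)
    # Assumed fusion of the two statistics: a plain sum.
    desc = [m + s for m, s in zip(means, stds)]
    # One branch per kernel width; branches are averaged (assumption).
    branches = [conv1d_same(desc, k) for k in kernels]
    fused = [sum(vals) / len(branches) for vals in zip(*branches)]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in fused]
    # Rescale every value in a channel by that channel's gate.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, gates)]
    return out, gates

# Hypothetical fixed kernels standing in for fine, main, and coarse branches.
kernels = [[1.0], [0.25, 0.5, 0.25], [0.2, 0.2, 0.2, 0.2, 0.2]]
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
]
out, gates = multi_scale_channel_attention(fmap, kernels)
```

Each gate lies strictly between 0 and 1, so the module only rescales channels relative to one another and leaves the spatial layout of the feature map unchanged.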