Voice-Face Matching Method With Multi-Scale Channel Attention

doi:10.19678/j.issn.1000-3428.0260285

Abstract

Abstract: Existing voice-face cross-modal matching methods often suffer from limited channel-wise feature discrimination and poor separation of hard identities. To address these issues, this paper proposes an improved voice-face matching framework that incorporates a multi-scale channel attention mechanism and enhanced contrastive learning. Building upon an adaptive identity-weighted center baseline, we design a channel attention module with parallel main, fine, and coarse branches and explicit-implicit statistical fusion, which strengthens the activation of informative channels in both mel-spectrograms and facial feature maps. Furthermore, a bidirectional InfoNCE contrastive loss is introduced and jointly optimized with the original cross-entropy and cross-modal N-pair losses under the guidance of adaptive identity weighting, which further widens the separation of challenging identities. Extensive experiments on the identity-overlapping VoxCeleb and VGGFace dataset demonstrate that the proposed method consistently outperforms state-of-the-art approaches such as SVHF and DIMNet in cross-modal verification, matching, and retrieval tasks. Compared with the baseline, it achieves AUC gains of 2.1% and 2.4% in voice-to-face and face-to-voice verification, respectively. In addition, ablation studies confirm the effectiveness and complementarity of the multi-scale channel attention and contrastive learning components.

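Abstract: Existing voice-face cross-modal matching methods model the importance of feature channels inadequately and separate hard samples poorly. To address these problems, this paper proposes a voice-face matching method that combines a multi-scale channel attention mechanism with a contrastive learning enhancement strategy. Building on the adaptive identity-weighted center framework, a channel attention module with three parallel scales (main, fine, and coarse) is constructed and combined with explicit and implicit statistical fusion to strengthen the key-channel responses of mel-spectrograms and facial feature maps. In addition, a bidirectional InfoNCE contrastive loss is added to the cross-entropy loss and the cross-modal N-pair loss and jointly optimized with the adaptive identity weights, which increases the inter-class separation of hard samples. Extensive experiments on the identity-overlapping VoxCeleb and VGGFace dataset show that the method significantly outperforms mainstream methods such as SVHF and DIMNet in cross-modal verification, matching, and retrieval tasks; relative to the baseline model, AUC improves by 2.1% for voice-to-face verification and by 2.4% for face-to-voice verification. Ablation experiments further confirm the necessity and complementarity of the multi-scale channel attention mechanism and the contrastive learning loss.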
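The bidirectional InfoNCE term mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the generic symmetric InfoNCE loss over a batch of paired voice and face embeddings, and it omits the adaptive identity weighting, the cross-entropy loss, and the cross-modal N-pair loss that the paper optimizes jointly. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_one_direction(queries, keys, temperature):
    """Mean InfoNCE loss where queries[i] should match keys[i] and no other key."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-softmax probability of the matching (positive) key.
        total += log_denominator - logits[i]
    return total / len(queries)


def bidirectional_info_nce(voice_emb, face_emb, temperature=0.07):
    """Average of the voice-to-face and face-to-voice InfoNCE terms.

    The temperature default is an assumed value, not taken from the paper.
    """
    v = [l2_normalize(x) for x in voice_emb]
    f = [l2_normalize(x) for x in face_emb]
    v2f = info_nce_one_direction(v, f, temperature)
    f2v = info_nce_one_direction(f, v, temperature)
    return 0.5 * (v2f + f2v)


# Toy batch: three identities with 2-D embeddings (hypothetical values).
voices = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_faces = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same faces, but paired with the wrong voices.
shuffled_faces = [aligned_faces[1], aligned_faces[2], aligned_faces[0]]

loss_aligned = bidirectional_info_nce(voices, aligned_faces)
loss_shuffled = bidirectional_info_nce(voices, shuffled_faces)
```

In this toy batch the loss is lower when each voice is paired with the face pointing in nearly the same direction than when the pairs are shuffled, which is the behavior a contrastive term is meant to reward. Averaging the two directions makes the loss symmetric in the two modalities.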

SUN Xiang, ZENG Zhaolong, MA Qiming. Voice-Face Matching Method With Multi-Scale Channel Attention[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260285.

孙翔, 曾昭龙, 马启明. 融合多尺度通道注意力机制的语音-人脸匹配方法[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0260285.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260285

