
Computer Engineering ›› 2025, Vol. 51 ›› Issue (7): 111-118. doi: 10.19678/j.issn.1000-3428.0069055

• Artificial Intelligence and Pattern Recognition •

  • Funding: National Natural Science Foundation of China (72293583)

Optimization of Cross-Resolution Speaker Verification Based on Octuplet Loss

NING Meiling1, QI Jiayin2,*

1. School of Statistics and Information Science, Shanghai University of International Business and Economics, Shanghai 200000, China
    2. College of Cyberspace Security, Guangzhou University, Guangzhou 510000, Guangdong, China
• Received: 2023-12-19 Online: 2025-07-15 Published: 2025-07-14
  • Contact: QI Jiayin


Abstract:

Speaker verification in voiceprint recognition plays a key role in real-world applications such as human-computer interaction, medical diagnosis, and online meetings. Speaker embedding techniques based on Deep Neural Networks (DNNs) are increasingly used in speaker verification tasks. Open-set speaker verification is a multi-class task that essentially amounts to metric learning, and the performance of existing metric learning methods depends heavily on large batches of labeled high-resolution speech samples. To address this problem, this paper proposes a metric learning algorithm whose objective is to minimize intra-class distance. Building on the triplet loss, the algorithm introduces an octuplet loss, which captures the relationship between high-resolution and low-resolution speech through four triplet loss terms, and applies hard-sample mining to select suitable data triplets, improving classification accuracy. To address misclassification caused by low-resolution speech in noisy environments, an online data augmentation strategy is introduced that uses the Room Impulse Response (RIR) and Music, Speech, and Noise (MUSAN) corpora to augment the training data. Using the augmented data and the octuplet loss, the algorithm fine-tunes a pretrained ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) speaker verification model, enabling the network to process low-resolution speech in noisy environments and improving model performance. The method significantly improves cross-resolution speech recognition on multiple datasets without affecting the model's performance on high-resolution speech. The Equal Error Rate (EER) reaches optimal values of 1.20% and 1.61% on the VoxCeleb1 and CN-Celeb1 datasets, respectively.
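The octuplet objective described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between fixed embedding vectors and four triplet terms that pair high-resolution (HR) and low-resolution (LR) anchors, positives, and negatives; the margin value and function names are illustrative, not the paper's exact hyperparameters or API.

```python
import math


def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet(anchor, pos, neg, margin):
    # Standard triplet hinge: push d(anchor, pos) below d(anchor, neg) by margin.
    return max(0.0, euclidean(anchor, pos) - euclidean(anchor, neg) + margin)


def octuplet_loss(a_hr, p_hr, n_hr, a_lr, p_lr, n_lr, margin=0.2):
    """Sum of four triplet terms mixing HR and LR embeddings of the same
    anchor/positive/negative utterances."""
    return (
        triplet(a_hr, p_hr, n_hr, margin)   # HR anchor, HR positive/negative
        + triplet(a_hr, p_lr, n_lr, margin)  # HR anchor, LR positive/negative
        + triplet(a_lr, p_hr, n_hr, margin)  # LR anchor, HR positive/negative
        + triplet(a_lr, p_lr, n_lr, margin)  # LR anchor, LR positive/negative
    )
```

When the HR and LR embeddings of the same speaker already coincide and the negative is farther than the margin, all four terms vanish; any term in which a resolution mismatch pulls same-speaker embeddings apart contributes a positive penalty.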

Key words: speaker verification, speaker embedding, deep metric learning, octuplet loss, triplet loss
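The hard-sample mining mentioned in the abstract is commonly implemented as batch-hard selection: for each anchor in a mini-batch, take the farthest same-speaker embedding as the positive and the closest different-speaker embedding as the negative. The sketch below assumes plain Euclidean distance over already-computed embedding vectors; the function names are illustrative, not the paper's implementation.

```python
def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the hardest positive (farthest same-speaker
    sample) and the hardest negative (closest different-speaker sample),
    returning (anchor, positive, negative) index triplets."""
    triplets = []
    for i, (emb, lab) in enumerate(zip(embeddings, labels)):
        pos = [j for j, l in enumerate(labels) if l == lab and j != i]
        neg = [j for j, l in enumerate(labels) if l != lab]
        if not pos or not neg:
            continue  # an anchor needs at least one positive and one negative
        hardest_pos = max(pos, key=lambda j: euclidean(emb, embeddings[j]))
        hardest_neg = min(neg, key=lambda j: euclidean(emb, embeddings[j]))
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets
```

Selecting the hardest pairs within each batch concentrates the triplet (and octuplet) gradient on the samples the model currently confuses, which is why large labeled batches matter for this family of losses.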