
Computer Engineering ›› 2025, Vol. 51 ›› Issue (7): 111-118. doi: 10.19678/j.issn.1000-3428.0069055

• Artificial Intelligence and Pattern Recognition •

  • Funding: National Natural Science Foundation of China (72293583)

Optimization of Cross-Resolution Speaker Verification Based on Octuplet Loss

NING Meiling1, QI Jiayin2,*

1. School of Statistics and Information Science, Shanghai University of International Business and Economics, Shanghai 200000, China
    2. College of Cyberspace Security, Guangzhou University, Guangzhou 510000, Guangdong, China
• Received: 2023-12-19 Online: 2025-07-15 Published: 2025-07-14
  • Contact: QI Jiayin


Abstract:

Speaker verification in voiceprint recognition plays a key role in real-world applications such as human-computer interaction, medical diagnosis, and online meetings. Speaker embedding techniques based on Deep Neural Networks (DNNs) are increasingly used in speaker verification tasks. Open-set speaker verification is a multi-class task that essentially amounts to metric learning, and the performance of existing metric learning methods depends heavily on large batches of labeled high-resolution speech samples. To address this problem, this paper proposes a metric learning algorithm whose objective is to minimize intra-class distance. Building on the triplet loss, the algorithm introduces an octuplet loss, which captures the relationship between high-resolution and low-resolution speech through four triplet loss terms, and applies hard-sample mining to select suitable data triplets, improving classification accuracy. To address misclassification caused by low-resolution speech in noisy environments, an online data augmentation strategy is introduced that uses the Room Impulse Response (RIR) and Music, Speech, and Noise (MUSAN) corpora to augment the training data. Using the augmented data and the octuplet loss, the algorithm fine-tunes a pretrained ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) speaker verification model, enabling the network to process low-resolution speech in noisy environments and improving model performance. The method significantly improves cross-resolution speech recognition on multiple datasets without affecting the model's performance on high-resolution speech. The Equal Error Rate (EER) reaches optimal values of 1.20% and 1.61% on the VoxCeleb1 and CN-Celeb1 datasets, respectively.
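The octuplet objective described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between fixed embedding vectors and four triplet terms that pair high-resolution (HR) and low-resolution (LR) anchors, positives, and negatives; the margin value and function names are illustrative, not the paper's exact hyperparameters or API.

```python
import math


def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet(anchor, pos, neg, margin):
    # Standard triplet hinge: push d(anchor, pos) below d(anchor, neg) by margin.
    return max(0.0, euclidean(anchor, pos) - euclidean(anchor, neg) + margin)


def octuplet_loss(a_hr, p_hr, n_hr, a_lr, p_lr, n_lr, margin=0.2):
    """Sum of four triplet terms mixing HR and LR embeddings of the same
    anchor/positive/negative utterances."""
    return (
        triplet(a_hr, p_hr, n_hr, margin)   # HR anchor, HR positive/negative
        + triplet(a_hr, p_lr, n_lr, margin)  # HR anchor, LR positive/negative
        + triplet(a_lr, p_hr, n_hr, margin)  # LR anchor, HR positive/negative
        + triplet(a_lr, p_lr, n_lr, margin)  # LR anchor, LR positive/negative
    )
```

When the HR and LR embeddings of the same speaker already coincide and the negative is farther than the margin, all four terms vanish; any term in which a resolution mismatch pulls same-speaker embeddings apart contributes a positive penalty.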

Key words: speaker verification, speaker embedding, deep metric learning, octuplet loss, triplet loss
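The hard-sample mining mentioned in the abstract is commonly implemented as batch-hard selection: for each anchor in a mini-batch, take the farthest same-speaker embedding as the positive and the closest different-speaker embedding as the negative. The sketch below assumes plain Euclidean distance over already-computed embedding vectors; the function names are illustrative, not the paper's implementation.

```python
def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the hardest positive (farthest same-speaker
    sample) and the hardest negative (closest different-speaker sample),
    returning (anchor, positive, negative) index triplets."""
    triplets = []
    for i, (emb, lab) in enumerate(zip(embeddings, labels)):
        pos = [j for j, l in enumerate(labels) if l == lab and j != i]
        neg = [j for j, l in enumerate(labels) if l != lab]
        if not pos or not neg:
            continue  # an anchor needs at least one positive and one negative
        hardest_pos = max(pos, key=lambda j: euclidean(emb, embeddings[j]))
        hardest_neg = min(neg, key=lambda j: euclidean(emb, embeddings[j]))
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets
```

Selecting the hardest pairs within each batch concentrates the triplet (and octuplet) gradient on the samples the model currently confuses, which is why large labeled batches matter for this family of losses.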