
Computer Engineering ›› 2023, Vol. 49 ›› Issue (4): 289-296. doi: 10.19678/j.issn.1000-3428.0064388

• Development Research and Engineering Application •

Speaker Recognition Based on Scale Correlation-Bidirectional Long Short-Term Memory Network Model

CAO Shuxin1, FENG Tengteng1, GE Fengpei2, LIANG Chunyan1

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, Shandong, China;
    2. Library, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2022-04-06    Revised: 2022-05-10    Published: 2022-05-26
  • About the authors: CAO Shuxin (born in 1996), male, M.S. candidate, whose main research interest is speaker recognition; FENG Tengteng, M.S. candidate; GE Fengpei, Ph.D.; LIANG Chunyan, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (No. 11704229).

Abstract: Speaker recognition verifies a speaker's identity from speech. However, most speech exhibits diverse distributions in the time and frequency domains, and current Convolutional Neural Network (CNN)-based deep learning models for speaker recognition generally use a single convolution kernel for feature extraction, which fails to capture scale-correlated and time-frequency domain features. To address this problem, a Scale Correlation CNN-Bidirectional Long Short-Term Memory (SCCNN-BiLSTM) network model is proposed for speaker recognition. The scale correlation CNN adjusts the receptive field size during feature abstraction at each layer and captures the scale feature information produced by the scale correlation blocks, while a BiLSTM network is introduced to retain and learn the multi-scale feature information of the speech data and to extract the contextual information of time-frequency domain features to the greatest extent. Experimental results show that, after 50 000 training iterations, the SCCNN-BiLSTM network model achieves Equal Error Rates (EER) of 7.21% and 6.55% on the LibriSpeech and AISHELL-1 datasets, respectively, corresponding to relative reductions of 25.3% and 41.0% compared with the Residual CNN (ResCNN) baseline network model.
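
The abstract describes the architecture only at a high level, so the following is a minimal, hypothetical PyTorch-style sketch of the general idea: a CNN front end with parallel convolution branches of different kernel sizes (one simple way to vary the receptive field at each layer, standing in for the scale correlation blocks), followed by a BiLSTM over the frame sequence and temporal pooling into a speaker embedding. All names (MultiScaleBlock, SCCNNBiLSTM), layer sizes, kernel sizes, and the 64-bin log-Mel input are illustrative assumptions, not the authors' implementation.

# Hedged sketch (not the authors' code): multi-scale CNN front end + BiLSTM back end,
# assuming log-Mel filterbank input of shape (batch, 1, time, n_mels).
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    """Parallel convolutions with different kernel sizes approximate per-layer
    receptive-field adjustment; the per-scale feature maps are concatenated."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Concatenate scale-specific feature maps along the channel axis.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


class SCCNNBiLSTM(nn.Module):
    """Assumed layout: stacked multi-scale CNN blocks -> BiLSTM over time ->
    temporal average pooling -> fixed-length speaker embedding."""

    def __init__(self, n_mels: int = 64, embed_dim: int = 256, lstm_hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            MultiScaleBlock(1, 16),      # 3 branches -> 48 channels
            nn.MaxPool2d((2, 2)),
            MultiScaleBlock(48, 32),     # 3 branches -> 96 channels
            nn.MaxPool2d((2, 2)),
        )
        lstm_in = 96 * (n_mels // 4)     # channels * pooled frequency bins
        self.bilstm = nn.LSTM(lstm_in, lstm_hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * lstm_hidden, embed_dim)

    def forward(self, x):                # x: (batch, 1, time, n_mels)
        f = self.cnn(x)                  # (batch, C, T', F')
        b, c, t, fr = f.shape
        f = f.permute(0, 2, 1, 3).reshape(b, t, c * fr)  # one feature vector per frame
        h, _ = self.bilstm(f)            # (batch, T', 2 * lstm_hidden)
        return self.proj(h.mean(dim=1))  # utterance-level speaker embedding


if __name__ == "__main__":
    model = SCCNNBiLSTM()
    dummy = torch.randn(2, 1, 200, 64)   # 2 utterances, 200 frames, 64 Mel bins
    print(model(dummy).shape)            # torch.Size([2, 256])

In a speaker verification setup, such embeddings would typically be trained with a classification or metric-learning loss and compared with cosine similarity at test time; those details are likewise not given in the abstract.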

Key words: speaker recognition, deep learning, scale correlation convolution, receptive field, Long Short-Term Memory (LSTM) network
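
As a note on reading the reported numbers, the 25.3% and 41.0% figures are relative EER reductions with respect to the ResCNN baseline, not absolute percentage-point differences. A small illustrative calculation under that interpretation (the baseline EER values themselves are not stated in the abstract, so the results below are implied, not reported):

def implied_baseline_eer(model_eer: float, relative_reduction: float) -> float:
    # If model_eer = baseline_eer * (1 - relative_reduction), recover the implied baseline.
    return model_eer / (1.0 - relative_reduction)

print(implied_baseline_eer(7.21, 0.253))  # ≈ 9.65, implied ResCNN EER (%) on LibriSpeech
print(implied_baseline_eer(6.55, 0.410))  # ≈ 11.10, implied ResCNN EER (%) on AISHELL-1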

CLC Number: