Speaker Recognition Based on Scale Correlation-Bidirectional Long Short-Term Memory Network Model

doi:10.19678/j.issn.1000-3428.0064388

Abstract

Abstract: Speaker recognition identifies speakers from their speech. However, most speech exhibits diverse distributions in the time and frequency domains. In current speaker recognition research, deep learning models based on Convolutional Neural Networks (CNNs) generally use a single convolution kernel for feature extraction and therefore fail to extract scale-related features and time-frequency domain features. To solve this problem, a Scale Correlation CNN-Bidirectional Long Short-Term Memory (SCCNN-BiLSTM) network model is proposed for speaker recognition. The scale correlation CNN adjusts the receptive field size during feature abstraction in each layer to capture scale feature information composed of scale correlation blocks. In addition, a BiLSTM network is introduced to retain and learn the multi-scale feature information of the speech data and to extract as much context information from the time-frequency domain features as possible. Experimental results show that, after 50 000 iterations, the SCCNN-BiLSTM network model achieves an Equal Error Rate (EER) of 7.21% on the LibriSpeech dataset and 6.55% on the AISHELL-1 dataset. Compared with the Residual CNN (ResCNN) baseline network model, these results are improvements of 25.3% and 41.0%, respectively.

Key words: speaker recognition, deep learning, scale correlation convolution, receptive field, Long Short-Term Memory (LSTM) network
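
The Equal Error Rate reported in the abstract is the operating point at which the false acceptance rate and the false rejection rate of a verification system are equal. The sketch below shows one simple way to estimate it from trial scores. It is an illustration only, not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions introduced here, and it assumes that a higher score means the two utterances are more likely to come from the same speaker.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER and its threshold from verification trial scores.

    Hypothetical helper for illustration. Assumes higher scores indicate
    a same-speaker (target) trial.
    """
    # Candidate thresholds: every distinct score that was observed.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the first threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for this example, not taken from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite trial list the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the threshold where they are closest. Evaluation toolkits often interpolate between neighbouring thresholds instead.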

CLC Number: TP391.42

CAO Shuxin, FENG Tengteng, GE Fengpei, LIANG Chunyan. Speaker Recognition Based on Scale Correlation-Bidirectional Long Short-Term Memory Network Model[J]. Computer Engineering, 2023, 49(4): 289-296.

曹书鑫, 冯藤藤, 葛凤培, 梁春燕. 基于尺度相关‐双向长短期记忆网络模型的说话人识别[J]. 计算机工程, 2023, 49(4): 289-296.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0064388

http://www.ecice06.com/EN/Y2023/V49/I4/289
