Fundamental Frequency Extraction Model Using Convolutional Neural Networks with Non-local Modules

doi:10.19678/j.issn.1000-3428.0063987

Abstract

Abstract: Estimating the fundamental frequency, or pitch, is a key sub-problem in many speech signal processing techniques. Recent studies mostly take a data-driven approach, extracting the fundamental frequency with a Convolutional Neural Network (CNN). However, a convolution operation processes only a local neighborhood of audio sample points at a time, so global dependencies between sample points can be captured only by applying convolutions repeatedly, which is computationally inefficient and makes optimization difficult. Inspired by the strong performance of non-local modules in computer vision tasks, this study proposes a CNN with non-local modules for fundamental frequency extraction. Unlike a stack of convolutional layers, a non-local module computes the relationship between any two positions directly, regardless of the distance between them, and can therefore capture long-range dependencies quickly. For pitch estimation, non-local modules added to a CNN compute the similarity between all audio sample points in each frame, which helps capture global frame-to-frame and sample-to-sample dependencies at a slight increase in computational complexity. Moreover, a non-local module leaves the input and output dimensions unchanged, so it can be easily integrated into a CNN. Experimental results show that the Mean Absolute Error (MAE) of the proposed method is only 4.7, at least 0.7 lower than that of the baseline models, achieving state-of-the-art performance.

Keywords: fundamental frequency, speech signal processing, data-driven, Convolutional Neural Network (CNN), non-local modules
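
The abstract does not state the exact form of the non-local module, so the following is only a minimal sketch of the idea it describes: every position in a frame is compared with every other position, and the output keeps the input's dimensions. It assumes the embedded-Gaussian (softmax) variant known from the non-local neural networks literature, applied to a one-dimensional sequence. The function names, the sizes `T`, `C`, and `D`, and the random weights are hypothetical; in a trained model the projection weights would be learned rather than sampled.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local operation on x of shape (T, C).

    Assumed form (not confirmed by the abstract):
        out = x + softmax(theta(x) @ phi(x)^T) @ g(x) @ w_z
    Returns the output (T, C) and the (T, T) attention map.
    """
    theta = matmul(x, w_theta)                 # (T, D) projection of each position
    phi = matmul(x, w_phi)                     # (T, D) projection compared against
    g = matmul(x, w_g)                         # (T, D) features to aggregate
    phi_t = [list(col) for col in zip(*phi)]   # (D, T) transpose
    scores = matmul(theta, phi_t)              # (T, T) similarity of every pair
    attention = [softmax(r) for r in scores]   # each row sums to 1
    y = matmul(attention, g)                   # (T, D) weighted sum over all positions
    z = matmul(y, w_z)                         # (T, C) back to the input width
    # Residual connection: the output has the same shape as the input.
    out = [[zi + xi for zi, xi in zip(zr, xr)] for zr, xr in zip(z, x)]
    return out, attention


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or input features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, C, D = 6, 4, 2  # hypothetical: 6 positions, 4 channels, 2 embedding dims
x = random_matrix(T, C, rng)
w_theta, w_phi, w_g = (random_matrix(C, D, rng) for _ in range(3))
w_z = random_matrix(D, C, rng)

out, attention = non_local_block(x, w_theta, w_phi, w_g, w_z)
print(len(out), len(out[0]))  # prints "6 4", the same shape as x
```

In this sketch, setting `w_z` to all zeros makes the block return its input unchanged, which illustrates why a module of this kind can be inserted between existing convolutional layers without altering their shapes. The `(T, T)` attention map is also where the extra computational cost mentioned in the abstract would come from.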

CLC number: TP183

LIU Jingjing, HUANG Hao. Fundamental Frequency Extraction Model Using Convolutional Neural Networks with Non-local Modules[J]. Computer Engineering, 2023, 49(3): 128-133,160.

https://www.ecice06.com/CN/Y2023/V49/I3/128

Figures/Tables: 9


