Speech Signal Separation Based on Generative Adversarial Networks

doi:10.19678/j.issn.1000-3428.0053446

Abstract: Single-channel speech separation based on deep learning requires computing a time-frequency mask, but in existing methods the mask is not learnable. Because the mask is not built into the deep network for optimization, these methods usually rely on Wiener filtering for post-processing. This paper therefore proposes a speech signal separation method based on Generative Adversarial Networks (GAN). In the speech generation stage, a recursive derivation algorithm and a sparse encoder are introduced to improve the generated time-frequency masks. The generated speech is then fed into the discriminator for classification, so as to reduce interference between signal sources. Experimental results show that, compared with other speech signal separation methods, such as the codec-based method and the recurrent neural network-based method, the proposed method improves the signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) separation metrics by 6.2 dB and 5.0 dB, respectively.

Keywords: single-channel speech separation, Generative Adversarial Networks (GAN), time-frequency masking, recursive derivation, sparse encoder
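
For readers unfamiliar with time-frequency masking, the following minimal sketch illustrates the conventional, non-learned step the abstract contrasts against: a Wiener-like soft ratio mask computed from estimated source magnitudes and applied to the mixture spectrogram. It is not the paper's GAN-based method. The function names, array shapes, and random toy data are assumptions made for illustration only.

```python
import numpy as np


def ratio_masks(source_mags, eps=1e-8):
    """Wiener-like soft masks from per-source magnitude spectrograms.

    source_mags: non-negative array of shape (n_sources, n_freq, n_frames).
    Returns masks of the same shape that sum to ~1 at every time-frequency bin.
    """
    power = np.square(source_mags)
    return power / (power.sum(axis=0, keepdims=True) + eps)


def apply_masks(mixture_spec, masks):
    """Multiply a complex mixture spectrogram by each source mask."""
    return masks * mixture_spec[np.newaxis, ...]


rng = np.random.default_rng(0)
# Hypothetical toy data: two sources, 5 frequency bins, 4 frames.
est_mags = rng.random((2, 5, 4)) + 0.1
mixture = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))

masks = ratio_masks(est_mags)
separated = apply_masks(mixture, masks)
```

Because the masks sum to one in every bin, the separated spectrograms add back up to the mixture. In this sketch the mask is a fixed formula applied after estimation, which is the kind of post-processing the proposed method aims to replace with a mask generated inside the network.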

CLC Number: TP391

LIU Hang, LI Yang, YUAN Haoqi, WANG Junying. Speech Signal Separation Based on Generative Adversarial Networks[J]. Computer Engineering, 2020, 46(1): 302-308.

刘航, 李扬, 袁浩期, 王俊影. 基于生成对抗网络的语音信号分离[J]. 计算机工程, 2020, 46(1): 302-308.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0053446

http://www.ecice06.com/EN/Y2020/V46/I1/302
