Full-Time Scale Speech Enhancement Method Based on GAN

doi:10.19678/j.issn.1000-3428.0065282

Abstract

Existing time-domain end-to-end speech enhancement methods cannot learn feature information across the full range of time scales, and their modeling of intermediate-layer sequences is insufficient. This study therefore proposes a speech denoising method that operates over the full time scale. The input feature sequence is expanded by linear interpolation to obtain temporal features with a higher resolution than the original input, so that the model can extract features at a finer time scale. The features encoded at each layer are downsampled by interval sampling, and an increasing number of high-dimensional features are computed at coarser time scales so that the network can capture useful deep-level information. In addition, a ConformerBlock is introduced as the intermediate layer of the network, in which the multi-head attention mechanism and the convolution module strengthen the sequence modeling ability of the intermediate layer and highlight the representation information of the intermediate vector. Based on the principle that speech and noise superpose linearly, the Generative Adversarial Network (GAN) is trained jointly with the noise, which lets the network obtain useful information from the perspectives of both the target speech and the noise and further improves its denoising ability. Experimental results on a publicly available speech enhancement test dataset show that the proposed method significantly improves the evaluated indicators of the denoised speech. Compared with the Wave-U-net model, it improves the three main indicators, PESQ, STOI, and SSNR, by 2.75%, 1.06%, and 6.34%, respectively.

Key words: full-time scale, high resolution, linear interpolation, Conformer module, Generative Adversarial Network (GAN)
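
The resampling and joint-noise ideas named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the interpolation factor, and the sampling step are assumptions chosen for illustration, since the abstract does not state them. The sketch shows linear interpolation that refines the time resolution of a sequence, interval sampling that coarsens it again, and the noise target that follows from treating noisy speech as clean speech plus noise.

```python
def upsample_linear(x, factor=2):
    """Insert linearly interpolated samples between neighbouring values.

    `factor` is a hypothetical parameter: factor=2 places one new sample
    halfway between each pair of original samples.
    """
    if len(x) < 2:
        return list(x)
    out = []
    for a, b in zip(x, x[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(x[-1])  # keep the final original sample
    return out


def downsample_interval(x, step=2):
    """Interval sampling: keep every `step`-th sample (assumed step of 2)."""
    return x[::step]


def noise_target(noisy, clean):
    """Under linear superposition (noisy = clean + noise), recover the noise."""
    return [n - c for n, c in zip(noisy, clean)]


if __name__ == "__main__":
    fine = upsample_linear([0.0, 2.0, 4.0])
    print(fine)                       # finer time scale
    print(downsample_interval(fine))  # back to the coarser time scale
    print(noise_target([1.5, 2.0], [1.0, 0.5]))
```

With a factor of 2 and a step of 2, interval sampling of the interpolated sequence returns the original samples, which is why the two operations pair naturally as the fine and coarse ends of a multi-scale encoder.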


CLC Number: TN912.35

SHEN Mengqiang, YU Wennian, YI Li, SONG Nan. Full-Time Scale Speech Enhancement Method Based on GAN[J]. Computer Engineering, 2023, 49(6): 115-122,130.



URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0065282

http://www.ecice06.com/EN/Y2023/V49/I6/115


