Speech Enhancement Network Based on Parallel Multi-Attention

doi:10.19678/j.issn.1000-3428.0068019

Abstract

To address frequency-domain enhancement of speech corrupted by interference, this paper proposes PMAN, a speech enhancement network that combines a parallel multi-attention mechanism with an encoder-decoder structure. The network takes as input frequency-domain speech features obtained through the Short-Time Fourier Transform (STFT), comprising the amplitude spectrum and the complex spectrum. The encoder integrates the input data using dense convolutional modules. In the intermediate layer, a parallel multi-attention module learns both local and global information in the frequency domain and incorporates a Local Patch Attention (LPA) mechanism to capture the Two-Dimensional (2D) structure of the speech spectrum, separating clean speech from interference in the 2D space. The decoder integrates the learned information and generates an amplitude mask and a complex spectrum separately. The final complex spectrum of the enhanced speech is obtained by weighted summation, and a joint time- and frequency-domain loss function is used to incorporate phase information. Experimental results on the VoiceBank+DEMAND speech dataset show that PMAN outperforms TSTNN, a time-domain speech enhancement neural network based on a two-stage Transformer, improving Perceptual Evaluation of Speech Quality (PESQ) by 10.8%, Short-Time Objective Intelligibility (STOI) by 1.1%, and Segmental Signal-to-Noise Ratio (SSNR) by 11.8%.

Key words: speech enhancement, frequency-domain, multi-attention mechanism, Transformer network, parallel module
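
The weighted summation described in the abstract can be illustrated with a minimal sketch. It assumes the amplitude mask is applied to the noisy magnitude while the noisy phase is kept, and that the result is blended with the directly predicted complex spectrum. The function name `fuse_spectra` and the weight `alpha` are hypothetical; the abstract does not state the actual weights or the exact form of the fusion.

```python
import cmath


def fuse_spectra(noisy, mask, complex_est, alpha=0.5):
    """Blend a masked-magnitude estimate with a predicted complex spectrum.

    noisy, complex_est: 2D lists (frames x bins) of complex STFT values.
    mask: 2D list (frames x bins) of real amplitude-mask values.
    alpha: hypothetical blending weight (an assumption, not from the paper).
    """
    fused = []
    for noisy_row, mask_row, est_row in zip(noisy, mask, complex_est):
        row = []
        for x, m, c in zip(noisy_row, mask_row, est_row):
            # Mask branch: scale the noisy magnitude, keep the noisy phase.
            masked = cmath.rect(m * abs(x), cmath.phase(x))
            # Weighted sum of the mask branch and the complex branch.
            row.append(alpha * masked + (1.0 - alpha) * c)
        fused.append(row)
    return fused


if __name__ == "__main__":
    # One frame with two frequency bins, values chosen only for illustration.
    noisy = [[1 + 1j, 2 + 0j]]
    mask = [[0.5, 1.0]]
    complex_est = [[0j, 2 + 0j]]
    print(fuse_spectra(noisy, mask, complex_est))
```

Because the mask branch reuses the noisy phase, only the complex branch can correct phase errors in this sketch, which is one plausible reason for combining the two outputs.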

Chi ZHANG, Zhong WANG, Tianhao JIANG, Kangmin XIE. Speech Enhancement Network Based on Parallel Multi-Attention[J]. Computer Engineering, 2024, 50(4): 68-77.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0068019

http://www.ecice06.com/EN/Y2024/V50/I4/68

Figures/Tables 12

Fig.1 Overall structure of network model

Fig.2 Structure of encoder

Fig.3 Speech feature

Fig.4 Structure of local patch attention layer

Fig.5 Structure of parallel Conformer module

Fig.6 Structure of Conformer layer

Fig.7 Structure of decoder

Fig.8 Speech spectrogram

Fig.9 Comparison curve of generalization ability


