
计算机工程 (Computer Engineering)



U-Net Speech Enhancement Network Integrated with Time-Frequency Transformer

  • Published: 2026-04-08


Abstract: A challenge in speech enhancement is that existing Transformer-based methods model local features poorly, making it difficult to accurately restore high-frequency details and transient components of speech. To address this issue, a U-Net speech enhancement network integrating a time–frequency Transformer was designed, aiming to improve denoising performance by refining the attention mechanism and feature fusion. The network incorporates a parallel time–frequency joint attention module that explicitly separates time-domain and frequency-domain features and processes them in parallel. In addition, a local–global feature collaboration module is introduced at the bottleneck layer, combining the multi-scale local feature extraction capability of densely connected atrous spatial pyramid pooling (DenseASPP) with the global modeling strength of the Transformer. This module employs a dynamic feature calibration mechanism to coordinate multi-scale local context with global dependencies, thereby enhancing the network's perception of speech structure. The network adopts a spectral mapping approach: speech is converted into a time–frequency representation via the short-time Fourier transform (STFT), processed, and then reconstructed as a time-domain signal via the inverse STFT. On a 10-hour training set and a 1-hour validation set constructed from the clean speech dataset LibriSpeech, the noise dataset ESC-50, and the Columbia University noise library, the network performed strongly on multiple objective metrics: perceptual evaluation of speech quality (PESQ) reached 3.37, short-time objective intelligibility (STOI) reached 97%, and scale-invariant signal-to-distortion ratio (SI-SDR) reached 19.97 dB, surpassing several existing state-of-the-art models.
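The spectral-mapping pipeline the abstract describes (STFT → enhancement → inverse STFT) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `mask_fn` callback is a hypothetical stand-in for the U-Net, the noisy phase is reused, and the SciPy framing parameters are illustrative defaults rather than the paper's settings.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_mapping_enhance(noisy, fs=16000, nperseg=512, mask_fn=None):
    """Spectral mapping: STFT -> magnitude processing -> inverse STFT.

    mask_fn is a hypothetical placeholder for the enhancement network:
    it maps a magnitude spectrogram to an enhanced magnitude
    (identity when None, so the signal passes through unchanged).
    """
    # Complex time-frequency representation of the noisy signal.
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    if mask_fn is not None:
        mag = mask_fn(mag)                    # stand-in for the U-Net
    # Recombine the (possibly enhanced) magnitude with the noisy phase.
    Z_hat = mag * np.exp(1j * phase)
    _, enhanced = istft(Z_hat, fs=fs, nperseg=nperseg)
    # istft may return a few extra padded samples; trim to input length.
    return enhanced[: len(noisy)]
```

With `mask_fn=None` the STFT/iSTFT pair satisfies the COLA constraint and reconstructs the input, which is a useful sanity check before plugging in a real enhancement model.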
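The core idea of the parallel time–frequency joint attention module, attending over time frames and frequency bins in separate branches and then fusing the results, can be sketched as follows. The identity Q/K/V projections and the simple averaging fusion are simplifying assumptions for illustration, not the module's actual design.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Identity Q/K/V projections for brevity;
    # a real module would use learned projection weights.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def parallel_tf_attention(spec):
    # spec: (T, F) time-frequency features.
    time_out = self_attention(spec)        # attend across time frames
    freq_out = self_attention(spec.T).T    # attend across frequency bins
    return 0.5 * (time_out + freq_out)     # simple fusion of both branches
```

Running the two branches in parallel, rather than flattening the spectrogram into one long sequence, keeps temporal and spectral dependencies explicitly separated before fusion.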
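The SI-SDR figure reported above can be computed with the standard scale-invariant signal-to-distortion ratio definition, which projects the estimate onto the reference before measuring residual distortion; this is the conventional formula, not code from the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Zero-mean both signals, as is standard for SI-SDR.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference        # scaled target component
    noise = estimate - target         # residual distortion
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))
```

Because the reference is rescaled by the projection coefficient, multiplying the estimate by any positive gain leaves the score unchanged, which is what makes the metric scale-invariant.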