
计算机工程 (Computer Engineering)



U-Net Speech Enhancement Network Integrated with Time-Frequency Transformer

  • Published: 2026-04-08


Abstract: A challenge in speech enhancement is that existing Transformer-based methods model local features poorly, making it difficult to accurately restore high-frequency details and transient components of speech. To address this issue, a U-Net speech enhancement network integrating a time–frequency Transformer was designed, aiming to improve denoising performance by refining the attention mechanism and feature fusion. The network incorporates a parallel time–frequency joint attention module that explicitly separates time-domain and frequency-domain features and processes them in parallel. In addition, a local–global feature collaboration module is introduced at the bottleneck layer, combining the multi-scale local feature extraction capability of densely connected atrous spatial pyramid pooling (DenseASPP) with the global modeling strength of the Transformer. This module employs a dynamic feature calibration mechanism to coordinate multi-scale local context with global dependencies, thereby enhancing the network's perception of speech structure. The network adopts a spectral mapping approach: speech is converted into a time–frequency representation via the short-time Fourier transform (STFT), processed, and then reconstructed as a time-domain signal via the inverse STFT. On a 10-hour training set and a 1-hour validation set constructed from the clean speech dataset LibriSpeech, the noise dataset ESC-50, and the Columbia University noise library, the network performed strongly on multiple objective metrics: perceptual evaluation of speech quality (PESQ) reached 3.37, short-time objective intelligibility (STOI) reached 97%, and scale-invariant signal-to-distortion ratio (SI-SDR) reached 19.97 dB, surpassing several existing state-of-the-art models.
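The spectral-mapping pipeline the abstract describes (STFT → enhancement → inverse STFT) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `mask_fn` callback is a hypothetical stand-in for the U-Net, the noisy phase is reused, and the SciPy framing parameters are illustrative defaults rather than the paper's settings.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_mapping_enhance(noisy, fs=16000, nperseg=512, mask_fn=None):
    """Spectral mapping: STFT -> magnitude processing -> inverse STFT.

    mask_fn is a hypothetical placeholder for the enhancement network:
    it maps a magnitude spectrogram to an enhanced magnitude
    (identity when None, so the signal passes through unchanged).
    """
    # Complex time-frequency representation of the noisy signal.
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    if mask_fn is not None:
        mag = mask_fn(mag)                    # stand-in for the U-Net
    # Recombine the (possibly enhanced) magnitude with the noisy phase.
    Z_hat = mag * np.exp(1j * phase)
    _, enhanced = istft(Z_hat, fs=fs, nperseg=nperseg)
    # istft may return a few extra padded samples; trim to input length.
    return enhanced[: len(noisy)]
```

With `mask_fn=None` the STFT/iSTFT pair satisfies the COLA constraint and reconstructs the input, which is a useful sanity check before plugging in a real enhancement model.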
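The core idea of the parallel time–frequency joint attention module, attending over time frames and frequency bins in separate branches and then fusing the results, can be sketched as follows. The identity Q/K/V projections and the simple averaging fusion are simplifying assumptions for illustration, not the module's actual design.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Identity Q/K/V projections for brevity;
    # a real module would use learned projection weights.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def parallel_tf_attention(spec):
    # spec: (T, F) time-frequency features.
    time_out = self_attention(spec)        # attend across time frames
    freq_out = self_attention(spec.T).T    # attend across frequency bins
    return 0.5 * (time_out + freq_out)     # simple fusion of both branches
```

Running the two branches in parallel, rather than flattening the spectrogram into one long sequence, keeps temporal and spectral dependencies explicitly separated before fusion.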
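The SI-SDR figure reported above can be computed with the standard scale-invariant signal-to-distortion ratio definition, which projects the estimate onto the reference before measuring residual distortion; this is the conventional formula, not code from the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Zero-mean both signals, as is standard for SI-SDR.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference        # scaled target component
    noise = estimate - target         # residual distortion
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))
```

Because the reference is rescaled by the projection coefficient, multiplying the estimate by any positive gain leaves the score unchanged, which is what makes the metric scale-invariant.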