Computer Engineering (计算机工程)

Speech Enhancement Algorithm Based on Parallel Multi-Attention

  • Published: 2023-12-05

Abstract: Speech enhancement algorithms remain less robust than the human ear. A human listener can exploit the global correlation of speech to recover the information it conveys even when the signal is corrupted by interference. Frequency-domain features of speech are also richer than time-domain features and easier to extract. Motivated by these observations, this paper proposes an encoder-decoder speech enhancement network based on a Parallel Multi-Attention Net (PMAN) that enhances corrupted speech in the frequency domain. The network takes as input frequency-domain features obtained by the short-time Fourier transform (STFT), comprising the amplitude spectrum and the complex spectrum. The encoder integrates the input with a dense convolution module. In the middle layer, a parallel multi-attention module learns both local and global frequency-domain information and incorporates a Local Patch Attention (LPA) mechanism to capture the two-dimensional structure of the speech spectrum, separating clean speech from interference at the two-dimensional level. The decoder integrates the learned information to generate an amplitude mask and a complex spectrum, which are combined by weighted summation into the final complex spectrum of the enhanced speech. Phase information is incorporated through a joint time-domain and frequency-domain loss function. The proposed algorithm achieves better enhancement than algorithms that do not use an attention mechanism. Experiments on the VoiceBank+DEMAND (VB-DEMAND) dataset show that, compared with the TSTNN network, the perceptual quality, short-time intelligibility, and segmental signal-to-noise ratio of the enhanced speech improve by 10.8%, 11.8%, and 1.05%, respectively.
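The input features and the final weighted summation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the STFT parameters (`n_fft`, `hop`), the three-channel feature layout, the function names, and the fusion weight `alpha` are all assumptions made for illustration, and the two decoder outputs are replaced by stand-in arrays because the abstract does not specify the network itself.

```python
import numpy as np

def stft(x, n_fft=400, hop=100):
    """Minimal Hann-windowed STFT (assumed parameters).
    Returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def network_inputs(spec):
    """Stack amplitude, real part, and imaginary part as feature channels.
    This channel layout is an assumption, not taken from the paper."""
    return np.stack([np.abs(spec), spec.real, spec.imag])

def combine_branches(noisy_spec, mask, complex_est, alpha=0.5):
    """Weighted sum of the two decoder outputs.
    mask scales the noisy amplitude and reuses the noisy phase;
    complex_est is a directly estimated complex spectrum.
    alpha is a hypothetical weight; the abstract does not give its value."""
    masked = mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    return alpha * masked + (1.0 - alpha) * complex_est

rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)     # one second of synthetic "speech" at 16 kHz
spec = stft(noisy)
features = network_inputs(spec)

# Stand-ins for the decoder outputs: an all-ones mask and an unchanged spectrum.
mask = np.ones(spec.shape)
complex_est = spec.copy()
enhanced = combine_branches(spec, mask, complex_est)
```

With the identity stand-ins used here, `enhanced` simply reproduces `spec`; in a trained model the mask and the complex estimate would come from the two decoder branches.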