
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 138-147. doi: 10.19678/j.issn.1000-3428.0067970

• Artificial Intelligence and Pattern Recognition •

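Speech Enhancement Method Incorporating Multi-Scale Features and Contextual Information

Gengzangcuomao1,2, HUANG Heming1,2, YANG Yijie1,2

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, Qinghai, China;
  2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810000, Qinghai, China
  • Received: 2023-06-29 Revised: 2023-09-26 Online: 2024-06-15 Published: 2024-06-22
  • Corresponding author: Gengzangcuomao, E-mail: 1021489068@qq.com
  • Funding:
    Basic Research Program of Qinghai Province (2022-ZJ-925); National Natural Science Foundation of China (62066039); Independent Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application, jointly established by the province and the ministry (2022-SKL-002, 2022-SKL-007); 2021 Qinghai Normal University Natural Science Research Fund for Young and Middle-Aged Researchers (KJQN2021001).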



Abstract: In speech enhancement, autoencoder (AE) architectures are commonly used to extract features automatically. However, the features obtained in this way tend to be either limited to a single scale or redundant, and they fail to capture the contextual dependencies of speech signals adequately. To address this, a speech enhancement method that fuses multi-scale features and contextual information, MSF-CI, is proposed. First, a multi-scale convolutional block extracts multi-scale features from the speech signal, which addresses the single-scale problem. Second, an attention mechanism focuses on the key spatial and channel information in the extracted features, which addresses feature redundancy. Finally, a Gated Convolutional Recurrent Neural Network (GCRN) learns the long-span contextual dependencies in the speech signal, and gated linear units improve the network's nonlinear learning ability and thereby the model's generalization. Experimental results show that, at low Signal-to-Noise Ratios (SNRs) and in different noise environments, MSF-CI outperforms comparable single-channel speech enhancement models such as GRN, DPT-FSNet, and U-Net on several metrics of the enhanced speech, including perceptual speech quality and Short-Time Objective Intelligibility (STOI). At an SNR of 0 dB, the average perceptual speech quality and average objective intelligibility of the proposed method reach 1.49 and 0.761, respectively. The generalization of the model is verified on a self-built Amdo Tibetan corpus, on which the average perceptual speech quality and average objective intelligibility improve by 20.7% and 11.3%, respectively, relative to the noisy speech. The MSF-CI model therefore not only improves speech quality and intelligibility but also generalizes well.
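The multi-scale extraction and gating steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the MSF-CI implementation: the kernel sizes, the fixed averaging kernels (standing in for learned convolution weights), and the function names `multi_scale_features` and `glu` are all assumptions made for illustration. The gated linear unit follows its usual definition, splitting the input in half and multiplying one half by the sigmoid of the other.

```python
import numpy as np

def glu(x, axis=0):
    """Gated linear unit: split x in half along `axis`, gate one half by sigmoid of the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

def multi_scale_features(signal, kernel_sizes=(3, 5, 7)):
    """Filter a 1-D signal with kernels of several widths and stack the outputs as channels."""
    # Averaging kernels are placeholders for learned weights (assumption, not from the paper).
    branches = [np.convolve(signal, np.ones(k) / k, mode="same") for k in kernel_sizes]
    return np.stack(branches, axis=0)  # shape: (len(kernel_sizes), len(signal))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                                  # toy input sequence
feats = multi_scale_features(x, kernel_sizes=(3, 5, 7, 9))   # four parallel scales
gated = glu(feats, axis=0)                                   # gating halves the channel count
```

Each branch sees a different receptive field, so stacking them gives the later layers features at several scales at once; the gate then lets half of the channels decide how much of the other half passes through.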

Key words: speech enhancement, multi-scale feature, attention mechanism, Gated Convolutional Recurrent Neural Network (GCRN), Logarithmic Power Spectrum (LPS)
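As a rough illustration of two quantities mentioned above, the sketch below mixes noise into a clean signal at a target SNR (0 dB here, matching the reported test condition) and computes a logarithmic power spectrum. The frame length, hop size, Hann window, sampling rate, and the helper names `mix_at_snr` and `log_power_spectrum` are assumptions; this page does not state the settings used in the paper.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db` (in dB), then add it."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise, scale

def log_power_spectrum(signal, frame_len=256, hop=128, eps=1e-10):
    """Frame the signal, apply a Hann window, and return log(|FFT|^2) for each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # one-sided spectrum per frame
    return np.log(np.abs(spectrum) ** 2 + eps)   # eps avoids log(0)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)  # toy "speech": 440 Hz tone at 16 kHz
noise = rng.standard_normal(1024)
noisy, scale = mix_at_snr(clean, noise, snr_db=0.0)
lps = log_power_spectrum(noisy)                             # shape: (frames, frequency bins)
```

At 0 dB the scaled noise carries the same average power as the clean signal, which is why it is considered a difficult low-SNR condition.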
