
Computer Engineering ›› 2023, Vol. 49 ›› Issue (6): 123-130. doi: 10.19678/j.issn.1000-3428.0064966

• Artificial Intelligence and Pattern Recognition •

Time-Frequency Domain Speech Enhancement Algorithm Based on Dual-Stage Conv-Transformer

SHEN Xueli (沈学利), TIAN Guiyuan (田桂源), JIANG Yanji (姜彦吉), MA Linlin (马琳琳)

  1. Software College, Liaoning Technical University, Huludao 125105, Liaoning, China
  • Received: 2022-06-10 Revised: 2022-08-09 Published: 2023-06-10
  • About the authors: SHEN Xueli (born 1969), male, professor; main research interests: computer networks and information security, and intelligent information processing. TIAN Guiyuan, master's student. JIANG Yanji (corresponding author), associate professor, Ph.D. MA Linlin, master's student.
  • Funding:
    Science and Technology Project of the Education Department of Liaoning Province (LJ2020FWL001).

Abstract: Phase information is crucial for speech enhancement, yet frequency-domain speech enhancement algorithms often suffer from phase mismatch. Time-domain algorithms avoid phase mismatch, but noise and speech are easier to separate in the frequency domain. To combine the complementary strengths of the two approaches, a time-frequency domain speech enhancement algorithm based on a dual-stage Conv-Transformer is proposed. The algorithm adopts an encoder-decoder structure and takes two inputs: frequency-domain features obtained by applying the Short-Time Fourier Transform (STFT) to the noisy speech, and time-domain features obtained by one-dimensional convolution. Because the Transformer is good at extracting global dependencies in speech sequences while a Convolutional Neural Network (CNN) focuses on local features, a Conv-Transformer module is designed to better extract both local and global information in the time and frequency domains. On this basis, the model is optimized with a joint time-domain and frequency-domain loss function so that it learns the distribution of speech in both domains simultaneously. Experimental results show that the algorithm achieves better noise reduction than single-domain speech enhancement algorithms: the enhanced speech attains a Perceptual Evaluation of Speech Quality (PESQ) score of 3.04, a Short-Time Objective Intelligibility (STOI) of 0.953, a Composite Measure for Signal Distortion (CSIG) of 4.34, a Composite Measure for Background Intrusiveness (CBAK) of 3.55, and a Composite Measure for Overall Speech Quality (COVL) of 3.69.

Key words: speech enhancement, time-frequency domain, Convolutional Neural Network (CNN), local information, global information
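The joint time-domain and frequency-domain objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the form of either loss term, so everything specific here is an assumption made for illustration: the waveform term is a mean squared error, the spectral term is an L1 distance between STFT magnitudes, and the names `stft`, `joint_loss`, `frame_len`, `hop`, and `alpha` are hypothetical.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Naive STFT: Hann-windowed frames followed by a real FFT.

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def joint_loss(enhanced, clean, alpha=0.5):
    """Weighted sum of a time-domain and a frequency-domain loss.

    Assumed forms (not taken from the paper):
      time term  = MSE between waveforms
      freq term  = mean absolute difference between STFT magnitudes
    alpha is a hypothetical weight balancing the two terms.
    """
    time_loss = np.mean((enhanced - clean) ** 2)
    freq_loss = np.mean(np.abs(np.abs(stft(enhanced)) - np.abs(stft(clean))))
    return alpha * time_loss + (1.0 - alpha) * freq_loss

# Toy data: a 440 Hz tone sampled at 16 kHz, plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(16000)

print(joint_loss(noisy, clean))  # positive: the noisy signal differs in both domains
print(joint_loss(clean, clean))  # zero: identical signals match in both domains
```

Because both terms are computed from the same pair of waveforms, a model trained on such an objective is penalized for errors in the waveform (including phase) as well as in the magnitude spectrum, which is the complementarity the abstract argues for.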

CLC number:
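The idea behind pairing convolution with a Transformer, local feature extraction followed by global dependency modeling, can be sketched as below. This is not the paper's Conv-Transformer module, whose layer layout is not described in the abstract; it is a generic illustration that assumes a depthwise 1-D convolution along time followed by single-head self-attention, each with a residual connection. All function names, shapes, and weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_conv(x, kernel):
    """Depthwise 1-D convolution along time with zero 'same' padding.

    x: (T, D) feature sequence; kernel: (K, D) with odd K.
    As in most deep learning frameworks, this is a cross-correlation.
    Each output frame only sees K neighbouring frames (local context).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t : t + k] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    Every output frame is a weighted mix of all frames (global context).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def conv_attention_block(x, kernel, wq, wk, wv):
    """Local convolution, then global attention, each with a residual."""
    x = x + local_conv(x, kernel)
    return x + self_attention(x, wq, wk, wv)

# Toy input: 50 frames of 16-dimensional features with random weights.
rng = np.random.default_rng(0)
T, D = 50, 16
x = rng.standard_normal((T, D))
kernel = 0.1 * rng.standard_normal((3, D))
wq, wk, wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

y = conv_attention_block(x, kernel, wq, wk, wv)
print(y.shape)  # (50, 16): the block preserves the sequence shape
```

Since the block maps a `(T, D)` sequence to a `(T, D)` sequence, the same construction could in principle be applied to frame-wise features from either the time-domain or the frequency-domain branch.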