
Computer Engineering ›› 2023, Vol. 49 ›› Issue (6): 115-122,130. doi: 10.19678/j.issn.1000-3428.0065282

• Artificial Intelligence and Pattern Recognition •

Full-Time Scale Speech Enhancement Method Based on GAN

SHEN Mengqiang1,2, YU Wennian2, YI Li2, SONG Nan2

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
  2. Nanjing Fiberhome World Communication Technology Co., Ltd., Nanjing 210019, China
  • Received: 2022-07-18 Revised: 2022-08-25 Published: 2022-09-29
  • About the authors: SHEN Mengqiang (born 1998), male, master's student, whose main research interests are speech enhancement and speech recognition; YU Wennian and YI Li, senior engineers, master's degrees; SONG Nan, engineer.
  • Funding:
    National Key Research and Development Program of China (2017YFB1400704).

Abstract: Existing time-domain end-to-end speech enhancement methods cannot learn feature information across the full range of time scales, and their intermediate layers model sequences inadequately. This study therefore proposes a speech denoising method that operates over the full time scale. The input feature sequence is expanded by linear interpolation to obtain temporal features with a higher resolution than the original input, so that the model can extract features at a finer time scale. The features encoded at each layer are then downsampled by interval sampling, so that progressively more high-dimensional features are computed at coarser time scales and the network can capture useful deep-level information. In addition, a ConformerBlock is introduced as the intermediate layer of the network; its multi-head attention mechanism and convolution module strengthen the sequence modeling ability of the intermediate layer and highlight the representation information of the intermediate vector. Based on the principle that speech and noise are linearly superimposed, the Generative Adversarial Network (GAN) is trained jointly with noise, which lets the network obtain useful information from both the target speech and the noise and further improves its denoising ability. Experimental results on a publicly available speech enhancement test dataset show that all metrics of the denoised speech improve significantly with the proposed method. Compared with the Wave-U-net model, the three main metrics, PESQ, STOI, and SSNR, improve by 2.75%, 1.06%, and 6.34%, respectively.

Key words: full-time scale, high resolution, linear interpolation, Conformer module, Generative Adversarial Network (GAN)
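
The two resampling operations named in the abstract, linear interpolation to reach a finer time scale and interval sampling to reach a coarser one, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `upsample_linear` and `decimate`, the factor of 2, and the toy signal are assumptions chosen for illustration, and the abstract does not state the actual factors or layer layout.

```python
def upsample_linear(x, factor=2):
    """Insert (factor - 1) linearly interpolated samples between neighbours.

    Assumed illustration of "expansion by linear interpolation"; the
    factor used in the paper is not given in the abstract.
    """
    if len(x) < 2:
        return list(x)
    out = []
    for a, b in zip(x, x[1:]):
        for k in range(factor):
            # k = 0 reproduces the original sample a; k > 0 lies between a and b
            out.append(a + (b - a) * k / factor)
    out.append(x[-1])  # keep the final original sample
    return out


def decimate(x, step=2):
    """Interval sampling: keep every `step`-th sample (coarser time scale)."""
    return list(x[::step])


signal = [0.0, 1.0, 0.0, -1.0]            # toy waveform (hypothetical)
fine = upsample_linear(signal, factor=2)   # higher temporal resolution
coarse = decimate(fine, step=2)            # back to the original resolution
```

In this toy case, decimating the interpolated sequence with the same factor recovers the original samples, which shows that the two operations move along the same time axis in opposite directions. In a real network they would act on learned feature maps rather than raw lists.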

CLC Number: