Computer Engineering ›› 2024, Vol. 50 ›› Issue (9): 121-129. doi: 10.19678/j.issn.1000-3428.0068661

• Artificial Intelligence and Pattern Recognition •


Target Speech Extraction Based on Cross-modal Attention

YANG Mingqiang, LU Jian*

  1. School of Information Engineering, Dalian University, Dalian 116000, Liaoning, China
  • Received: 2023-10-23 Online: 2024-09-15 Published: 2024-01-25
  • Contact: LU Jian
  • Supported by: Key Support Project of the NSFC-Liaoning Joint Fund (U1708263)


Abstract:

Target speech extraction, as part of the speech separation field, aims to extract the target speech from mixed-speech data. Because visual and auditory information are naturally consistent, visual information can be fused during model training to guide the extraction of the target speech. The traditional approach simply concatenates the visual and audio features and then applies convolution for channel fusion; however, this cannot effectively mine the correlations between cross-modal information. To address this problem, a two-stage cross-modal attention feature fusion module was designed. In the first stage, dot-product attention was computed to mine the shallow correlations between cross-modal information; in the second stage, self-attention was computed to capture the global dependencies among the target speech features, thereby enhancing the representation of the target speech. The two fusion stages trained separate learnable parameters to adjust the attention weights. In addition, a Gated Recurrent Unit (GRU) was introduced into the Temporal Convolutional Network (TCN) to strengthen its ability to capture long-term dependencies in sequential data, thereby improving visual feature extraction and further enhancing audio-visual feature fusion. Experiments were conducted on the VoxCeleb2 and LRS2-BBC datasets. Compared with the baseline method, the proposed method performed favorably on both datasets, improving the Source-to-Distortion Ratio (SDR) evaluation metric by 1.05 dB and 0.26 dB, respectively.
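
The abstract describes the two-stage fusion only at a high level; the following is a minimal PyTorch sketch of that idea, not the authors' implementation. The class name, layer sizes, the use of scalar gates `alpha`/`beta` as the per-stage learnable parameters, and the residual/LayerNorm arrangement are illustrative assumptions; only the overall structure (dot-product cross-attention first, self-attention over the fused features second) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStageCrossModalFusion(nn.Module):
    """Two-stage fusion: cross-modal dot-product attention, then self-attention."""

    def __init__(self, audio_dim=256, visual_dim=256, hidden_dim=256, num_heads=4):
        super().__init__()
        # Stage 1: projections for dot-product cross-attention
        # (audio features act as queries; visual features provide keys/values).
        self.q_proj = nn.Linear(audio_dim, hidden_dim)
        self.k_proj = nn.Linear(visual_dim, hidden_dim)
        self.v_proj = nn.Linear(visual_dim, hidden_dim)
        # Stage 2: self-attention over the fused target-speech representation.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Separate learnable parameters for the two stages to adjust the attention
        # contributions (scalar gates here; the paper does not specify their form).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, Ta, audio_dim); visual_feat: (B, Tv, visual_dim)
        q = self.q_proj(audio_feat)
        k = self.k_proj(visual_feat)
        v = self.v_proj(visual_feat)

        # Stage 1: scaled dot-product attention mines shallow cross-modal correlation.
        scores = torch.matmul(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
        cross = torch.matmul(F.softmax(scores, dim=-1), v)
        fused = self.norm1(q + self.alpha * cross)

        # Stage 2: self-attention captures global dependencies among target-speech features.
        attn_out, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + self.beta * attn_out)


# Hypothetical shapes: 2 utterances, 200 audio frames, 50 video frames, 256-dim features.
fusion = TwoStageCrossModalFusion()
out = fusion(torch.randn(2, 200, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 200, 256])
```

Treating the audio stream as queries and the visual stream as keys/values is one common choice in audio-visual fusion; the abstract does not state which modality serves as the query, so the roles could equally be reversed.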

Key words: target speech extraction, cross-modal fusion, self-attention, Temporal Convolutional Network (TCN), Gated Recurrent Unit (GRU)
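
The abstract also mentions introducing a GRU into the TCN used for visual feature extraction to capture long-term dependencies. The sketch below shows one way such a block could look, under assumed layer widths, dilation pattern, and GRU placement; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn


class TCNGRUBlock(nn.Module):
    """One dilated temporal-convolution block with a GRU for long-term dependencies."""

    def __init__(self, channels=256, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.PReLU()
        self.norm = nn.BatchNorm1d(channels)
        # Bidirectional GRU over time; hidden size is halved so the output stays at `channels`.
        self.gru = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (B, C, T) visual feature sequence
        y = self.norm(self.act(self.conv(x)))
        g, _ = self.gru(y.transpose(1, 2))      # (B, T, C): recurrence along the time axis
        return x + g.transpose(1, 2)            # residual connection back to (B, C, T)


class VisualTCN(nn.Module):
    """Stack of TCN-GRU blocks; the doubling dilation pattern is an assumption."""

    def __init__(self, channels=256, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TCNGRUBlock(channels, dilation=2 ** i) for i in range(num_blocks)]
        )

    def forward(self, x):  # x: (B, C, T)
        return self.blocks(x)


# Hypothetical input: 2 clips, 256-dim lip-embedding channels, 75 video frames.
print(VisualTCN()(torch.randn(2, 256, 75)).shape)  # torch.Size([2, 256, 75])
```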