
Computer Engineering ›› 2023, Vol. 49 ›› Issue (3): 128-133,160. doi: 10.19678/j.issn.1000-3428.0063987

• Artificial Intelligence and Pattern Recognition •

Fundamental Frequency Extraction Model Using Convolutional Neural Networks with Non-local Modules

LIU Jingjing, HUANG Hao

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
  • Received: 2022-02-21 Revised: 2022-04-18 Published: 2022-05-03
  • About the authors: LIU Jingjing (born 1997), female, master's student; her main research interest is speech signal processing. HUANG Hao, professor, Ph.D.
  • Funding:
    National Key Research and Development Program of China (2020AAA0107902); National Natural Science Foundation of China (61663044, 61761041); Open Project of the Xinjiang Key Laboratory of Multilingual Information Technology (2020D04047).

Abstract: Estimating the fundamental frequency, or pitch, is a key sub-problem in many speech signal processing techniques. Recent studies take a data-driven approach, extracting the fundamental frequency with a Convolutional Neural Network (CNN). However, a convolution operation processes only a local neighborhood of audio sample points at a time, so global dependencies between sample points can be captured only by applying convolutions repeatedly, which makes computation inefficient and optimization difficult. Inspired by the strong performance of non-local modules in many computer vision tasks, this study proposes a CNN with non-local modules for fundamental frequency extraction. Unlike a stack of convolutional layers, a non-local module directly computes the relationship between any two positions; because it ignores the Euclidean distance between them, it captures long-range dependencies quickly. In the pitch estimation task, adding non-local modules to a CNN to compute the similarity between all audio sample points in each frame helps capture global frame-to-frame and sample-to-sample dependencies at a slight increase in computational complexity. Moreover, non-local modules leave the input and output dimensions unchanged, so they can be easily integrated into a CNN. Experimental results show that the proposed method achieves a Mean Absolute Error (MAE) of only 4.7, which is at least 0.7 lower than that of the baseline model, and obtains state-of-the-art performance.

Key words: fundamental frequency, speech signal processing, data-driven, Convolutional Neural Network (CNN), non-local modules
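
The non-local operation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' model: it assumes the commonly used embedded-Gaussian (softmax) form of a non-local block applied to a one-dimensional feature map, and the function name, weight names, and sizes are all hypothetical. It shows the two properties the abstract relies on: every position is compared with every other position in a single step, and the output has the same shape as the input.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def non_local_block_1d(x, w_theta, w_phi, w_g, w_out):
    """Hypothetical non-local block for a feature map x of shape (positions, channels)."""
    theta = x @ w_theta  # queries, shape (T, d)
    phi = x @ w_phi      # keys, shape (T, d)
    g = x @ w_g          # values, shape (T, d)
    # Pairwise similarity between all positions, independent of their distance.
    attention = softmax(theta @ phi.T, axis=-1)  # shape (T, T), rows sum to 1
    y = attention @ g                            # weighted sum over all positions
    # Residual connection keeps the output shape equal to the input shape.
    return x + y @ w_out


# Assumed sizes for illustration only: 64 positions, 16 channels, 8 embedding dims.
rng = np.random.default_rng(0)
T, C, d = 64, 16, 8
x = rng.standard_normal((T, C))
w_theta, w_phi, w_g = (0.1 * rng.standard_normal((C, d)) for _ in range(3))
w_out = np.zeros((d, C))  # zero init: the block starts out as an identity mapping

z = non_local_block_1d(x, w_theta, w_phi, w_g, w_out)
print(z.shape)
```

Because the output shape matches the input shape, a block of this kind can be placed between existing convolutional layers without changing them. The (T, T) similarity matrix is the source of the additional computational cost.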
