
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 313-320. doi: 10.19678/j.issn.1000-3428.0067272

• Development Research and Engineering Application •

Voice Conversion Combining Vector Quantization and CTC Introducing Pre-Trained Representation

WANG Lin, HUANG Hao   

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830017, Xinjiang, China
  • Received: 2023-03-27  Revised: 2023-06-25  Published: 2023-07-18
  • Corresponding author: WANG Lin, E-mail: wngln0722@163.com
  • Supported by: Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2020D04047)

Abstract: Pre-trained models based on self-supervised representation learning have achieved major breakthroughs in nonparallel-corpus Voice Conversion (VC). With the widespread use of Self-Supervised Pre-trained Representation (SSPR), the features extracted by pre-trained models have been shown to contain rich content information. This study proposes a VC model that builds on SSPR and combines Vector Quantization (VQ) with Connectionist Temporal Classification (CTC). The SSPR extracted by a pre-trained model is used as the input of an end-to-end model to improve the quality of one-shot VC. Effectively decoupling content and speaker representations is a key issue in VC. Using SSPR as the preliminary content information, VQ is applied to decouple the content and speaker representations from speech. However, VQ alone merely discretizes the content information and can hardly separate a pure content representation from speech. To further remove residual speaker information from the content information, a CTC loss is introduced to guide the content encoder. CTC not only serves as an auxiliary network that accelerates model convergence; its additional text supervision can also be jointly optimized with VQ, so that the two complement each other and a pure content representation is learned. The speaker representation is obtained through style-embedding learning, and the two representations are fed into the system as input for conversion. The proposed method is evaluated on the open-source CMU dataset and the VCTK corpus. Experimental results show that the method achieves an objective Mel-Cepstral Distortion (MCD) of 8.896 dB, as well as subjective Mean Opinion Score (MOS) values of 3.29 for speech naturalness and 3.22 for speaker similarity, all better than those of the baseline model. The method thus achieves the best performance in terms of conversion quality and speaker similarity.
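
As a concrete illustration of the architecture described above, the following is a minimal PyTorch sketch of a VQ bottleneck over SSPR content features whose encoder is jointly supervised by an auxiliary CTC loss. This is not the authors' implementation: all dimensions, module choices, and the phoneme vocabulary size are illustrative assumptions.

    # Minimal sketch (assumptions, not the authors' code): VQ bottleneck over
    # SSPR content features, jointly optimized with an auxiliary CTC loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VQBottleneck(nn.Module):
        """Snap each content frame to its nearest codebook entry (VQ-VAE style)."""
        def __init__(self, num_codes=256, dim=256, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta

        def forward(self, z):                          # z: (batch, time, dim)
            flat = z.reshape(-1, z.size(-1))
            dist = torch.cdist(flat, self.codebook.weight)  # pairwise L2 distances
            idx = dist.argmin(dim=-1)
            q = self.codebook(idx).view_as(z)
            # Codebook + commitment losses, then a straight-through gradient.
            vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
            q = z + (q - z).detach()
            return q, vq_loss

    class ContentEncoder(nn.Module):
        """Maps SSPR features to quantized content tokens; a CTC head supervises them."""
        def __init__(self, sspr_dim=768, dim=256, num_phones=70):  # sizes are assumptions
            super().__init__()
            self.proj = nn.Linear(sspr_dim, dim)
            self.vq = VQBottleneck(dim=dim)
            self.ctc_head = nn.Linear(dim, num_phones + 1)  # +1 for the CTC blank

        def forward(self, sspr):
            content, vq_loss = self.vq(torch.tanh(self.proj(sspr)))
            log_probs = self.ctc_head(content).log_softmax(-1)
            return content, vq_loss, log_probs

    # Joint optimization of VQ and CTC (the full model adds a decoder
    # reconstruction loss conditioned on the speaker/style embedding):
    enc = ContentEncoder()
    sspr = torch.randn(2, 100, 768)           # stand-in for SSPR features
    phones = torch.randint(1, 71, (2, 30))    # hypothetical phoneme targets
    content, vq_loss, log_probs = enc(sspr)
    ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), phones,
                          input_lengths=torch.full((2,), 100),
                          target_lengths=torch.full((2,), 30))
    loss = vq_loss + ctc_loss

For reference, the reported objective score is a Mel-Cepstral Distortion, conventionally computed over D mel-cepstral coefficients as

    \mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_d - \hat{c}_d\right)^2}

where c_d and \hat{c}_d are the target and converted coefficients; lower values indicate better conversion quality.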

Key words: pre-trained representation, Self-Supervised Learning (SSL), Vector Quantization (VQ), decoupling, Connectionist Temporal Classification (CTC)
