[1] ZHANG K, ZHU L X, ZHAO Y Z. A voice conversion method based on retraining GMM[J]. Technical Acoustics, 2010, 29(1):52-55. (in Chinese)
[2] CHEN L H, LING Z H, LIU L J, et al. Voice conversion using deep neural networks with layer-wise generative training[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12):1859-1872.
[3] SUN L F, KANG S Y, LI K, et al. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks[C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Washington D. C., USA:IEEE Press, 2015:4869-4873.
[4] SUN L F, LI K, WANG H, et al. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training[C]//Proceedings of the IEEE International Conference on Multimedia and Expo. Washington D. C., USA:IEEE Press, 2016:1-6.
[5] GAO J F, CHEN J G. Voice conversion with non-parallel corpus based on Style-CycleGAN-VC[J]. Computer Applications and Software, 2021, 38(9):133-139, 159. (in Chinese)
[6] KAMEOKA H, KANEKO T, TANAKA K, et al. ACVAE-VC:non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder[EB/OL]. [2023-02-23]. https://arxiv.org/pdf/1808.05092.pdf.
[7] TAN Z Y. A study of zero-shot voice conversion system based on auto-encoder[D]. Tianjin:Tianjin University, 2020. (in Chinese)
[8] SAITO Y, NAKAMURA T, IJIMA Y, et al. Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification[J]. Acoustical Science and Technology, 2021, 42(1):1-11.
[9] LI Y P, CAO P, ZUO Y T, et al. Voice conversion based on i-vector with variational autoencoding relativistic standard generative adversarial network[J]. Acta Automatica Sinica, 2022, 48(7):1824-1833. (in Chinese)
[10] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification:labelling unsegmented sequence data with recurrent neural networks[EB/OL]. [2023-02-23]. https://mediatum.ub.tum.de/doc/1292048/12285.pdf.
[11] WANG C Y, WU Y, QIAN Y, et al. UniSpeech:unified speech representation learning with labeled and unlabeled data[EB/OL]. [2023-02-23]. https://arxiv.org/abs/2101.07597.
[12] VAN DEN OORD A, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[EB/OL]. [2023-02-23]. https://arxiv.org/pdf/1807.03748.pdf.
[13] CHUNG Y A, GLASS J. Generative pre-training for speech with autoregressive predictive coding[C]//Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA:IEEE Press, 2020:3497-3501.
[14] LIU A T, YANG S W, CHI P H, et al. Mockingjay:unsupervised speech representation learning with deep bidirectional transformer encoders[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Washington D. C., USA:IEEE Press, 2020:6419-6423.
[15] LIN Y Y, CHIEN C M, LIN J H, et al. FragmentVC:any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention[C]//Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Washington D. C., USA:IEEE Press, 2021:5939-5943.
[16] RAO K, SENIOR A, SAK H. Flat start training of CD-CTC-SMBR LSTM RNN acoustic models[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Washington D. C., USA:IEEE Press, 2016:5405-5409.
[17] ZHAO Z P, BAO Z T, ZHANG Z X, et al. Attention enhanced connectionist temporal classification for discrete speech emotion recognition[C]//Proceedings of the International Symposium on Computer Architecture. Phoenix, Arizona, USA:[s.n.], 2019:1-10.
[18] CHEN S Y, WANG C Y, CHEN Z Y, et al. WavLM:large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6):1505-1518.
[19] DENG H G, GUO S W, LI Z J. VQ image compression algorithm based on Huffman coding[J]. Computer Engineering, 2010, 36(4):218-219, 222. (in Chinese)
[20] CHOU J C, YEH C C, LEE H Y. One-shot voice conversion by separating speaker and content representations with instance normalization[EB/OL]. [2023-02-23]. https://arxiv.org/pdf/1904.05742.pdf.
[21] WANG Y X, STANTON D, ZHANG Y, et al. Style Tokens:unsupervised style modeling, control and transfer in end-to-end speech synthesis[EB/OL]. [2023-02-23]. http://arxiv.org/abs/1803.09017.
[22] KUMAR K, KUMAR R, DE BOISSIERE T, et al. MelGAN:generative adversarial networks for conditional waveform synthesis[EB/OL]. [2023-02-23]. http://arxiv.org/abs/1910.06711.
[23] VEAUX C, YAMAGISHI J, MACDONALD K. CSTR VCTK corpus:English multi-speaker corpus for CSTR voice cloning Toolkit[EB/OL]. [2023-02-23]. https://datashare.ed.ac.uk/handle/10283/3443.
[24] KOMINEK J, BLACK A W. The CMU Arctic speech databases[C]//Proceedings of the 5th ISCA Speech Synthesis Workshop. Washington D. C., USA:IEEE Press, 2004:1-10.
[25] KANG X, HUANG H, HU Y, et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion[J]. Digital Signal Processing, 2021, 116(6):103110.
[26] HSU C C, HWANG H T, WU Y C, et al. Voice conversion from non-parallel corpora using variational auto-encoder[EB/OL]. [2023-02-23]. https://arxiv.org/abs/1610.04019v1.