[1] 赵明园,周若华,袁庆升,等.深度学习在音乐生成中的研究与应用综述[J/OL].计算机工程与应用,1-42[2025-12-25].
Zhao Mingyuan, Zhou Ruohua, Yuan Qingsheng, et al. A Survey of Research and Applications of Deep Learning in Music Generation[J/OL]. Computer Engineering and Applications, 1-42[2025-12-25].
[2] Pachet F, Roy P. Markov constraints: steerable generation of Markov sequences[J]. Constraints, 2011, 16(2): 148-172.
[3] Briot J P, Hadjeres G, Pachet F D. Deep Learning Techniques for Music Generation--A Survey[J]. arXiv preprint arXiv:1709.01620, 2019.
[4] Huang C Z A, Vaswani A, Uszkoreit J, et al. Music Transformer: Generating Music with Long-Term Structure[C]//International Conference on Learning Representations. 2019.
[5] Kong Z, Ping W, Huang J, et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis[C]//International Conference on Learning Representations. 2021.
[6] Wang Y, Wu S, Hu J, et al. NotaGen: advancing musicality in symbolic music generation with large language model training paradigms[C]//Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. 2025: 10207-10215.
[7] Rinaldi I, Fanelli N, Castellano G, et al. Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 173-186.
[8] Lam M W Y, Tian Q, Li T, et al. Efficient neural music generation[J]. Advances in Neural Information Processing Systems, 2023, 36: 17450-17463.
[9] Deng Z, Ma Y, Liu Y, et al. Musilingo: Bridging music and text with pre-trained language models for music captioning and query response[C]//Findings of the Association for Computational Linguistics: NAACL 2024. 2024: 3643-3655.
[10] Mittal G, Engel J, Hawthorne C, et al. Symbolic music generation with diffusion models[J]. arXiv preprint arXiv:2103.16091, 2021.
[11] Dadman S, Bremdal B A, Bang B, et al. Toward interactive music generation: A position paper[J]. IEEE Access, 2022, 10: 125679-125695.
[12] Ma Y, Øland A, Ragni A, et al. Foundation Models for Music: A Survey[J]. CoRR, 2024.
[13] Donahue C, McAuley J, Puckette M. Adversarial Audio Synthesis[C]//International Conference on Learning Representations. 2019.
[14] 郑文秀,赵峻毅,文心怡,等.基于瓶颈复合特征的声学模型建立方法[J].计算机工程,2020,46(11):301-305+314.
Zheng Wenxiu, Zhao Junyi, Wen Xinyi, et al. An Acoustic Model Building Method Based on Bottleneck Composite Features[J]. Computer Engineering, 2020, 46(11): 301-305+314.
[15] Manzelli R, Thakkar V, Siahkamari A, et al. An end to end model for automatic music generation: Combining deep raw and symbolic audio networks[C]//Proceedings of the musical metacreation workshop at 9th international conference on computational creativity, Salamanca, Spain. 2018.
[16] Lu X, Wang J, Zhuang B, et al. A syllable-structured, contextually-based conditionally generation of Chinese lyrics[C]//Pacific Rim International Conference on Artificial Intelligence. Cham: Springer International Publishing, 2019: 257-265.
[17] Sheng Z, Song K, Tan X, et al. Songmass: Automatic song writing with pre-training and alignment constraint[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(15): 13798-13805.
[18] Bai Y, Chen H, Chen J, et al. Seed-music: A unified framework for high quality and controlled music generation[J]. arXiv preprint arXiv:2409.09214, 2024.
[19] Lei S, Zhou Y, Tang B, et al. Songcreator: Lyrics-based universal song generation[J]. Advances in Neural Information Processing Systems, 2024, 37: 80107-80140.
[20] Liu Z, Ding S, Zhang Z, et al. SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation[C]//International Conference on Machine Learning. PMLR, 2025: 38351-38364.
[21] Dong H W, Hsiao W Y, Yang L C, et al. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1).
[22] Wu J, Liu X, Hu X, et al. PopMNet: Generating structured pop music melodies using neural networks[J]. Artificial Intelligence, 2020, 286: 103303.
[23] Ren Y, He J, Tan X, et al. Popmag: Pop music accompaniment generation[C]//Proceedings of the 28th ACM international conference on multimedia. 2020: 1198-1206.
[24] Nishimura M, Hashimoto K, Oura K, et al. Singing Voice Synthesis Based on Deep Neural Networks[C]//Interspeech. 2016: 2478-2482.
[25] Gu Y, Yin X, Rao Y, et al. Bytesing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and wavernn vocoders[C]//2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2021: 1-5.
[26] Chandna P, Blaauw M, Bonada J, et al. Wgansing: A multi-voice singing voice synthesizer based on the wasserstein-gan[C]//2019 27th European signal processing conference (EUSIPCO). IEEE, 2019: 1-5.
[27] Liu J, Li C, Ren Y, et al. Diffsinger: Singing voice synthesis via shallow diffusion mechanism[C]//Proceedings of the AAAI conference on artificial intelligence. 2022, 36(10): 11020-11028.
[28] Zhang Y, Cong J, Xue H, et al. Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 7237-7241.
[29] Défossez A, Zeghidour N, Usunier N, et al. Sing: Symbol-to-instrument neural generator[J]. Advances in neural information processing systems, 2018, 31.
[30] Schimbinschi F, Walder C, Erfani S M, et al. SynthNet: Learning to Synthesize Music End-to-End[C]//Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 2019: 3367-3374.
[31] Engel J, Hantrakul L, Gu C, et al. DDSP: Differentiable Digital Signal Processing[C]//International Conference on Learning Representations. 2020.
[32] Lam M W Y, Tian Q, Li T, et al. Efficient neural music generation[J]. Advances in Neural Information Processing Systems, 2023, 36: 17450-17463.
[33] Le D V T, Bigo L, Herremans D, et al. Natural language processing methods for symbolic music generation and information retrieval: A survey[J]. ACM Computing Surveys, 2025, 57(7): 1-40.
[34] Jung J, Kim D, Lee S, et al. Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio[J]. arXiv preprint arXiv:2505.12863, 2025.
[35] Li C, Wang R, Liu L, et al. QA-MDT: quality-aware masked diffusion transformer for enhanced music generation[C]//Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. 2025: 10135-10143.
[36] Guo H, Zhang J, Jiang Y, et al. Emo-music: Emotion recognition based music therapy with deep learning on physiological signals[C]//2024 IEEE First International Conference on Artificial Intelligence for Medicine, Health and Care (AIMHC). IEEE, 2024: 10-13.
[37] Engel J, Resnick C, Roberts A, et al. Neural audio synthesis of musical notes with wavenet autoencoders[C]//International conference on machine learning. PMLR, 2017: 1068-1077.
[38] Mehri S, Kumar K, Gulrajani I, et al. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model[C]//International Conference on Learning Representations. 2017.
[39] Van Den Oord A, Dieleman S, Zen H, et al. Wavenet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[40] Jeong D, Kwon T, Kim Y, et al. VirtuosoNet: A Hierarchical RNN-based System for Modeling Expressive Piano Performance[C]//20th International Society for Music Information Retrieval Conference (ISMIR). 2019: 908-915.
[41] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[J]. Advances in neural information processing systems, 2014, 27.
[42] Engel J, Agrawal K K, Chen S, et al. GANSynth: Adversarial Neural Audio Synthesis[C]//International Conference on Learning Representations. 2019.
[43] Marafioti A, Perraudin N, Holighaus N, et al. Adversarial generation of time-frequency features with application in audio synthesis[C]//International conference on machine learning. PMLR, 2019: 4352-4362.
[44] Nistal J, Lattner S, Richard G. DrumGAN: Synthesis of drum sounds with timbral feature conditioning using Generative Adversarial Networks[C]//21st International Society for Music Information Retrieval Conference (ISMIR). 2020.
[45] Roberts A, Engel J, Raffel C, et al. A hierarchical latent vector model for learning long-term structure in music[C]//International conference on machine learning. PMLR, 2018: 4364-4373.
[46] Dhariwal P, Jun H, Payne C, et al. Jukebox: A generative model for music[J]. arXiv preprint arXiv:2005.00341, 2020.
[47] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[48] 汪涛,靳聪,李小兵,等.基于Transformer的多轨音乐生成对抗网络[J].计算机应用,2021,41(12):3585-3589.
Wang Tao, Jin Cong, Li Xiaobing, et al. Multi-track Music Generative Adversarial Network Based on Transformer[J]. Journal of Computer Applications, 2021, 41(12): 3585-3589.
[49] Gong J, Zhao S, Wang S, et al. Ace-step: A step towards music generation foundation model[J]. arXiv preprint arXiv:2506.00045, 2025.
[50] Borsos Z, Marinier R, Vincent D, et al. Audiolm: a language modeling approach to audio generation[J]. IEEE/ACM transactions on audio, speech, and language processing, 2023, 31: 2523-2533.
[51] Zanon Boito M, Iyer V, Lagos N, et al. mHuBERT-147: A Compact Multilingual HuBERT Model[C]//Proc. Interspeech 2024. 2024: 3939-3943.
[52] Chung Y A, Zhang Y, Han W, et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training[C]//2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021: 244-250.
[53] Zeghidour N, Luebs A, Omran A, et al. Soundstream: An end-to-end neural audio codec[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 495-507.
[54] Rong Y, Wang J, Lei G, et al. Audiogenie: A training-free multi-agent framework for diverse multimodality-to-multiaudio generation[C]//Proceedings of the 33rd ACM International Conference on Multimedia. 2025: 8872-8881.
[55] 白勇,帖云.一种基于扩散模型的音乐生成方法[J/OL].计算机应用与软件,1-6[2025-12-25].
Bai Yong, Tie Yun. A Music Generation Method Based on Diffusion Model[J/OL]. Computer Applications and Software, 1-6[2025-12-25].
[56] Liu H, Chen Z, Yuan Y, et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models[C]//International Conference on Machine Learning. PMLR, 2023: 21450-21474.
[57] Huang Q, Park D S, Wang T, et al. Noise2music: Text-conditioned music generation with diffusion models[J]. arXiv preprint arXiv:2302.03917, 2023.
[58] Chen K, Wu Y, Liu H, et al. Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024: 1206-1210.
[59] Huang R, Lam M W Y, Wang J, et al. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis[C]//Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. 2022: 4157-4163.
[60] Lipman Y, Chen R T Q, Ben-Hamu H, et al. Flow Matching for Generative Modeling[C]//International Conference on Learning Representations. 2023.
[61] Vyas A, Shi B, Le M, et al. Audiobox: Unified audio generation with natural language prompts[J]. arXiv preprint arXiv:2312.15821, 2023.
[62] Guo Y, Du C, Ma Z, et al. Voiceflow: Efficient text-to-speech with rectified flow matching[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024: 11121-11125.
[63] Liu X, Gong C, Liu Q. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow[C]//NeurIPS 2022 Workshop on Score-Based Methods. 2022.
[64] Ning Z, Chen H, Jiang Y, et al. DiffRhythm: Blazingly fast and embarrassingly simple end-to-end full-length song generation with latent diffusion[J]. arXiv preprint arXiv:2503.01183, 2025.
[65] Liu H, Yuan Y, Liu X, et al. Audioldm 2: Learning holistic audio generation with self-supervised pretraining[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 2871-2883.
[66] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in neural information processing systems, 2020, 33: 1877-1901.
[67] Zhang C, Ma Y, Chen Q, et al. InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation[J]. arXiv preprint arXiv:2503.00084, 2025.
[68] Tjandra A, Wu Y C, Guo B, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound[J]. arXiv preprint arXiv:2502.05139, 2025.
[69] Yuan R, Lin H, Guo S, et al. Yue: Scaling open foundation models for long-form music generation[J]. arXiv preprint arXiv:2503.08638, 2025.
[70] Hu E J, Shen Y, Wallis P, et al. Lora: Low-rank adaptation of large language models[C]//International Conference on Learning Representations. 2022.
[71] Agostinelli A, Denk T I, Borsos Z, et al. Musiclm: Generating music from text[J]. arXiv preprint arXiv:2301.11325, 2023.
[72] Liu S, Hussain A S, Wu Q, et al. M²UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models[J]. arXiv preprint arXiv:2311.11255, 2023.
[73] Liu S, Wu Q, Sun C, et al. Mumu-llama: Multi-modal music understanding and generation via large language models[J]. Expert Systems with Applications, 2025: 130688.
[74] Liu L, Gong R, Yang Y. MusDiff: A multimodal-guided framework for music generation[J]. Alexandria Engineering Journal, 2025, 129: 128-136.
[75] Sun X, Han X, Yao F, et al. Emotion-aware cross-modal music generation based on multimodal emotion recognition[J]. Alexandria Engineering Journal, 2025, 133: 254-270.
[76] Wu S, Wang Y, Yuan R, et al. Clamp 2: Multimodal music information retrieval across 101 languages using large language models[C]//Findings of the Association for Computational Linguistics: NAACL 2025. 2025: 435-451.
[77] Agres K R, Schaefer R S, Volk A, et al. Music, computing, and health: a roadmap for the current and future roles of music technology for health care and well-being[J]. Music & Science, 2021, 4: 2059204321997709.
[78] Huh M, Fraser A C, Li D, et al. VidTune: Creating Video Soundtracks with Generative Music and Contextual Thumbnails[J]. arXiv preprint arXiv:2601.12180, 2026.
[79] Liu H, Wang Z, Hong H, et al. MetaBGM: Dynamic Soundtrack Transformation For Continuous Multi-Scene Experiences With Ambient Awareness And Personalization[J]. arXiv preprint arXiv:2409.03844, 2024.
[80] Zhang L. Compositional tools based on artificial intelligence for choral artistic education: Enhancing creative skills in choral arrangements[J]. Thinking Skills and Creativity, 2025, 56: 101768.
[81] Yang W, Huang C F, Huang H Y, et al. Research on the improvement of children’s attention through binaural beats music therapy in the context of ai music generation[M]//Summit on Music Intelligence. Singapore: Springer Nature Singapore, 2023: 19-31.
[82] Williams D, Hodge V J, Wu C Y. On the use of ai for generation of functional music to improve mental health[J]. Frontiers in Artificial Intelligence, 2020, 3: 497864.
[83] Cai Y, Liu Z, Yang Z, et al. Starrypia: An AR Gamified Music Adjuvant Treatment Application for Children with Autism Based on Combined Therapy[C]//Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 2023: 1-16.
[84] Briot J P, Hadjeres G, Pachet F D. Deep learning techniques for music generation[M]. Heidelberg: Springer, 2020.
[85] Kong Q, Li B, Chen J, et al. GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music[J]. Transactions of the International Society for Music Information Retrieval, 2022, 5(1).
[86] Defferrard M, Benzi K, Vandergheynst P, et al. FMA: A Dataset For Music Analysis[C]//18th International Society for Music Information Retrieval Conference. 2017.
[87] Gemmeke J F, Ellis D P W, Freedman D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]//2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017: 776-780.
[88] Huang W C, Violeta L P, Liu S, et al. The singing voice conversion challenge 2023[C]//2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023: 1-8.
[89] Zhang L, Li R, Wang S, et al. M4singer: A multi-style, multi-singer and musical score provided mandarin singing corpus[J]. Advances in Neural Information Processing Systems, 2022, 35: 6914-6926.
[90] Bertin-Mahieux T, Ellis D P W, Whitman B, et al. The million song dataset[C]//Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR). 2011.
[91] Manco I, Weck B, Doh S, et al. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation[C]//Workshop on Machine Learning for Audio, Neural Information Processing Systems (NeurIPS). Neural Information Processing Systems, 2023.
[92] Melechovsky J, Guo Z, Ghosal D, et al. Mustango: Toward controllable text-to-music generation[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024: 8293-8316.
[93] Shatri E, Fazekas G. DoReMi: First glance at a universal OMR dataset[J]. arXiv preprint arXiv:2107.07786, 2021.
[94] Thickstun J, Harchaoui Z, Kakade S. Learning Features of Music From Scratch[C]//International Conference on Learning Representations. 2017.
[95] Wang Z, Chen K, Jiang J, et al. Pop909: A pop-song dataset for music arrangement generation[J]. arXiv preprint arXiv:2008.07142, 2020.
[96] Chi X, Wang Y, Cheng A, et al. Mmtrail: A multimodal trailer video dataset with language and music descriptions[J]. arXiv preprint arXiv:2407.20962, 2024.
[97] Dash A, Agres K. Ai-based affective music generation systems: A review of methods and challenges[J]. ACM Computing Surveys, 2024, 56(11): 1-34.
[98] Yao J, Ma G, Xue H, et al. SongEval: A Benchmark Dataset for Song Aesthetics Evaluation[J]. arXiv preprint arXiv:2505.10793, 2025.
[99] Hernandez-Olivan C, Puyuelo J A, Beltran J R. Subjective evaluation of deep learning models for symbolic music composition[J]. arXiv preprint arXiv:2203.14641, 2022.
[100] Kilgour K, Zuluaga M, Roblek D, et al. Fréchet Audio Distance: A Reference-free Metric for Evaluating Music Enhancement Algorithms[C]//Proc. Interspeech. 2019: 2350-2354.
[101] Kominek J, Schultz T, Black A W. Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion[C]//SLTU. 2008: 63-68.
[102] Huang Q, Jansen A, Lee J, et al. MuLan: A Joint Embedding of Music Audio and Natural Language[C]//Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). 2022.
[103] Rafailov R, Sharma A, Mitchell E, et al. Direct preference optimization: Your language model is secretly a reward model[J]. Advances in neural information processing systems, 2023, 36: 53728-53741.
[104] Jiang N, Jin S, Duan Z, et al. Rl-duet: Online music accompaniment generation using deep reinforcement learning[C]//Proceedings of the AAAI conference on artificial intelligence. 2020, 34(01): 710-718.
[105] Cideron G, Girgin S, Verzetti M, et al. MusicRL: aligning music generation to human preferences[C]//Proceedings of the 41st International Conference on Machine Learning. 2024: 8968-8984.
[106] Mitra R, Zualkernan I. Music generation using deep learning and generative AI: a systematic review[J]. IEEE Access, 2025.
[107] Zixun G, Makris D, Herremans D. Hierarchical recurrent neural networks for conditional melody generation with long-term structure[C]//2021 international joint conference on neural networks (IJCNN). IEEE, 2021: 1-8.
[108] Katyal Y, Singh S V, Saxena A, et al. Exploring the Evolution of Music and Artificial Intelligence[C]//2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE). IEEE, 2024: 390-393.
[109] Dadman S, Bremdal B A, Bang B, et al. Toward interactive music generation: A position paper[J]. IEEE Access, 2022, 10: 125679-125695.
[110] Um K L, Jung J. Evolution and historical review of music in mass media[J]. International Journal of Advanced Culture Technology, 2024: 370-379.