[1] Yang Lujia, Zhou Xishi, Wang Junchao, et al. Survey of 2D Virtual Human Driving Technology Based on Deep Learning[J]. Journal of Computer Research and Development, 2026, 63(3): 1-27. (in Chinese)
[2] Zhou H, Sun Y, Wu W, et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4176-4186.
[3] Zhou Y, Han X, Shechtman E, et al. MakeItTalk: speaker-aware talking-head animation[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 1-15.
[4] Wei H, Yang Z, Wang Z. AniPortrait: Audio-driven synthesis of photorealistic portrait animation[J]. arXiv preprint arXiv:2403.17694, 2024.
[5] Zhong W, Fang C, Cai Y, et al. Identity-preserving talking face generation with landmark and appearance priors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 9729-9738.
[6] Zhu Y, Bai L, Xu J, et al. Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter[J]. arXiv preprint arXiv:2503.06397, 2025.
[7] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[8] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
[9] Le Zheng, Hu Yongting, Xu Yong. Survey of Audio-Driven Talking Face Video Generation and Identification[J]. Journal of Computer Research and Development, 2025, 62(10): 2523-2544. (in Chinese)
[2] Prajwal K R, Mukhopadhyay R, Namboodiri V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 484-492.
[3] Wang S, Li L, Ding Y, et al. Audio2Head: Audio-driven one-shot talking-head generation with natural head motion[J]. arXiv preprint arXiv:2107.09293, 2021.
[4] Chen Z, Cao J, Chen Z, et al. EchoMimic: Lifelike audio-driven portrait animations through editable landmark conditions[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(3): 2403-2410.
[5] Peng Z, Liu J, Zhang H, et al. OmniSync: Towards universal lip synchronization via diffusion transformers[J]. arXiv preprint arXiv:2505.21448, 2025.
[6] Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.
[7] Cui J, Li H, Zhan Y, et al. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 21086-21095.
[8] Song J, Meng C, Ermon S. Denoising diffusion implicit models[J]. arXiv preprint arXiv:2010.02502, 2020.
[9] Chung J S, Zisserman A. Out of time: automated lip sync in the wild[C]//Asian Conference on Computer Vision. Cham: Springer International Publishing, 2016: 251-263.
[10] Zhang Bingyuan, Zhang Xulong, Wang Jianzong, et al. Survey of audio-driven talking face generation technology[J]. Big Data Research, 2024, 10(5): 74-95. (in Chinese)
[11] Chung J S, Senior A, Vinyals O, et al. Lip reading sentences in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6447-6456.
[12] Zhang Z, Li L, Ding Y, et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3661-3670.
[13] Chung J S, Nagrani A, Zisserman A. VoxCeleb2: Deep speaker recognition[J]. arXiv preprint arXiv:1806.05622, 2018.
[14] Heusel M, Ramsauer H, Unterthiner T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[J]. Advances in Neural Information Processing Systems, 2017, 30.
[15] Zhang R, Isola P, Efros A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 586-595.
[16] Deng J, Guo J, Xue N, et al. ArcFace: Additive angular margin loss for deep face recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4690-4699.
[17] Guo J, Zhang D, Liu X, et al. LivePortrait: Efficient portrait animation with stitching and retargeting control[J]. arXiv preprint arXiv:2407.03168, 2024.
[18] King D E. Dlib-ml: A machine learning toolkit[J]. The Journal of Machine Learning Research, 2009, 10: 1755-1758.
[19] Lugaresi C, Tang J, Nash H, et al. MediaPipe: A framework for building perception pipelines[J]. arXiv preprint arXiv:1906.08172, 2019.