[1] 马金林,巩元文,马自萍,等.唇语识别的视觉特征提取方法综述[J].计算机科学与探索,2021,15(12):2256-2275.
Ma J L,Gong Y W,Ma Z P,et al.Review of extracting methods for lip visual features[J].Journal of Frontiers of Computer Science and Technology,2021,15(12):2256-2275.
[2] Smith H M J, Ritchie K L, Baguley T S, et al. Face and voice identity matching accuracy is not improved by multimodal identity information[J]. British Journal of Psychology, 2025, 116(2): 367-385.
[3] Wu H,Wang D,Liu Y Y,et al.Decoding Subject's Own Name in the Primary Auditory Cortex[J]. Human Brain Mapping,2023,44(5):1985-1996.
[4] Liu Y,Wang Z Y,Ji S L,et al.Lip-Audio Modality Fusion for Deep Forgery Video Detection[J]. Computers, Materials & Continua,2025,82(2):3499-3515.
[5] Nagrani A, Albanie S, Zisserman A. Seeing voices and hearing faces: cross-modal biometric matching[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. New York: IEEE, 2018: 8427-8436.
[6] Wen Y, Ismail M A, Liu W, et al. Disjoint mapping network for cross-modal matching of voices and faces[J]. arXiv preprint arXiv:1807.04836, 2018.
[7] 柴汶泽,范菁,孙书魁,等.深度度量学习综述[J].计算机应用,2024,44(10):2995-3010.
Chai W Z,Fan J,Sun S K,et al.Overview of deep metric learning[J].Journal of Computer Applications,2024,44(10):2995-3010.
[8] Kim T, Kang J. Face and voice cross-modal association with learning convex feature embedding[J]. Multimedia Systems, 2025, 31(4): 296.
[9] 朱伟杰,陈莹.双流时间域信息交互的微表情识别卷积网络[J].计算机科学与探索,2022,16(04):950-958.
Zhu W J,Chen Y.Micro-expression recognition convolutional network for dual-stream temporal-domain information interaction[J].Journal of Frontiers of Computer Science and Technology, 2022,16(04):950-958.
[10] Chen W, Zhu B, Xu K,et al.VoiceStyle: Voice-based face generation via cross-modal prototype contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications,2024,20(9):1-23.
[11] Phan T, Vu N, Pham C, et al. Multi-task Learning based Voice Verification with Triplet Loss[C]//2020 International Conference on Multimedia Analysis and Pattern Recognition. New York: IEEE, 2020: 1-6.
[12] Zhao Y B, Lin J W,Xuan Q,et al.HPILN: a feature learning framework for cross-modality person re-identification[J].IET Image Processing, 2020, 13(14):2897-2904.
[13] Mahmud T, Mo S, Tian Y, et al. Ma-avt: Modality alignment for parameter-efficient audio-visual transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7996-8005.
[14] Shabtay N, Zimerman I, Schwartz E, et al. CLIMP: Contrastive Language-Image Mamba Pretraining[J]. arXiv preprint arXiv:2601.06891, 2026.
[15] Yu Z, Liu X, Cheung Y M, et al. Detach and enhance: Learning disentangled cross-modal latent representation for efficient face-voice association and matching[C]//2022 IEEE International Conference on Data Mining (ICDM). IEEE, 2022: 648-655.
[16] Wen P, Xu Q, Jiang Y, et al. Seeking the shape of sound: An adaptive framework for learning voice-face association[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, 2021: 16347-16356.
[17] Zhang Q, Wei Y, Han Z, et al. Multimodal fusion on low-quality data: A comprehensive survey[J]. arXiv preprint arXiv:2404.18947, 2024.
[18] Shi H, Hayat M, Cai J. Unpaired referring expression grounding via bidirectional cross-modal matching[J]. Neurocomputing, 2023, 518: 39-49.
[19] Wang Q, Zhang P, Xiong H, et al. Face.evolve: A high-performance face recognition library[J]. arXiv preprint arXiv:2107.08621, 2021.
[20] Liu B, Wang H, Qian Y. Towards lightweight speaker verification via adaptive neural network quantization[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3771-3784.
[21] Liu Y, Sun H, Guan W, et al. Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework[J]. Speech Communication, 2022, 139: 1-9.
[22] Zhou J, Jia X, Li Q, et al. Uniface: Unified cross-entropy loss for deep face recognition[C]//Proceedings of the IEEE/CVF international conference on computer vision. Paris: IEEE, 2023: 20730-20739.
[23] Zolfaghari M, Zhu Y, Gehler P, et al. Crossclr: Cross-modal contrastive learning for multi-modal video representations[C]//Proceedings of the IEEE/CVF international conference on computer vision. IEEE, 2021: 1450-1459.
[24] Nagrani A, Chung J S, Xie W, et al. Voxceleb: Large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027.
[25] Parkhi O, Vedaldi A, Zisserman A. Deep Face Recognition[C]//BMVC 2015 - Proceedings of the British Machine Vision Conference 2015. Swansea: BMVA, 2015: 4101-4112.
[26] Karamizadeh S, Shojae Chaeikar S, Salarian H. Combining MTCNN and Enhanced FaceNet with Adaptive Feature Fusion for Robust Face Recognition[J]. Technologies, 2025, 13(10): 450.
[27] McLoughlin I, Pham L, Song Y, et al. Spectrogram Features for Audio and Speech Analysis[J]. Applied Sciences, 2026, 16(2): 572.
[28] Peng C, He L, Su D, et al. Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder[J]. arXiv preprint arXiv:2404.09509, 2024.
[29] Abidi S M H, Hassan S A, Raza S M, et al. Advances in Face Recognition: A Comprehensive Review of Feature Extraction and Dataset Evaluation[J]. Electronics, 2026, 15(2): 338.
[30] Li G, Gao Y, Huang X, et al. A Hard Negatives Mining and Enhancing Method for Multi-Modal Contrastive Learning[J]. Electronics, 2025, 14(4): 767.
[31] Rusak E, Reizinger P, Juhos A, et al. InfoNCE: Identifying the gap between theory and practice[J]. arXiv preprint arXiv:2407.00143, 2024.
[32] Lahiri A, Kwatra V, Frueh C, et al. Lipsync3d: Data-efficient learning of personalized 3d talking faces from video using pose and lighting normalization[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, 2021: 2755-2764.
[33] Nguyen-Le H H, Tran V T, Nguyen D T, et al. Passive deepfake detection across multi-modalities: A comprehensive survey[J]. arXiv preprint arXiv:2411.17911, 2024.
[34] Nagrani A, Albanie S, Zisserman A. Learnable pins: cross-modal embeddings for person identity[C]//Proceedings of the European Conference on Computer Vision (ECCV). Munich: Springer, 2018: 71-88.
[35] Wang R, Liu X, Cheung Y, et al. Learning discriminative joint embeddings for efficient face and voice association[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.2020:1881-1884.
[36] Ma Q, Bu F, Wang R, et al. Cross-Modal Simplex Center Learning for Speech-Face Association[J]. Computers, Materials & Continua, 2025, 82(3).