1. HAYKIN S, CHEN Z. The cocktail party problem. Neural Computation, 2005, 17(9): 1875-1902. doi: 10.1162/0899766054322964
2. BEE M A, MICHEYL C. The cocktail party problem: what is it? How can it be solved? And why should animal behaviorists study it? Journal of Comparative Psychology, 2008, 122(3): 235. doi: 10.1037/0735-7036.122.3.235
3. HERSHEY J R, CHEN Z, LE ROUX J, et al. Deep clustering: discriminative embeddings for segmentation and separation[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA: IEEE Press, 2016: 31-35.
4. YU D, KOLBAEK M, TAN Z H, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA: IEEE Press, 2017: 241-245.
5. GABBAY A, EPHRAT A, HALPERIN T, et al. Seeing through noise: speaker separation and enhancement using visually-derived speech[EB/OL]. [2023-04-10]. https://arxiv.org/abs/1708.06767.
6. CHUNG J S, ZISSERMAN A. Out of time: automated lip sync in the wild[C]//Proceedings of ACCV 2016. Berlin, Germany: Springer, 2017: 251-263.
7. 黄雅婷, 石晶, 许家铭, 等. 鸡尾酒会问题与相关听觉模型的研究现状与展望. 自动化学报, 2019, 45(2): 234-251.
HUANG Y T, SHI J, XU J M, et al. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 234-251.
8. RAHNE T, BÖCKMANN M, VON SPECHT H, et al. Visual cues can modulate integration and segregation of objects in auditory scene analysis. Brain Research, 2007, 1144: 127-135. doi: 10.1016/j.brainres.2007.01.074
9. EPHRAT A, MOSSERI I, LANG O, et al. Looking to listen at the cocktail party. ACM Transactions on Graphics, 2018, 37(4): 1-11.
10. GABBAY A, EPHRAT A, HALPERIN T, et al. Seeing through noise: visually driven speaker separation and enhancement[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA: IEEE Press, 2018: 3051-3055.
11. HOU J C, WANG S S, LAI Y H, et al. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018, 2(2): 117-128. doi: 10.1109/TETCI.2017.2784878
12. ZHOU H, XU X D, LIN D H, et al. Sep-Stereo: visually guided stereophonic audio generation by associating source separation[C]//Proceedings of ECCV 2020. Berlin, Germany: Springer, 2020: 52-69.
13.
14. ZHANG P, XU J M, SHI J, et al. Audio-visual speech separation with adversarially disentangled visual representation[EB/OL]. [2023-04-10]. https://arxiv.org/pdf/2011.14334.
15.
16. 徐亮, 王晶, 杨文镜, 等. 基于Conv-TasNet的多特征融合音视频联合语音分离算法. 信号处理, 2021, 37(10): 1799-1805.
XU L, WANG J, YANG W J, et al. Multi-feature fusion audio-visual joint speech separation algorithm based on Conv-TasNet. Journal of Signal Processing, 2021, 37(10): 1799-1805.
17. GAO R H, GRAUMAN K. VisualVoice: audio-visual speech separation with cross-modal consistency[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D. C., USA: IEEE Press, 2021: 15490-15500.
18.
19. MONTESINOS J F, KADANDALE V S, HARO G. VoViT: low latency graph-based audio-visual voice separation transformer[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2022: 310-326.
20. MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 122-138.
21. WILLIAMSON D S, WANG Y X, WANG D L. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 483-492. doi: 10.1109/TASLP.2015.2512042
22. ZHANG S F, ZHU X Y, LEI Z, et al. S3FD: single shot scale-invariant face detector[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV). Washington D. C., USA: IEEE Press, 2017: 192-201.
23. MARTINEZ B, MA P C, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA: IEEE Press, 2020: 6319-6323.
24. CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL]. [2023-04-10]. https://arxiv.org/pdf/1406.1078.
25.
26. AFOURAS T, CHUNG J S, SENIOR A, et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 8717-8727. doi: 10.1109/TPAMI.2018.2889052
27. LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR—half-baked or well done?[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Washington D. C., USA: IEEE Press, 2019: 626-630.
28. THIEDE T, TREURNIET W C, BITTO R, et al. PEAQ—the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 2000, 48(1/2): 3-29.
29. TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136. doi: 10.1109/TASL.2011.2114881
30. XIONG J, ZHANG P, XIE L, et al. Audio-visual speech separation based on joint feature representation with cross-modal attention[EB/OL]. [2023-04-10]. https://arxiv.org/pdf/2011.14334.