[1] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al.An image is worth 16×16 words:transformers for image recognition at scale[EB/OL].[2021-10-08].https://arxiv.org/abs/2010.11929.
[2] TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al.MLP-mixer:an all-MLP architecture for vision[EB/OL].[2021-10-08].https://arxiv.org/abs/2105.01601.
[3] ZHANG Y, WALLACE B.A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification[EB/OL].[2021-10-08].https://arxiv.org/abs/1510.03820.
[4] DEVLIN J, CHANG M W, LEE K, et al.BERT:pre-training of deep bidirectional transformers for language understanding[EB/OL].[2021-10-08].https://arxiv.org/abs/1810.04805.
[5] CHIU C C, SAINATH T N, WU Y H, et al.State-of-the-art speech recognition with sequence-to-sequence models[C]//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing.Washington D.C., USA:IEEE Press, 2018:4774-4778.
[6] GULATI A, QIN J, CHIU C C, et al.Conformer:convolution-augmented transformer for speech recognition[EB/OL].[2021-10-08].https://arxiv.org/abs/2005.08100.
[7] XU R C, NIU L, ZHANG J F, et al.A proposal-based approach for activity image-to-video retrieval[J].Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7):12524-12531.
[8] XU X, SONG J K, LU H M, et al.Modal-adversarial semantic learning network for extendable cross-modal retrieval[C]//Proceedings of 2018 ACM on International Conference on Multimedia Retrieval.New York, USA:ACM Press, 2018:46-54.
[9] JIANG Q Y, LI W J.Deep cross-modal hashing[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:3270-3278.
[10] CHEN Y C, LI L J, YU L C, et al.UNITER:UNiversal image-TExt representation learning[C]//Proceedings of ECCVʼ20.Berlin, Germany:Springer, 2020:104-120.
[11] ZHEN L L, HU P, WANG X, et al.Deep supervised cross-modal retrieval[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:10386-10395.
[12] WEN Y D, ZHANG K P, LI Z F, et al.A discriminative feature learning approach for deep face recognition[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2016:499-515.
[13] SCHROFF F, KALENICHENKO D, PHILBIN J.FaceNet:a unified embedding for face recognition and clustering[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:815-823.
[14] GU J X, CAI J F, JOTY S, et al.Look, imagine and match:improving textual-visual cross-modal retrieval with generative models[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:7181-7189.
[15] ZHANG Q, LEI Z, ZHANG Z X, et al.Context-aware attention network for image-text retrieval[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:3533-3542.
[16] WANG B K, YANG Y, XU X, et al.Adversarial cross-modal retrieval[C]//Proceedings of the 25th ACM International Conference on Multimedia.New York, USA:ACM Press, 2017:154-162.
[17] HE X T, PENG Y X.Fine-grained visual-textual representation learning[J].IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(2):520-531.
[18] HE X T, PENG Y X, XIE L.A new benchmark and approach for fine-grained cross-media retrieval[C]//Proceedings of the 27th ACM International Conference on Multimedia.New York, USA:ACM Press, 2019:1740-1748.
[19] LU Y, WU Y, LIU B, et al.Cross-modality person re-identification with shared-specific feature transfer[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:13376-13386.
[20] WANG H, SAHOO D, LIU C H, et al.Cross-modal food retrieval:learning a joint embedding of food images and recipes with semantic consistency and attention mechanism[J].IEEE Transactions on Multimedia, 2022, 24(3):2515-2525.
[21] UDANDARAO V, MAITI A, SRIVATSAV D, et al.COBRA:contrastive bi-modal representation algorithm[EB/OL].[2021-10-08].https://arxiv.org/abs/2005.03687.
[22] NARAYANA P, PEDNEKAR A, KRISHNAMOORTHY A, et al.HUSE:hierarchical universal semantic embeddings[EB/OL].[2021-10-08].https://arxiv.org/abs/1911.05978.
[23] XIONG C Y, ZHANG D Y, LIU T, et al.Voice-face cross-modal matching and retrieval:a benchmark[EB/OL].[2021-10-08].https://arxiv.org/abs/1911.09338.
[24] TAN M X, LE Q V.MixConv:mixed depthwise convolutional kernels[EB/OL].[2021-10-08].https://arxiv.org/abs/1907.09595.
[25] XU K, BA J, KIROS R, et al.Show, attend and tell:neural image caption generation with visual attention[C]//Proceedings of International Conference on Machine Learning.New York, USA:ACM Press, 2015:2048-2057.
[27] ZHU C, TAN X, ZHOU F, et al.Fine-grained video categorization with redundancy reduction attention[C]//Proceedings of ECCVʼ18.Berlin, Germany:Springer, 2018:139-155.
[28] HUANG X, PENG Y X, YUAN M K.MHTN:modal-adversarial hybrid transfer network for cross-modal retrieval[J].IEEE Transactions on Cybernetics, 2020, 50(3):1047-1059.
[29] ZHAI X H, PENG Y X, XIAO J G.Learning cross-media joint representation with sparse and semi-supervised regularization[J].IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6):965-978.
[30] MANDAL D, CHAUDHURY K N, BISWAS S.Generalized semantic preserving hashing for n-label cross-modal retrieval[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:2633-2641.
[31] PENG Y, HUANG X, QI J.Cross-media shared representation by hierarchical learning with multiple deep networks[C]//Proceedings of IJCAIʼ16.New York, USA:AAAI Press, 2016:3846-3853.