
Computer Engineering, 2022, Vol. 48, Issue (11): 62-68, 76. doi: 10.19678/j.issn.1000-3428.0063185

• Artificial Intelligence and Pattern Recognition •

Multi-Modal Fine-Grained Retrieval Based on Modal Specific and Modal Shared Feature Information

LI Pei1,2, CHEN Qiaosong1,2, CHEN Pengchang1,2, DENG Xin1,2, WANG Jin1,2, PIAO Changhao1,2   

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
  2. Chongqing Key Laboratory of Data Engineering and Visual Computing, Chongqing 400065, China
  • Received: 2021-11-09  Revised: 2021-12-29  Published: 2022-01-12

  • About the authors: LI Pei (born 1996), male, master's student; his research interests include multi-modal retrieval and machine vision. CHEN Qiaosong, associate professor, Ph.D.; CHEN Pengchang, master's degree; DENG Xin, associate professor, Ph.D.; WANG Jin, professor, Ph.D.; PIAO Changhao, professor, postdoctoral researcher.
  • Funding:
    National Natural Science Foundation of China (61806033); Western Project of the National Social Science Foundation of China (18XGL013).

Abstract: The goal of cross-modal retrieval is that, given any sample as a query, the system retrieves and returns the samples of every modality related to that query. Multi-modal fine-grained retrieval further requires that the number of modalities be greater than two and that the classification granularity be fine-grained sub-categories. This paper introduces the concepts of modal-specific features and modal-shared features and proposes the MS2Net framework. Branch networks and a backbone network extract the modal-specific and modal-shared features of data from different modalities, and the two kinds of features are then fully fused by a Multi-Modal Feature fusion Module (MMFM). By exploiting both the information unique to each modality and the commonality and relationships among different modalities, the fused representation substantially enriches the semantic information carried by the high-dimensional embedding. In addition, for the multi-modal fine-grained retrieval scenario, this paper extends center loss to a multi-center loss: inner-class centers gather samples of the same category and the same modality, and aggregating these inner-class centers in turn gathers samples of the same category across different modalities. This reduces the heterogeneity gap and the semantic gap between samples and clearly strengthens the model's ability to cluster high-dimensional embeddings. Experiments on one-to-one and one-to-multi-modality retrieval on the public FG-Xmedia dataset show that, compared with the FGCrossNet method, MS2Net improves the mAP metric by 65% and 48%, respectively.
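The multi-center loss described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `multi_center_loss`, the equal weighting of the two terms, and the dictionary of per-(class, modality) centers are all assumptions made for illustration; the idea shown is only the two-stage aggregation the abstract describes (samples to inner-class centers, then inner-class centers of one class toward each other).

```python
import numpy as np

def multi_center_loss(feats, labels, modalities, centers):
    """Illustrative multi-center loss (hypothetical implementation).

    feats      : (N, D) array of embeddings
    labels     : length-N list of class labels
    modalities : length-N list of modality ids
    centers    : dict mapping (class, modality) -> (D,) inner-class center
    """
    # Term 1: pull each sample toward the inner-class center of its
    # own (class, modality) pair, gathering same-class, same-modality samples.
    intra = np.mean([np.sum((f - centers[(c, m)]) ** 2)
                     for f, c, m in zip(feats, labels, modalities)])

    # Term 2: pull the inner-class centers of each class toward their mean,
    # which indirectly gathers same-class samples from different modalities.
    inter = 0.0
    classes = sorted(set(labels))
    mods = sorted(set(modalities))
    for c in classes:
        cls_centers = [centers[(c, m)] for m in mods]
        mean_center = np.mean(cls_centers, axis=0)
        inter += np.mean([np.sum((v - mean_center) ** 2) for v in cls_centers])
    inter /= len(classes)

    return intra + inter
```

In a training loop the centers would themselves be learnable parameters updated by gradient descent, as with the original center loss; here they are fixed inputs to keep the sketch self-contained.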

Key words: information retrieval, multi-modal retrieval, fine-grained retrieval, multi-modal representation learning, deep learning


