
Computer Engineering ›› 2026, Vol. 52 ›› Issue (2): 322-330. doi: 10.19678/j.issn.1000-3428.0069773

• Multimodality and Information Fusion •



  • About the authors: SUN Yuan (CCF student member), female, master's student; her research interests include object detection and re-identification. WANG Kangping, master's student. ZHAO Mingbo (CCF senior member, corresponding author), professor and doctoral supervisor. E-mail: mzhao4@dhu.edu.cn
  • Funding:
    General Program of the National Natural Science Foundation of China (No. 61971121).

Clothing Retrieval Based on Multiple Prompts and Contrastive Image-Text Learning

SUN Yuan, WANG Kangping, ZHAO Mingbo   

  1. College of Information Science and Technology, Donghua University, Shanghai 201620, China
  • Received:2024-04-22 Revised:2024-08-18 Published:2024-10-21



Abstract: With the continuous development of multimodal learning, the field of image retrieval faces new opportunities and challenges. Most existing clothing retrieval models are built on unimodal convolutional neural networks or Transformers, ignoring the rich textual information that accompanies images, so the features they can learn are relatively limited. This study proposes a clothing retrieval method based on multiple prompts and contrastive image-text learning. It introduces image and text multi-prompt learning to guide a multimodal large model, FashionCLIP, in learning multidimensional, high-level semantic, multimodal features of clothing. To improve the retrieval ability of the model and fully exploit its multimodal potential, the model is optimized in two stages. In the first stage, the image and text encoders are frozen and the text prompt is optimized with an image-text cross-entropy loss. In the second stage, the text prompt and text encoder are frozen, and the image prompt and image encoder are optimized with triplet, classification, and image-text cross-entropy losses. Both intra-domain and cross-domain retrieval experiments were conducted on the Taobao Live multimodal video product retrieval dataset WAB. The results show that, for intra-domain retrieval, the method improves mean Average Precision (mAP) by at least 6.1 percentage points and Rank-1 by at least 3.5 percentage points over traditional models; for cross-domain retrieval, it improves mAP by at least 8.4 percentage points and Rank-1 by at least 6.4 percentage points. These significant gains demonstrate the potential of contrastive image-text learning in the field of clothing retrieval.
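The loss functions named in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the FashionCLIP encoders and learnable prompts are stood in for by toy embedding arrays, and all names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_cross_entropy(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text cross-entropy (CLIP-style contrastive loss).
    Matching image-text pairs lie on the diagonal of the similarity matrix."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    n = logits.shape[0]
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -log_sm_rows[np.arange(n), np.arange(n)].mean()  # image -> text
    loss_t2i = -log_sm_cols[np.arange(n), np.arange(n)].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss on Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))                   # toy image embeddings
txt = img + 0.1 * rng.standard_normal((4, 8))       # roughly aligned text embeddings

# Stage 1: encoders frozen; only the learnable text prompt would receive
# gradients from the contrastive loss. Stage 2 freezes the text side and
# adds triplet and classification losses to update the image prompt and
# image encoder.
print(clip_cross_entropy(img, txt))
print(triplet_loss(img[:2], txt[:2], txt[2:]))
```

In the actual method these losses would drive gradient updates through whichever parameters are unfrozen in each stage; here they are only evaluated on fixed arrays to show their shape.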

Key words: clothing retrieval, contrastive image-text learning, pre-trained model, cross-modal retrieval, prompt learning
