
Computer Engineering, 2024, Vol. 50, Issue (11): 152-162. doi: 10.19678/j.issn.1000-3428.0069468

• Artificial Intelligence and Pattern Recognition •


Multi-modal Semantic Alignment Based on Extended Image-Text Contrastive Learning

AN Guocheng1, JIANG Bo2,*, WANG Xiaolong1, DAI Jun1

  1. Service Operations Department of Shanghai Huaxun Network System Co., Ltd., Shanghai 201103, China
    2. The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China
  • Received: 2024-03-04 Online: 2024-11-15 Published: 2024-08-16
  • Contact: JIANG Bo
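  • Supported by: National Key Research and Development Program of China under the 14th Five-Year Plan (2023YFC3006700)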


Abstract:

Contrastive Language-Image Pre-training (CLIP) enables a dual-stream model to learn unified high-level semantic representations from large-scale image-text data. However, CLIP enforces only coarse-grained semantic alignment between the image and text modalities, and the semantic representations within each single modality still require improvement. To help the network learn better unified latent semantic representations, this study proposes a multi-modal semantic alignment method based on extended image-text contrastive learning. First, the pre-trained CLIP model is fine-tuned to optimize the semantic representations for a specified dataset, and a bidirectional matching strategy is designed to construct an image-text sample-matching topology graph. Contrastive learning is then extended to the more strongly associated image-text samples in the topology graph: coarse-grained semantic alignment is performed across the image and text modalities while fine-grained adjustment is carried out within each modality, and learnable parameters are introduced to weight the contrastive loss of each modality. Experimental results on multiple datasets demonstrate that the method improves the semantic representations within each modality without degrading multi-modal semantic alignment, and achieves better or comparable performance on downstream tasks such as classification and retrieval.
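The abstract describes the objective only at a high level. The PyTorch sketch below illustrates one plausible reading of it, under stated assumptions: a bidirectional (mutual top-k) matching rule stands in for the sample-matching topology graph, the contrastive loss is extended to the extra positives that the topology induces both across and within modalities, and the three loss terms are combined with learnable weights. The matching rule, the intra-modal positive definition, and the weighting scheme are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of extended image-text contrastive learning.
# Assumptions (not from the paper): mutual top-k matching builds the topology,
# and losses are combined with uncertainty-style learnable weights.
import torch
import torch.nn.functional as F


def mutual_topk_mask(sim: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Positive-pair 'topology' mask over a (B, B) image-text similarity
    matrix: (i, j) is positive when j is in image i's top-k AND i is in
    text j's top-k (bidirectional matching). Assumes batch size B >= k."""
    row = torch.zeros_like(sim, dtype=torch.bool)
    row.scatter_(1, sim.topk(k, dim=1).indices, True)   # per-image top-k texts
    col = torch.zeros_like(sim, dtype=torch.bool)
    col.scatter_(0, sim.topk(k, dim=0).indices, True)   # per-text top-k images
    mask = row & col
    mask.fill_diagonal_(True)  # the originally paired caption stays positive
    return mask


def multi_positive_infonce(sim: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """InfoNCE extended to several positives per anchor, both directions."""
    loss_ab = -(F.log_softmax(sim, dim=1) * pos).sum(1) / pos.sum(1).clamp(min=1)
    loss_ba = -(F.log_softmax(sim.t(), dim=1) * pos.t()).sum(1) / pos.t().sum(1).clamp(min=1)
    return 0.5 * (loss_ab.mean() + loss_ba.mean())


class ExtendedContrastiveLoss(torch.nn.Module):
    """Cross-modal (coarse-grained) + intra-modal (fine-grained) contrastive
    terms, combined with learnable per-term weights."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.t = temperature
        self.log_w = torch.nn.Parameter(torch.zeros(3))  # cross, img-img, txt-txt

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
        sim_it = img @ txt.t() / self.t
        pos = mutual_topk_mask(sim_it.detach())  # image-text topology graph
        posf = pos.float()
        # Two images (texts) count as intra-modal positives when they share a
        # matched sample in the other modality; self-pairs kept for brevity.
        losses = torch.stack([
            multi_positive_infonce(sim_it, pos),                                  # cross-modal
            multi_positive_infonce(img @ img.t() / self.t, (posf @ posf.t()) > 0),  # image-image
            multi_positive_infonce(txt @ txt.t() / self.t, (posf.t() @ posf) > 0),  # text-text
        ])
        # Learnable weighting: sum_k exp(-w_k) * L_k + w_k.
        return (torch.exp(-self.log_w) * losses + self.log_w).sum()
```

As a quick smoke test, `ExtendedContrastiveLoss()(torch.randn(8, 512), torch.randn(8, 512)).backward()` runs end to end. The `exp(-w) * L + w` form is one common way to make loss weights learnable (it keeps the effective weights positive and penalizes collapsing any term to zero weight); the paper may use a different parameterization.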

Key words: multi-modal learning, semantic representation, contrastive learning, image-text matching, image classification