
Computer Engineering ›› 2024, Vol. 50 ›› Issue (5): 229-240. doi: 10.19678/j.issn.1000-3428.0067524

• Graphics and Image Processing •

Image-to-Image Translation Based on CLIP and Dual-Spatially Adaptive Normalization

LI Tianfang1, PU Yuanyuan1,2, ZHAO Zhengpeng1, XU Dan1, QIAN Wenhua1

  1. School of Information Science and Engineering, Yunnan University, Kunming 650504, Yunnan, China;
    2. University Key Laboratory of Internet of Things Technology and Application of Yunnan Province, Kunming 650500, Yunnan, China
  • Received: 2023-04-28  Revised: 2023-08-27  Published: 2024-05-14
  • Corresponding author: PU Yuanyuan, E-mail: yuanyuanpu@ynu.edu.cn
  • Funding: National Natural Science Foundation of China (61163019, 61271361, 61761046, U1802271, 61662087, 62061049); Yunnan Provincial Department of Science and Technology Projects (2014FA021, 2018FB100); Key Projects of the Applied Basic Research Program of the Yunnan Provincial Department of Science and Technology (202001BB050043, 2019FA044); Yunnan Provincial Major Science and Technology Special Program (202002AD080001); Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders of Yunnan Province (2019HB121).


Abstract: Most existing image-to-image translation methods rely on dataset domain labels, which often limits their applicability. Methods for truly unsupervised image-to-image translation remove this dependence on domain labels, but they commonly lose source-domain information. To address both problems, an unsupervised image-to-image translation model based on Contrastive Language-Image Pre-training (CLIP) is proposed. First, a CLIP similarity loss is introduced to constrain the style features of images, strengthening the model's ability to transfer image style information accurately without using dataset domain labels. Second, Adaptive Instance Normalization (AdaIN) is improved into a new Dual-Spatially Adaptive Instance Normalization (DSAdaIN) module, which adds learned, adaptive interaction within the network during the feature stylization stage to better retain source-domain content information. Finally, a discriminator contrastive loss is designed to balance the training and optimization of the adversarial network losses. Experimental results on multiple public datasets demonstrate that, compared with models such as StarGANv2 and StyleDIS, the proposed model transfers image style information accurately while retaining a degree of source-domain information, improving the quantitative evaluation metrics Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) scores by approximately 3.35 and 0.57×10², respectively, and achieving good image-to-image translation performance.
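The abstract describes the CLIP similarity loss only as "constraining style features". For orientation, below is a minimal sketch of one common way such a loss is realized with OpenAI's open-source clip package, assuming it maximizes the cosine similarity between the CLIP image embeddings of the translated output and a style reference; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the translation network is trained

def clip_similarity_loss(translated: torch.Tensor, style_ref: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between the CLIP embeddings of the translated image
    and the style reference; both are (N, 3, 224, 224) tensors already
    resized/normalized for CLIP."""
    with torch.no_grad():                          # no gradients through the reference branch
        ref_emb = clip_model.encode_image(style_ref)
    out_emb = clip_model.encode_image(translated)  # gradients flow back to the generator
    out_emb = F.normalize(out_emb.float(), dim=-1)
    ref_emb = F.normalize(ref_emb.float(), dim=-1)
    return (1.0 - (out_emb * ref_emb).sum(dim=-1)).mean()
```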
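The internals of DSAdaIN are not given in the abstract. For context, the sketch below shows standard AdaIN together with a hypothetical spatially adaptive variant in the spirit of the description, where scale and shift parameters come from both a global style code and the spatial content features and are blended by a learned gate; the DualModulation class and its design are illustrative assumptions, not the authors' module.

```python
import torch
import torch.nn as nn

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standard AdaIN: re-normalize content features with the style's channel statistics."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

class DualModulation(nn.Module):
    """Hypothetical DSAdaIN-style block (illustrative, not the paper's exact design):
    predicts a global scale/shift from the style code and a per-pixel scale/shift
    from the content features, then blends the two with a learned gate so that
    stylization adapts to spatial content instead of using global statistics only."""
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.style_affine = nn.Linear(style_dim, channels * 2)                   # global params from style
        self.content_affine = nn.Conv2d(channels, channels * 2, 3, padding=1)    # spatial params from content
        self.gate = nn.Conv2d(channels, 1, 3, padding=1)                         # learned blend weight

    def forward(self, content: torch.Tensor, style_code: torch.Tensor) -> torch.Tensor:
        normed = self.norm(content)
        g_scale, g_shift = self.style_affine(style_code).chunk(2, dim=1)
        g_scale, g_shift = g_scale[:, :, None, None], g_shift[:, :, None, None]
        s_scale, s_shift = self.content_affine(content).chunk(2, dim=1)
        alpha = torch.sigmoid(self.gate(content))           # (N, 1, H, W) blend weight
        scale = alpha * g_scale + (1 - alpha) * s_scale
        shift = alpha * g_shift + (1 - alpha) * s_shift
        return normed * (1 + scale) + shift
```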
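The discriminator contrastive loss is likewise only named. Below is an illustrative InfoNCE-style stand-in operating on discriminator features, pulling real-image features together while pushing generated-image features away; the paper's actual loss may be formulated differently.

```python
import torch
import torch.nn.functional as F

def discriminator_contrastive_loss(real_feat: torch.Tensor,
                                   fake_feat: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive term on discriminator features (batch size >= 2):
    each real feature uses another real feature in the batch as its positive and
    all generated features as negatives. Illustrative stand-in, not the exact loss."""
    real = F.normalize(real_feat, dim=-1)                        # (N, D)
    fake = F.normalize(fake_feat, dim=-1)                        # (N, D)
    pos = (real * real.roll(1, dims=0)).sum(-1, keepdim=True)    # (N, 1) one positive per anchor
    neg = real @ fake.t()                                        # (N, N) negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(real.size(0), dtype=torch.long, device=real.device)
    return F.cross_entropy(logits, labels)                       # the positive sits at index 0
```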

Key words: image-to-image translation, Generative Adversarial Network (GAN), Contrastive Language-Image Pre-training (CLIP) model, Adaptive Instance Normalization (AdaIN), contrastive learning
