
Computer Engineering ›› 2024, Vol. 50 ›› Issue (5): 229-240. doi: 10.19678/j.issn.1000-3428.0067524

• Graphics and Image Processing •

Image-to-Image Translation Based on CLIP and Dual-Spatially Adaptive Normalization

LI Tianfang1, PU Yuanyuan1,2, ZHAO Zhengpeng1, XU Dan1, QIAN Wenhua1

  1. School of Information Science and Engineering, Yunnan University, Kunming 650504, Yunnan, China;
    2. University Key Laboratory of Internet of Things Technology and Application of Yunnan Province, Kunming 650500, Yunnan, China
  • Received: 2023-04-28  Revised: 2023-08-27  Published: 2024-05-14
  • Corresponding author: PU Yuanyuan, E-mail: yuanyuanpu@ynu.edu.cn
  • Funding: National Natural Science Foundation of China (61163019, 61271361, 61761046, U1802271, 61662087, 62061049); Yunnan Provincial Department of Science and Technology Projects (2014FA021, 2018FB100); Key Projects of the Applied Basic Research Program of the Yunnan Provincial Department of Science and Technology (202001BB050043, 2019FA044); Yunnan Provincial Major Science and Technology Special Program (202002AD080001); Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders of Yunnan Province (2019HB121).


Abstract: Most existing image-to-image translation methods rely on dataset domain labels, which often limits their applicability. Methods for truly unsupervised image-to-image translation remove this dependence on domain labels, but they commonly lose source-domain information. To address both problems, an unsupervised image-to-image translation model based on Contrastive Language-Image Pre-training (CLIP) is proposed. First, a CLIP similarity loss is introduced to constrain the style features of images, strengthening the model's ability to transfer image style information accurately without using dataset domain labels. Second, Adaptive Instance Normalization (AdaIN) is improved into a new Dual-Spatially Adaptive Instance Normalization (DSAdaIN) module, which adds learned, adaptive interaction within the network during the feature stylization stage to better retain source-domain content information. Finally, a discriminator contrastive loss is designed to balance the training and optimization of the adversarial network losses. Experimental results on multiple public datasets demonstrate that, compared with models such as StarGANv2 and StyleDIS, the proposed model transfers image style information accurately while retaining a degree of source-domain information, improving the quantitative evaluation metrics Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) scores by approximately 3.35 and 0.57×10², respectively, and achieving good image-to-image translation performance.
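The abstract describes the CLIP similarity loss only as "constraining style features". For orientation, below is a minimal sketch of one common way such a loss is realized with OpenAI's open-source clip package, assuming it maximizes the cosine similarity between the CLIP image embeddings of the translated output and a style reference; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the translation network is trained

def clip_similarity_loss(translated: torch.Tensor, style_ref: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between the CLIP embeddings of the translated image
    and the style reference; both are (N, 3, 224, 224) tensors already
    resized/normalized for CLIP."""
    with torch.no_grad():                          # no gradients through the reference branch
        ref_emb = clip_model.encode_image(style_ref)
    out_emb = clip_model.encode_image(translated)  # gradients flow back to the generator
    out_emb = F.normalize(out_emb.float(), dim=-1)
    ref_emb = F.normalize(ref_emb.float(), dim=-1)
    return (1.0 - (out_emb * ref_emb).sum(dim=-1)).mean()
```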
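The internals of DSAdaIN are not given in the abstract. For context, the sketch below shows standard AdaIN together with a hypothetical spatially adaptive variant in the spirit of the description, where scale and shift parameters come from both a global style code and the spatial content features and are blended by a learned gate; the DualModulation class and its design are illustrative assumptions, not the authors' module.

```python
import torch
import torch.nn as nn

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standard AdaIN: re-normalize content features with the style's channel statistics."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

class DualModulation(nn.Module):
    """Hypothetical DSAdaIN-style block (illustrative, not the paper's exact design):
    predicts a global scale/shift from the style code and a per-pixel scale/shift
    from the content features, then blends the two with a learned gate so that
    stylization adapts to spatial content instead of using global statistics only."""
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.style_affine = nn.Linear(style_dim, channels * 2)                   # global params from style
        self.content_affine = nn.Conv2d(channels, channels * 2, 3, padding=1)    # spatial params from content
        self.gate = nn.Conv2d(channels, 1, 3, padding=1)                         # learned blend weight

    def forward(self, content: torch.Tensor, style_code: torch.Tensor) -> torch.Tensor:
        normed = self.norm(content)
        g_scale, g_shift = self.style_affine(style_code).chunk(2, dim=1)
        g_scale, g_shift = g_scale[:, :, None, None], g_shift[:, :, None, None]
        s_scale, s_shift = self.content_affine(content).chunk(2, dim=1)
        alpha = torch.sigmoid(self.gate(content))           # (N, 1, H, W) blend weight
        scale = alpha * g_scale + (1 - alpha) * s_scale
        shift = alpha * g_shift + (1 - alpha) * s_shift
        return normed * (1 + scale) + shift
```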
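The discriminator contrastive loss is likewise only named. Below is an illustrative InfoNCE-style stand-in operating on discriminator features, pulling real-image features together while pushing generated-image features away; the paper's actual loss may be formulated differently.

```python
import torch
import torch.nn.functional as F

def discriminator_contrastive_loss(real_feat: torch.Tensor,
                                   fake_feat: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive term on discriminator features (batch size >= 2):
    each real feature uses another real feature in the batch as its positive and
    all generated features as negatives. Illustrative stand-in, not the exact loss."""
    real = F.normalize(real_feat, dim=-1)                        # (N, D)
    fake = F.normalize(fake_feat, dim=-1)                        # (N, D)
    pos = (real * real.roll(1, dims=0)).sum(-1, keepdim=True)    # (N, 1) one positive per anchor
    neg = real @ fake.t()                                        # (N, N) negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(real.size(0), dtype=torch.long, device=real.device)
    return F.cross_entropy(logits, labels)                       # the positive sits at index 0
```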

Key words: image-to-image translation, Generative Adversarial Network (GAN), Contrastive Language-Image Pre-training (CLIP) model, Adaptive Instance Normalization (AdaIN), contrastive learning
