
Computer Engineering ›› 2026, Vol. 52 ›› Issue (2): 413-422. doi: 10.19678/j.issn.1000-3428.0069611

• Large Language Models and Generative Artificial Intelligence •

Text-to-Image Generation Method Combining Transformer and DF-GAN

MA Jing1, CHE Jin1,2, SUN Moxian1   

  1. School of Electronic and Electrical Engineering, Ningxia University, Yinchuan 750000, Ningxia, China;
    2. Graduate School, Ningxia University, Yinchuan 750000, Ningxia, China
  • Received: 2024-03-18; Revised: 2024-08-29; Published: 2026-02-04

  • About the authors: MA Jing, female, master's student; her research focuses on text-to-image generation. CHE Jin (corresponding author), professor, Ph.D., E-mail: 1581005897@qq.com. SUN Moxian, master's student.
  • Funding: National Natural Science Foundation of China (62366042).

Abstract: In text-to-image generation tasks, the text encoder often fails to mine text information in depth, so the subsequently generated images are semantically inconsistent with their descriptions. To address this problem, a text-to-image generation method named DXC-GAN is proposed. The method introduces the XLNet (Xtra Long Network) pretrained model from the Transformer family to replace the original text encoder, capturing prior knowledge from large text corpora and mining contextual information in depth. A Convolutional Block Attention Module (CBAM) is added so that the generator attends more closely to the important information in an image, alleviating incomplete image details and incorrect spatial structure. In the discriminator, a contrastive loss is introduced and combined with the model's matching-aware gradient penalty and one-way output, pulling images with the same semantics closer together and pushing images with different semantics further apart, thereby strengthening the semantic consistency between the text and the generated images. Experimental results show that, compared with the DF-GAN model, the proposed model improves the Inception Score (IS) and Fréchet Inception Distance (FID) on the CUB dataset by 4.42% and 17.96%, respectively; on the Oxford-102 dataset, it reaches an IS of 3.97 and an FID of 37.82. Compared with DF-GAN, DXC-GAN effectively avoids deformities such as multiple heads or missing feet in generated bird images and markedly reduces quality defects such as missing petals in generated flower images. It also improves the alignment between text and images, significantly enhancing image completeness and overall generation quality.
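To make the two generic components named in the abstract concrete, the following minimal PyTorch sketch is offered as an illustration, not as the paper's released code: the CBAM block follows the standard channel-then-spatial attention design of Woo et al. (2018), and the contrastive term is a generic NT-Xent-style loss that treats matched text-image feature pairs as positives, which may differ in detail from the formulation in the full paper. All module, function, and parameter names here (ChannelAttention, SpatialAttention, CBAM, contrastive_loss, reduction, tau) are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelAttention(nn.Module):
        # Channel attention: a shared MLP scores global average- and max-pooled features.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x):
            b, c, _, _ = x.shape
            avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
            mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
            return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

    class SpatialAttention(nn.Module):
        # Spatial attention: a conv over channel-wise average and max maps.
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            avg = x.mean(dim=1, keepdim=True)
            mx, _ = x.max(dim=1, keepdim=True)
            return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

    class CBAM(nn.Module):
        # Channel attention followed by spatial attention (Woo et al., 2018).
        def __init__(self, channels, reduction=16, kernel_size=7):
            super().__init__()
            self.ca = ChannelAttention(channels, reduction)
            self.sa = SpatialAttention(kernel_size)

        def forward(self, x):
            return self.sa(self.ca(x))

    def contrastive_loss(img_feat, txt_feat, tau=0.1):
        # Generic NT-Xent-style loss: the i-th image and i-th caption form a
        # positive pair; all other pairings in the batch act as negatives.
        img = F.normalize(img_feat, dim=-1)
        txt = F.normalize(txt_feat, dim=-1)
        logits = img @ txt.t() / tau             # cosine-similarity matrix
        labels = torch.arange(img.size(0), device=img.device)
        return F.cross_entropy(logits, labels)

For example, CBAM(channels=256) applied to a feature map of shape (2, 256, 64, 64) returns a tensor of the same shape with activations reweighted along both the channel and spatial dimensions, which is why such a block can be dropped between convolutional stages of a generator without changing the surrounding architecture.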

Key words: Generative Adversarial Network (GAN), text-to-image generation, XLNet, CBAM, contrastive loss


