Text-to-Image Generation Method Combining Transformer and DF-GAN

doi:10.19678/j.issn.1000-3428.0069611

Abstract

In text-to-image generation, the text encoder often fails to mine text information deeply, which leaves the generated images semantically inconsistent with the input text. To address this problem, a text-to-image generation method named DXC-GAN is proposed. The method replaces the original text encoder with the pretrained XLNet (Xtra Long Network) model from the Transformer family, which captures prior knowledge from a large body of text and mines contextual information in depth. A Convolutional Block Attention Module (CBAM) is added so that the generator attends more closely to the important information in images, thereby resolving incomplete image details and incorrect spatial structure. In the discriminator, a contrastive loss is introduced and combined with the model's matching-aware gradient penalty and one-way output, pulling images with the same semantics closer together and pushing images with different semantics further apart, which strengthens the semantic consistency between the text and the generated images. Experimental results show that, compared with DF-GAN, DXC-GAN improves the Inception Score (IS) and Fréchet Inception Distance (FID) on the CUB dataset by 4.42% and 17.96%, respectively. On the Oxford-102 dataset, it achieves an IS of 3.97 and an FID of 37.82. Compared with DF-GAN, DXC-GAN effectively avoids deformities such as multiple heads and missing feet in generated bird images and markedly reduces quality defects such as incomplete petals in generated flower images. It also strengthens text-image alignment and significantly improves the completeness and overall quality of the generated images.

Key words: Generative Adversarial Network (GAN), text-to-image generation, XLNet, CBAM, contrastive loss
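
The contrastive objective described in the abstract (same-semantics pairs pulled together, different-semantics pairs pushed apart) can be illustrated with a minimal sketch. It assumes an InfoNCE-style formulation with cosine similarity and a temperature parameter; the abstract does not give the exact loss used in DXC-GAN, so the function names, the `temperature` value, and the toy feature vectors below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_feats, text_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    Image i is treated as the positive for text i; every other text in
    the batch acts as a negative.
    """
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_feats[i], t) / temperature for t in text_feats]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching text.
        total += log_denom - logits[i]
    return total / n


# Hypothetical 2-D features: two texts and two candidate image batches.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each image close to its own text
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each image close to the wrong text

loss_aligned = contrastive_loss(aligned, texts)
loss_swapped = contrastive_loss(swapped, texts)
print(loss_aligned, loss_swapped)
```

In this toy setting the aligned batch yields a much smaller loss than the swapped batch, which is the behaviour a discriminator-side contrastive term is meant to reward.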

MA Jing, CHE Jin, SUN Moxian. Text-to-Image Generation Method Combining Transformer and DF-GAN[J]. Computer Engineering, 2026, 52(2): 413-422.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069611

https://www.ecice06.com/EN/Y2026/V52/I2/413

Figures/Tables 12

Fig.1 Network framework diagram

Fig.2 XLNet text encoder flowchart

Fig.3 Overall structure of CBAM

Fig.4 Channel attention module structure

Fig.5 Spatial attention module structure

Fig.6 Schematic diagram of the contrastive loss for related text-image pairs

Fig.7 Schematic diagram of the contrastive loss for unrelated text-image pairs

Fig.8 Comparison of images generated from the CUB dataset

Fig.9 Comparison of images generated from the Oxford-102 dataset
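
The CBAM structure named in Fig.3 to Fig.5 (channel attention followed by spatial attention) can be sketched as follows. This is a minimal NumPy illustration of the standard CBAM design, not the authors' implementation: the reduction ratio `r`, the 7×7 kernel size, the random weights, and all function names are assumptions, and where the module sits inside the DXC-GAN generator is not stated on this page.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W). Returns one weight in (0, 1) per channel."""
    avg = x.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        # Shared two-layer MLP with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(x, kernel):
    """x: (C, H, W), kernel: (2, k, k). Returns an (H, W) map in (0, 1)."""
    # Channel-wise average and max maps, stacked as a 2-channel input.
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = x.shape[1:]
    out = np.empty((h, w))
    # Naive "same" convolution, written out for clarity rather than speed.
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]


# Hypothetical feature map and randomly initialised weights.
rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1

refined = cbam(feat, w1, w2, kernel)
print(refined.shape)
```

Because both attention maps pass through a sigmoid, the module only rescales the input features: the output keeps the input shape, and no element grows in magnitude.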
