
Computer Engineering ›› 2025, Vol. 51 ›› Issue (8): 262-269. doi: 10.19678/j.issn.1000-3428.0068793

• Graphics and Image Processing •


Research on a Training Method for Diffusion Models Based on Neighborhood Attention

JI Lixia1,2, ZHOU Hongxin1, XIAO Shijie1, CHEN Yunfeng3, ZHANG Han1,*()   

1. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, Henan, China
    2. College of Software Engineering, Sichuan University, Chengdu 610065, Sichuan, China
    3. Henan Cocyber Information and Technology Co., Ltd., Zhengzhou 450000, Henan, China
  • Received: 2023-11-08  Revised: 2024-03-19  Online: 2025-08-15  Published: 2025-08-15
  • Contact: ZHANG Han
  • Supported by: Major Science and Technology Special Project of Henan Province (231100210200); Key Research, Development and Promotion Special Project of Henan Province (232102210128)


Abstract:

Generative diffusion models learn a data-generation process: starting from input Gaussian noise, they progressively denoise it into new data samples, and they are therefore widely applied in image generation. Recent work has shown that diffusion-model performance depends on network capacity rather than on the inductive bias of the U-Net backbone, so a Transformer can be adopted as the backbone, allowing diffusion models to inherit the latest advances in Transformer research. However, introducing the Transformer increases model size and slows training. To address the slow training and poor image detail of diffusion models with Transformer backbones, this paper proposes a diffusion model based on a neighborhood attention architecture. The model incorporates a Transformer backbone with neighborhood attention and exploits the sparse global attention pattern of the neighborhood attention mechanism, which exponentially expands the model's receptive field over the image and lets it attend to global information at low cost. Through progressive dilation in the attention expansion layers, more visual information is captured during training, so the generated images exhibit better global consistency. Experimental results demonstrate that the proposed model produces images with better global detail and outperforms current State-Of-The-Art (SOTA) models.
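To make the mechanism concrete, below is a minimal PyTorch sketch of a single-head dilated neighborhood attention layer in the spirit of the architecture described above. The class name NeighborhoodAttention2d, the unfold-based neighborhood gathering, the single-head simplification, and the hyperparameters (kernel_size, dilation) are illustrative assumptions, not the authors' implementation: each query position attends only to a k×k neighborhood of keys and values, and doubling the dilation from layer to layer grows the receptive field exponentially while each layer's cost stays fixed.

```python
# Minimal sketch (an assumption, not the paper's code): single-head dilated
# neighborhood attention over a 2-D feature map, gathered with F.unfold.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodAttention2d(nn.Module):
    def __init__(self, dim, kernel_size=7, dilation=1):
        super().__init__()
        self.scale = dim ** -0.5               # dot-product scaling
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.qkv = nn.Conv2d(dim, dim * 3, 1)  # pointwise q/k/v projection
        self.proj = nn.Conv2d(dim, dim, 1)     # output projection

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        pad = self.dilation * (self.kernel_size - 1) // 2

        # Collect each pixel's k*k (dilated) neighborhood of keys/values.
        def neighbors(t):
            cols = F.unfold(t, self.kernel_size,
                            dilation=self.dilation, padding=pad)
            return cols.view(B, C, self.kernel_size ** 2, H * W)

        k_n, v_n = neighbors(k), neighbors(v)  # (B, C, K^2, H*W)
        q = q.view(B, C, 1, H * W)
        attn = (q * k_n).sum(dim=1, keepdim=True) * self.scale
        attn = attn.softmax(dim=2)             # softmax over the K^2 neighbors
        out = (attn * v_n).sum(dim=2).view(B, C, H, W)
        return self.proj(out)

# Doubling the dilation per layer (1, 2, 4, 8) expands the receptive field
# exponentially while each layer still costs O(H * W * K^2).
blocks = nn.Sequential(*[NeighborhoodAttention2d(64, kernel_size=7,
                                                 dilation=2 ** i)
                         for i in range(4)])
x = torch.randn(1, 64, 32, 32)
y = blocks(x)                                  # (1, 64, 32, 32)
```

In this sketch the per-layer cost is linear in the number of pixels (each query scores only K² neighbors), which is how the sparse attention pattern keeps the global receptive field affordable compared with full self-attention.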

Key words: diffusion model, image generation, generative model, neighborhood attention, Transformer backbone network