
Computer Engineering ›› 2025, Vol. 51 ›› Issue (8): 262-269. doi: 10.19678/j.issn.1000-3428.0068793

• Graphics and Image Processing •


Research on a Training Method for Diffusion Models Based on Neighborhood Attention

JI Lixia1,2, ZHOU Hongxin1, XIAO Shijie1, CHEN Yunfeng3, ZHANG Han1,*()   

1. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, Henan, China
    2. College of Software Engineering, Sichuan University, Chengdu 610065, Sichuan, China
    3. Henan Cocyber Information and Technology Co., Ltd., Zhengzhou 450000, Henan, China
  • Received: 2023-11-08  Revised: 2024-03-19  Online: 2025-08-15  Published: 2025-08-15
  • Contact: ZHANG Han
  • Supported by: Major Science and Technology Special Project of Henan Province (231100210200); Key Research, Development and Promotion Special Project of Henan Province (232102210128)


Abstract:

Generative diffusion models learn a data-generation process: starting from input Gaussian noise, they progressively denoise it into new data samples, and they are therefore widely applied in image generation. Recent work has shown that diffusion-model performance depends on network capacity rather than on the inductive bias of the U-Net backbone, so a Transformer can be adopted as the backbone, allowing diffusion models to inherit the latest advances in Transformer research. However, introducing the Transformer increases model size and slows training. To address the slow training and poor image detail of diffusion models with Transformer backbones, this paper proposes a diffusion model based on a neighborhood attention architecture. The model incorporates a Transformer backbone with neighborhood attention and exploits the sparse global attention pattern of the neighborhood attention mechanism, which exponentially expands the model's receptive field over the image and lets it attend to global information at low cost. Through progressive dilation in the attention expansion layers, more visual information is captured during training, so the generated images exhibit better global consistency. Experimental results demonstrate that the proposed model produces images with better global detail and outperforms current State-Of-The-Art (SOTA) models.
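To make the mechanism concrete, below is a minimal PyTorch sketch of a single-head dilated neighborhood attention layer in the spirit of the architecture described above. The class name NeighborhoodAttention2d, the unfold-based neighborhood gathering, the single-head simplification, and the hyperparameters (kernel_size, dilation) are illustrative assumptions, not the authors' implementation: each query position attends only to a k×k neighborhood of keys and values, and doubling the dilation from layer to layer grows the receptive field exponentially while each layer's cost stays fixed.

```python
# Minimal sketch (an assumption, not the paper's code): single-head dilated
# neighborhood attention over a 2-D feature map, gathered with F.unfold.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodAttention2d(nn.Module):
    def __init__(self, dim, kernel_size=7, dilation=1):
        super().__init__()
        self.scale = dim ** -0.5               # dot-product scaling
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.qkv = nn.Conv2d(dim, dim * 3, 1)  # pointwise q/k/v projection
        self.proj = nn.Conv2d(dim, dim, 1)     # output projection

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        pad = self.dilation * (self.kernel_size - 1) // 2

        # Collect each pixel's k*k (dilated) neighborhood of keys/values.
        def neighbors(t):
            cols = F.unfold(t, self.kernel_size,
                            dilation=self.dilation, padding=pad)
            return cols.view(B, C, self.kernel_size ** 2, H * W)

        k_n, v_n = neighbors(k), neighbors(v)  # (B, C, K^2, H*W)
        q = q.view(B, C, 1, H * W)
        attn = (q * k_n).sum(dim=1, keepdim=True) * self.scale
        attn = attn.softmax(dim=2)             # softmax over the K^2 neighbors
        out = (attn * v_n).sum(dim=2).view(B, C, H, W)
        return self.proj(out)

# Doubling the dilation per layer (1, 2, 4, 8) expands the receptive field
# exponentially while each layer still costs O(H * W * K^2).
blocks = nn.Sequential(*[NeighborhoodAttention2d(64, kernel_size=7,
                                                 dilation=2 ** i)
                         for i in range(4)])
x = torch.randn(1, 64, 32, 32)
y = blocks(x)                                  # (1, 64, 32, 32)
```

In this sketch the per-layer cost is linear in the number of pixels (each query scores only K² neighbors), which is how the sparse attention pattern keeps the global receptive field affordable compared with full self-attention.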

Key words: diffusion model, image generation, generative model, neighborhood attention, Transformer backbone network