Computer Engineering ›› 2026, Vol. 52 ›› Issue (6): 179-188. doi: 10.19678/j.issn.1000-3428.0070181

• Computer Vision and Graphics/Image Processing •

基于自注意力机制和动态掩膜机制的文物图像修复方法

胡康源, 郭涛*, 穆楠

  1. 四川师范大学计算机科学学院, 四川 成都 610000
  • 收稿日期:2024-07-26 修回日期:2024-09-18 出版日期:2026-06-15 发布日期:2024-12-20
  • 通讯作者: 郭涛
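  • About the authors:

    HU Kangyuan, male, master's student; main research interest: image inpainting

    GUO Tao (corresponding author), professor

    MU Nan, associate professor

  • Funding:
    Young Scientists Fund of the National Natural Science Foundation of China (11905153)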

Image Inpainting Method for Cultural Relics Based on Self-Attention Mechanism and Dynamic Masking Mechanism

HU Kangyuan, GUO Tao*, MU Nan

  1. School of Computer Science, Sichuan Normal University, Chengdu 610000, Sichuan, China
  • Received: 2024-07-26 Revised: 2024-09-18 Published: 2026-06-15 Online: 2024-12-20
  • Corresponding author: GUO Tao

摘要:

卷积网络在文物修复中由于卷积核的局部感受野对于全局上下文和复杂结构的理解较弱, 又因卷积操作的平移不变性对文物表面复杂的几何形态处理不充分, 在进行文物图像修复时容易出现无关结构和伪影等问题。具有自注意力机制的Transformer模型在处理文物图像的细节和局部特征时, 对特定区域的细节关注不足, 难以获取足够的深层特征, 从而影响修复的精度和细腻度, 对图像的远距离语义获取不充分, 导致修复图像的直观视觉性不足。提出了一种基于SwinTransformer的文物图像修复模型DMSWT。该模型通过对网络中的自注意力模块进行多项改进以优化网络结构。首先删除层归一化, 且用残差连接替换全连接层, 提高网络的深层特征提取能力; 其次引入动态掩膜机制, 缓解修复大规模缺失图像时默认采样造成的有效像素减少的问题; 最后改进损失函数, 注重直观性感受, 提高修复图像的直观视觉性。在不同场景下修复的实验结果表明, DMSWT模型能够学习到更多的结构先验信息, 并生成符合现实直觉的修复图像, 且在定量评估下指标有明显提高。

关键词: 文物图像修复, 深度学习, 自注意力机制, 卷积网络, 掩膜机制

Abstract:

When convolutional networks are applied to the inpainting of cultural relic images, the local receptive field of the convolution kernel limits their grasp of global context and complex structures. In addition, the translation invariance of convolution leaves the intricate surface geometry of relics inadequately handled, so convolution-based inpainting is prone to irrelevant structures and artifacts. Transformer models with self-attention, for their part, pay insufficient attention to the details of specific regions when processing the fine and local features of relic images. This makes it difficult to extract sufficiently deep features, which reduces the precision and fineness of the result. Long-range semantics are also not fully captured, which weakens the visual quality of the inpainted images. This paper proposes a relic image inpainting model based on the SwinTransformer, called Dynamic Mask on SwinTransformer (DMSWT). The model optimizes the network structure through several changes to its self-attention module. First, layer normalization is removed and the fully connected layers are replaced with residual connections, which strengthens deep feature extraction. Second, a dynamic mask mechanism is introduced to mitigate the loss of valid pixels caused by default sampling when images with large missing regions are inpainted. Finally, the loss function is revised to emphasize perceptual realism, which improves the visual quality of the inpainted images. Experimental results across different scenarios show that DMSWT learns more structural prior information and generates inpainted images that are consistent with real-world intuition, and that it clearly improves the quantitative evaluation metrics.

Key words: cultural relic image inpainting, deep learning, self-attention mechanism, convolutional network, masking mechanism