
Computer Engineering


Image Inpainting Method for Cultural Relics Based on Self-attention Mechanism and Dynamic Masking Mechanism


  • Published: 2024-12-20


Abstract: When applied to cultural relic image inpainting, convolutional networks face two challenges: the convolution kernel's limited receptive field yields a weak grasp of global context and complex structures, and the translation invariance of the convolution operation handles the intricate geometry of relic surfaces poorly, making convolution-based inpainting prone to irrelevant structures and artifacts. Transformer models with self-attention, when processing the details and local features of relic images, often pay insufficient attention to specific regions and thus struggle to capture the deep features needed for precise, finely detailed restoration; they also capture long-range semantics insufficiently, resulting in suboptimal visual quality in the inpainted images. This paper proposes a relic image inpainting model based on the Swin Transformer, called Dynamic Mask on Swin Transformer (DMSWT). The model introduces several improvements to the self-attention module to optimize the network structure. First, layer normalization is removed and the fully connected layers are replaced with residual connections, strengthening the network's deep feature extraction. Second, a dynamic mask mechanism is introduced to mitigate the loss of effective pixels caused by default sampling when inpainting images with large missing regions. Finally, the loss function is improved with an emphasis on perceptual realism, enhancing the visual quality of the restored images. Experimental results in different scenarios show that DMSWT learns more structural prior information and generates inpainted images consistent with real-world intuition; quantitative evaluations likewise show clear improvements in performance metrics.
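The two architectural changes the abstract names for the self-attention module can be illustrated with a minimal NumPy sketch: attention over one window of flattened pixels, with invalid (missing-region) pixels excluded from the softmax as a stand-in for the dynamic masking, no layer normalization, and a plain residual connection in place of the feed-forward block. The function name, shapes, and masking scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_window_attention(x, w_q, w_k, w_v, valid_mask):
    """Single-head self-attention over one window of flattened pixels.

    x:          (N, C) window features
    valid_mask: (N,) bool; False marks pixels inside the missing region,
                which are excluded from the attention softmax (a stand-in
                for the paper's dynamic masking of invalid pixels).

    Following the abstract, there is no layer normalization, and the
    fully connected (feed-forward) block is replaced by a residual
    connection. Returns the attended features and attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Large negative score drives the softmax weight of invalid keys to ~0.
    scores = np.where(valid_mask[None, :], scores, -1e9)
    attn = softmax(scores, axis=-1)
    return x + attn @ v, attn  # residual connection instead of MLP/LayerNorm
```

In a real Swin-style model this would run per window with multiple heads and learned projections; the sketch only shows how masked keys receive (near-)zero attention weight while valid pixels still attend to each other.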

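The abstract states that the loss function is improved to emphasize perceptual realism but does not spell out its form here. As a hedged illustration only, the sketch below shows a mask-weighted reconstruction term of the kind commonly paired with perceptual objectives in inpainting work: L1 error weighted more heavily inside the missing region. The function name and the `lam_hole` weight are assumptions, not values from the paper.

```python
import numpy as np

def inpainting_loss(pred, target, hole_mask, lam_hole=6.0):
    """Mask-weighted L1 reconstruction loss (illustrative, not the
    paper's exact loss).

    pred, target: (H, W) images (or feature maps)
    hole_mask:    (H, W) bool; True marks the missing region, whose
                  reconstruction error is up-weighted by lam_hole.
    """
    l1 = np.abs(pred - target)
    valid_term = l1[~hole_mask].mean()  # error on known pixels
    hole_term = l1[hole_mask].mean()    # error inside the hole
    return valid_term + lam_hole * hole_term
```

In practice such a pixel term is typically combined with a perceptual (deep-feature) loss computed on a pretrained network, which is what pushes the restored region toward visually plausible textures rather than merely low pixel error.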