
计算机工程 (Computer Engineering)



Medical Image Segmentation Based on Multi-scale Self-attention Transformer

  • Published: 2025-03-20


Abstract: In recent years, Transformer models have been widely applied to medical image segmentation owing to their strong representation power and outstanding ability to capture global information, achieving significant success. However, these methods divide an image into fixed-size patches during serialization and extract only single-scale global features, which fragments the image's semantic features and ultimately degrades segmentation accuracy. To address this issue, this paper proposes a Multi-Scale Self-Attention Transformer architecture (MultiFormer). The architecture first processes the image with sequential convolution and downsampling modules, then replaces the original 1×1 projection with a multi-scale convolutional projection module, and finally introduces deformable convolution into the feature maps generated by the self-attention module. Compared with the standard Transformer serialization process, the sequential convolutions effectively enlarge the receptive field while producing features of the same resolution, preserving the spatial correlation of the 2D image and avoiding the semantic information loss caused by fixed-position, fixed-size patches. In addition, the multi-scale convolutional projection module captures contextual information using four convolutional kernels of different sizes and fuses the multi-scale features through channel concatenation. This models local-to-local attention at multiple scales rather than a single one, strengthens the model's ability to aggregate semantic information across scales, and further alleviates the problem of semantic fragmentation.
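The multi-scale projection described above can be sketched as a small PyTorch module: four parallel convolutions with different kernel sizes replace the single 1×1 projection, and their outputs are fused by channel concatenation. The specific kernel sizes (1, 3, 5, 7) and the even channel split are illustrative assumptions; the abstract only states that four different sizes are used.

```python
import torch
import torch.nn as nn

class MultiScaleProjection(nn.Module):
    """Sketch of a multi-scale convolutional projection (assumed
    configuration, not the paper's exact one): four parallel branches
    with kernel sizes 1, 3, 5, 7, fused by channel concatenation."""

    def __init__(self, dim):
        super().__init__()
        assert dim % 4 == 0, "dim must split evenly across four branches"
        branch = dim // 4
        # Same-padding keeps the spatial size identical across branches.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, branch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7)
        )

    def forward(self, x):
        # x: (B, C, H, W); concatenation restores the channel count C,
        # so the module is a drop-in replacement for a 1x1 projection.
        return torch.cat([b(x) for b in self.branches], dim=1)

proj = MultiScaleProjection(dim=64)
out = proj(torch.randn(2, 64, 32, 32))
print(tuple(out.shape))  # (2, 64, 32, 32)
```

Because input and output shapes match those of a 1×1 projection, such a module can be swapped into an existing attention block without touching the rest of the network.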
Moreover, the deformable convolution introduces an additional convolutional layer that learns and generates an offset field, allowing the kernel to flexibly adjust its sampling shape to morphologically diverse lesions and organs, thereby improving the model's handling of complex medical images. The module is inserted into the SETR, TransUNet, and TransFuse network architectures and evaluated on the ACDC cardiac dataset and the ISIC2018 skin lesion dataset. The results show that the Dice coefficient increases by 3.63%, 1.06%, and 2.30% on ACDC and by 1.22%, 2.31%, and 3.01% on ISIC2018, respectively. MultiFormer is plug-and-play, enabling easy integration into various downstream medical image analysis tasks.