
计算机工程 (Computer Engineering)



Dual-branch anomaly detection network driven by visual language model

  • Published:2025-09-02


Abstract: Noise interference and low resolution significantly limit feature representation, causing loss of key details and degradation of semantic information, which in turn restricts model robustness and generalization in complex scenes. To address this problem, we construct MSRA-CLIP (Multi-scale and Residual Attention CLIP), a dual-branch anomaly detection network driven by a vision-language model. First, two parallel branches process the image: the upper branch uses a combined attention unit built on multi-scale attention, which improves image super-resolution quality while balancing computational complexity and performance; the lower branch uses a residual attention module whose stacked residual attention blocks and skip connections capture rich global and local features. The image features produced by the two branches are then concatenated. Finally, an image-text multi-level alignment module maps the fused image features into a joint embedding space, where they are compared with text features to generate anomaly maps. Experiments on five medical anomaly detection datasets (Brain MRI, Liver CT, etc.) demonstrate MSRA-CLIP's superiority over MVFA, with average AUC improvements of 5% in zero-shot anomaly classification, 1.1% in zero-shot anomaly segmentation, and 0.93% in few-shot anomaly classification.
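The pipeline described in the abstract (two parallel branches, feature concatenation, projection into a joint image-text embedding space, and a CLIP-style similarity comparison with text features) can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the two branch functions are toy stand-ins for the multi-scale combined attention unit and the residual attention module, and all weight matrices, the text prototype vectors, and the temperature value are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    # normalize feature vectors to unit length for cosine similarity
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def upper_branch(x, w):
    # toy stand-in for the multi-scale combined attention unit:
    # a single self-attention step over patch features, then a linear map
    attn = np.exp(x @ x.T / np.sqrt(x.shape[1]))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x @ w

def lower_branch(x, w):
    # toy stand-in for the residual attention module:
    # a linear map with a skip connection
    return x + x @ w

def anomaly_map(x, w_up, w_low, w_proj, t_normal, t_abnormal, temp=100.0):
    # run both branches on the patch features and concatenate the results
    feats = np.concatenate([upper_branch(x, w_up), lower_branch(x, w_low)], axis=1)
    z = l2norm(feats @ w_proj)                    # map into the joint embedding space
    t = l2norm(np.stack([t_normal, t_abnormal]))  # normal / abnormal text prototypes
    logits = temp * (z @ t.T)                     # temperature-scaled cosine similarity
    # numerically stable softmax over the two text classes
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                            # per-patch abnormal probability

# usage with random placeholder weights: 16 patches, feature dim 8, embed dim 4
N, d, e = 16, 8, 4
x = rng.normal(size=(N, d))
m = anomaly_map(x,
                rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                rng.normal(size=(2 * d, e)),
                rng.normal(size=e), rng.normal(size=e))
```

The output `m` is one abnormal-probability score per patch; reshaping it back to the spatial grid of patches would give the anomaly map compared against ground-truth segmentation masks.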