
计算机工程 (Computer Engineering)



Dual-branch anomaly detection network driven by visual language model

  • Published:2025-09-02


Abstract: Noise interference and low resolution significantly limit feature representation, causing loss of key details and degradation of semantic information, which in turn restricts model robustness and generalization in complex scenes. To address this problem, we construct MSRA-CLIP (Multi-scale and Residual Attention CLIP), a dual-branch anomaly detection network driven by a vision-language model. First, two parallel branches process the image: the upper branch uses a combined attention unit built on multi-scale attention, which improves image super-resolution quality while balancing computational complexity and performance; the lower branch uses a residual attention module whose stacked residual attention blocks and skip connections capture rich global and local features. The image features produced by the two branches are then concatenated. Finally, an image-text multi-level alignment module maps the fused image features into a joint embedding space, where they are compared with text features to generate anomaly maps. Experiments on five medical anomaly detection datasets (Brain MRI, Liver CT, etc.) demonstrate MSRA-CLIP's superiority over MVFA, with average AUC improvements of 5% in zero-shot anomaly classification, 1.1% in zero-shot anomaly segmentation, and 0.93% in few-shot anomaly classification.
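The pipeline described in the abstract (two parallel branches, feature concatenation, projection into a joint image-text embedding space, and a CLIP-style similarity comparison with text features) can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the two branch functions are toy stand-ins for the multi-scale combined attention unit and the residual attention module, and all weight matrices, the text prototype vectors, and the temperature value are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    # normalize feature vectors to unit length for cosine similarity
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def upper_branch(x, w):
    # toy stand-in for the multi-scale combined attention unit:
    # a single self-attention step over patch features, then a linear map
    attn = np.exp(x @ x.T / np.sqrt(x.shape[1]))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x @ w

def lower_branch(x, w):
    # toy stand-in for the residual attention module:
    # a linear map with a skip connection
    return x + x @ w

def anomaly_map(x, w_up, w_low, w_proj, t_normal, t_abnormal, temp=100.0):
    # run both branches on the patch features and concatenate the results
    feats = np.concatenate([upper_branch(x, w_up), lower_branch(x, w_low)], axis=1)
    z = l2norm(feats @ w_proj)                    # map into the joint embedding space
    t = l2norm(np.stack([t_normal, t_abnormal]))  # normal / abnormal text prototypes
    logits = temp * (z @ t.T)                     # temperature-scaled cosine similarity
    # numerically stable softmax over the two text classes
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                            # per-patch abnormal probability

# usage with random placeholder weights: 16 patches, feature dim 8, embed dim 4
N, d, e = 16, 8, 4
x = rng.normal(size=(N, d))
m = anomaly_map(x,
                rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                rng.normal(size=(2 * d, e)),
                rng.normal(size=e), rng.normal(size=e))
```

The output `m` is one abnormal-probability score per patch; reshaping it back to the spatial grid of patches would give the anomaly map compared against ground-truth segmentation masks.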