[1] Redmon J, Divvala S, Girshick R, et al. You only
look once: Unified, real-time object
detection[C]//Proceedings of the IEEE
conference on computer vision and pattern
recognition. 2016: 779-788.
[2] Liu W, Anguelov D, Erhan D, et al. SSD: Single
shot multibox detector[C]//Computer
Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14,
2016, Proceedings, Part I 14. Springer
International Publishing, 2016: 21-37.
[3] Ren S, He K, Girshick R, et al. Faster R-CNN:
Towards real-time object detection with region
proposal networks[J]. Advances in neural
information processing systems, 2015, 28.
[4] He K, Gkioxari G, Dollár P, et al. Mask
R-CNN[C]//Proceedings of the IEEE international
conference on computer vision. 2017:
2961-2969.
[5] Redmon J, Farhadi A. YOLOv3: An incremental
improvement[J]. arXiv preprint
arXiv:1804.02767, 2018.
[6] Carion N, Massa F, Synnaeve G, et al.
End-to-end object detection with
transformers[C]//European conference on
computer vision. Cham: Springer International
Publishing, 2020: 213-229.
[7] Everingham M, Van Gool L, Williams C K I, et al.
The PASCAL visual object classes (VOC) challenge[J].
International journal of computer vision, 2010,
88: 303-338.
[8] Liu L, Ouyang W, Wang X, et al. Deep learning
for generic object detection: A survey[J].
International journal of computer vision, 2020,
128: 261-318.
[9] Ulku I, Akagündüz E. A survey on deep
learning-based architectures for semantic
segmentation on 2d images[J]. Applied Artificial Intelligence, 2022, 36(1): 2032924.
[10] Zaidi S S A, Ansari M S, Aslam A, et al. A survey
of modern deep learning based object detection
models[J]. Digital Signal Processing, 2022, 126:
103514.
[11] Arkin E, Yadikar N, Muhtar Y, et al. A survey of
object detection based on CNN and
Transformer[C]//2021 IEEE 2nd international
conference on pattern recognition and machine
learning (PRML). IEEE, 2021: 99-108.
[12] Khan S, Naseer M, Hayat M, et al. Transformers
in vision: A survey[J]. ACM computing surveys
(CSUR), 2022, 54(10s): 1-41.
[13] Arkin E, Yadikar N, Xu X, et al. A survey: Object
detection methods from CNN to Transformer[J].
Multimedia Tools and Applications, 2023, 82(14):
21353-21383.
[14] Enzweiler M, Gavrila D M. Monocular pedestrian
detection: Survey and experiments[J]. IEEE
transactions on pattern analysis and machine
intelligence, 2008, 31(12): 2179-2195.
[15] Cheng G, Han J. A survey on object detection in
optical remote sensing images[J]. ISPRS journal
of photogrammetry and remote sensing, 2016,
117: 11-28.
[16] Chaudhari S, Mithal V, Polatkan G, et al. An
attentive survey of attention models[J]. ACM
Transactions on Intelligent Systems and
Technology (TIST), 2021, 12(5): 1-32.
Li Qingge, Yang Xiaogang, Lu Ruitao, et al. Overview of
Transformer development in computer vision[J]. Journal
of Chinese Computer Systems, 2023, 44(04): 850-861.
DOI: 10.20009/j.cnki.21-1106/TP.2022-0504.
[18] Tian Yonglin, Wang Yutong, Wang Jiangong, et al. Key
issues in visual Transformer research: current status and
prospects[J]. Acta Automatica Sinica, 2022, 48(04):
957-979. DOI: 10.16383/j.aas.c220027.
[19] Li Jian, Du Jianqiang, Zhu Yanchen, et al. Overview of
Transformer-based object detection algorithms[J].
Computer Engineering and Applications, 2023, 59(10):
48-64.
[20] Liu Yujing. Overview of Transformer-based object
detection research[J]. Computer Era, 2023(05): 6-10.
DOI: 10.16644/j.cnki.cn33-1094/tp.2023.05.002.
[21] Fan Rong, Ma Xiaolu. Improved DETR algorithm for
crowded pedestrian detection[J]. Computer Engineering
and Applications, 2023, 59(19): 159-165.
[22] Li Xiaojun, Liu Ying. Overview and prospects of DETR
object detection algorithm research[J]. Microcontrollers
& Embedded Systems, 2023, 23(05): 40-42.
[23] Zhang Zhizheng. Research on attention mechanisms in
neural networks[D]. University of Science and
Technology of China, 2021. DOI:
10.27517/d.cnki.gzkju.2021.000623.
[24] Chang Yue. Research on multimodal scene classification
algorithms based on the self-attention mechanism[D].
Nanjing University of Posts and Telecommunications,
2022. DOI: 10.27251/d.cnki.gnjdc.2022.000645.
[25] Wang Keping, Zhang Zijiao, Yang Yi, et al. Non-uniform
dehazing algorithm based on dual-attention convolution
and Transformer fusion[J/OL]. Journal of Beijing
University of Posts and Telecommunications, 1-8
[2023-12-19].
[26] Wang Keping, Zhang Zijiao, Yang Yi, et al. Non-uniform
dehazing algorithm based on dual-attention convolution
and Transformer fusion[J/OL]. Journal of Beijing
University of Posts and Telecommunications, 1-8
[2023-12-19].
[27] Wang Peisen. Research on deep learning methods for
image classification based on attention mechanisms[D].
University of Science and Technology of China, 2018.
[28] Zhu Zhangli, Rao Yuan, Wu Yuan, et al. Research
progress of attention mechanisms in deep learning[J].
Journal of Chinese Information Processing, 2019, 33(06):
1-11.
[29] Zhang Chenjia, Zhu Lei, Yu Lu. Overview of attention
mechanisms in convolutional neural networks[J].
Computer Engineering and Applications, 2021, 57(20):
64-72.
[30] Ren Huan, Wang Xuguang. Overview of attention
mechanisms[J]. Journal of Computer Applications, 2021,
41(S1): 1-6.
[31] Liu Wenting, Lu Xinming. Research progress of
Transformer in computer vision[J]. Computer
Engineering and Applications, 2022, 58(06): 1-16.
[32] Everingham M, Van Gool L, Williams C K I, et al. The
PASCAL visual object classes (VOC) challenge[J].
International Journal of Computer Vision, 2010, 88(2):
303-338.
[33] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO:
Common objects in context[C]//European conference on
computer vision. Springer, 2014: 740-755.
[34] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale
hierarchical image database[C]//2009 IEEE conference on
computer vision and pattern recognition. IEEE, 2009:
248-255.
[35] He K, Zhang X, Ren S, et al. Deep residual learning for
image recognition[C]//Proceedings of the IEEE conference
on computer vision and pattern recognition. 2016:
770-778.
[36] Zheng S, Lu J, Zhao H, et al. Rethinking semantic
segmentation from a sequence-to-sequence perspective
with transformers[C]//Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition.
2021: 6881-6890.
[37] Xie E, Wang W, Yu Z, et al. SegFormer: Simple and
efficient design for semantic segmentation with
transformers[J]. Advances in Neural Information
Processing Systems, 2021, 34: 12077-12090.
[38] Reed S, Akata Z, Yan X, et al. Generative adversarial text
to image synthesis[C]//International conference on
machine learning. PMLR, 2016: 1060-1069.
[39] Zhang H, Xu T, Li H, et al. StackGAN: Text to
photo-realistic image synthesis with stacked generative
adversarial networks[C]//Proceedings of the IEEE
international conference on computer vision. 2017:
5907-5915.
[40] Zhang H, Xu T, Li H, et al. StackGAN++: Realistic image
synthesis with stacked generative adversarial networks[J].
IEEE transactions on pattern analysis and machine
intelligence, 2018, 41(8): 1947-1962.
[41] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative
adversarial networks[J]. Communications of the ACM,
2020, 63(11): 139-144.
[42] Chen M, Radford A, Child R, et al. Generative pretraining
from pixels[C]//International conference on machine
learning. PMLR, 2020: 1691-1703.
[43] Esser P, Rombach R, Ommer B. Taming transformers for
high-resolution image synthesis[C]//Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition. 2021: 12873-12883.
[44] Jiang Y, Chang S, Wang Z. TransGAN: Two transformers
can make one strong GAN[J]. arXiv preprint
arXiv:2102.07074, 2021.
[45] Chen M, Radford A, Child R, et al. Generative pretraining
from pixels[C]//International conference on machine
learning. PMLR, 2020: 1691-1703.
[46] Jiang Y, Chang S, Wang Z. TransGAN: Two pure
transformers can make one strong GAN, and that can scale
up[J]. Advances in Neural Information Processing Systems,
2021, 34: 14745-14758.
[47] Ding M, Yang Z, Hong W, et al. CogView: Mastering
text-to-image generation via transformers[J]. Advances in
Neural Information Processing Systems, 2021, 34:
19822-19835.
[48] Van Den Oord A, Vinyals O. Neural discrete representation
learning[J]. Advances in neural information processing
systems, 2017, 30.
[49] Liu S, Fan H, Qian S, et al. HiT: Hierarchical transformer
with momentum contrast for video-text
retrieval[C]//Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2021: 11915-11925.
[50] Lin K, Li L, Lin C C, et al. SwinBERT: End-to-end
transformers with sparse attention for video
captioning[C]//Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2022:
17949-17958.
[51] Carion N, Massa F, Synnaeve G, et al. End-to-end object
detection with transformers[C]//European conference on
computer vision. Cham: Springer International Publishing,
2020: 213-229.
[52] Girshick R, Donahue J, Darrell T, et al. Rich feature
hierarchies for accurate object detection and semantic
segmentation[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2014: 580-587.
[53] Dai J, Qi H, Xiong Y, et al. Deformable convolutional
networks[C]//Proceedings of the IEEE international
conference on computer vision. 2017: 764-773.
[54] Zhu X, Su W, Lu L, et al. Deformable DETR: Deformable
transformers for end-to-end object detection[J]. arXiv
preprint arXiv:2010.04159, 2020.
[55] Dai X, Chen Y, Yang J, et al. Dynamic DETR: End-to-end
object detection with dynamic attention[C]//Proceedings of
the IEEE/CVF International Conference on Computer
Vision. 2021: 2988-2997.
[56] Yao Z, Ai J, Li B, et al. Efficient DETR: Improving
end-to-end object detector with dense prior[J]. arXiv
preprint arXiv:2104.01318, 2021.
[57] Gao P, Zheng M, Wang X, et al. Fast convergence of DETR
with spatially modulated co-attention[C]//Proceedings of
the IEEE/CVF international conference on computer
vision. 2021: 3621-3630.
[58] Roh B, Shin J W, Shin W, et al. Sparse DETR: Efficient
end-to-end object detection with learnable sparsity[J].
arXiv preprint arXiv:2111.14330, 2021.
[59] Lv W, Xu S, Zhao Y, et al. DETRs beat YOLOs on real-time
object detection[J]. arXiv preprint arXiv:2304.08069,
2023.
[60] Li F, Zeng A, Liu S, et al. Lite DETR: An interleaved
multi-scale encoder for efficient DETR[C]//Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2023: 18558-18567.
[61] Zhang G, Luo Z, Yu Y, et al. Accelerating DETR
convergence via semantic-aligned
matching[C]//Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2022: 949-958.
[62] Cao X, Yuan P, Feng B, et al. CF-DETR: Coarse-to-fine
transformers for end-to-end object
detection[C]//Proceedings of the AAAI Conference on
Artificial Intelligence. 2022, 36(1): 185-193.
[63] Liu S, Li F, Zhang H, et al. DAB-DETR: Dynamic anchor
boxes are better queries for DETR[J]. arXiv preprint
arXiv:2201.12329, 2022.
[64] Meng D, Chen X, Fan Z, et al. Conditional DETR for fast
training convergence[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021:
3651-3660.
[65] Zhang H, Li F, Liu S, et al. DINO: DETR with improved
denoising anchor boxes for end-to-end object detection[J].
arXiv preprint arXiv:2203.03605, 2022.
[66] Dai Z, Cai B, Lin Y, et al. UP-DETR: Unsupervised
pre-training for object detection with
transformers[C]//Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition. 2021:
1601-1610.
[67] Li F, Zhang H, Liu S, et al. DN-DETR: Accelerate DETR
training by introducing query denoising[C]//Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2022: 13619-13627.
[68] Chen Q, Chen X, Wang J, et al. Group DETR: Fast DETR
training with group-wise one-to-many
assignment[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023:
6633-6642.