1 |
REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2016: 779-788.
|
2 |
LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the European Conference on Computer Vision (ECCV). Berlin, Germany: Springer International Publishing, 2016.
|
3 |
REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
doi: 10.1109/TPAMI.2016.2577031
|
4 |
HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2017: 2980-2988.
|
5 |
|
6 |
CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//Proceedings of the European Conference on Computer Vision (ECCV). Berlin, Germany: Springer International Publishing, 2020: 213-229.
|
7 |
LIU L, OUYANG W L, WANG X G, et al. Deep learning for generic object detection: a survey. International Journal of Computer Vision, 2020, 128(2): 261-318.
doi: 10.1007/s11263-019-01247-4
|
8 |
ZAIDI S S A, ANSARI M S, ASLAM A, et al. A survey of modern deep learning based object detection models. Digital Signal Processing, 2022, 126: 103514.
doi: 10.1016/j.dsp.2022.103514
|
9 |
ARKIN E, YADIKAR N, MUHTAR Y, et al. A survey of object detection based on CNN and transformer[C]//Proceedings of the 2nd International Conference on Pattern Recognition and Machine Learning (PRML). Washington D.C., USA: IEEE Press, 2021: 99-108.
|
10 |
KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: a survey. ACM Computing Surveys, 2022, 54(10): 1-41.
doi: 10.1145/3505244
|
11 |
ARKIN E, YADIKAR N, XU X B, et al. A survey: object detection methods from CNN to transformer. Multimedia Tools and Applications, 2023, 82(14): 21353-21383.
doi: 10.1007/s11042-022-13801-3
|
12 |
CHAUDHARI S, MITHAL V, POLATKAN G, et al. An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology, 2021, 12(5): 1-32.
doi: 10.1145/3465055
|
13 |
李清格, 杨小冈, 卢瑞涛, 等. 计算机视觉中的Transformer发展综述. 小型微型计算机系统, 2023, 44(4): 850-861.
|
|
LI Q G, YANG X G, LU R T, et al. Transformer in computer vision: a survey. Journal of Chinese Computer Systems, 2023, 44(4): 850-861.
|
14 |
田永林, 王雨桐, 王建功, 等. 视觉Transformer研究的关键问题: 现状及展望. 自动化学报, 2022, 48(4): 957-979.
doi: 10.16383/j.aas.c220027
|
|
TIAN Y L, WANG Y T, WANG J G, et al. Key issues in visual Transformer research: current status and prospects. Acta Automatica Sinica, 2022, 48(4): 957-979.
doi: 10.16383/j.aas.c220027
|
15 |
李建, 杜建强, 朱彦陈, 等. 基于Transformer的目标检测算法综述. 计算机工程与应用, 2023, 59(10): 48-64.
doi: 10.3778/j.issn.1002-8331.2211-0133
|
|
LI J, DU J Q, ZHU Y C, et al. Survey of Transformer-based object detection algorithms. Computer Engineering and Applications, 2023, 59(10): 48-64.
doi: 10.3778/j.issn.1002-8331.2211-0133
|
16 |
刘宇晶. 基于Transformer的目标检测研究综述. 计算机时代, 2023(5): 6-10.
doi: 10.19850/j.cnki.2096-4706.2021.07.004
|
|
LIU Y J. Summary of research on target detection based on Transformer. Computer Era, 2023(5): 6-10.
doi: 10.19850/j.cnki.2096-4706.2021.07.004
|
17 |
EVERINGHAM M, VAN GOOL L, WILLIAMS C K I, et al. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303-338.
doi: 10.1007/s11263-009-0275-4
|
18 |
|
19 |
DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2009: 248-255.
|
20 |
HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2016: 770-778.
|
21 |
ZHENG S X, LU J C, ZHAO H S, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 6877-6886.
|
22 |
XIE E Z, WANG W H, YU Z D, et al. SegFormer: simple and efficient design for semantic segmentation with Transformers[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2105.15203.
|
23 |
|
24 |
ZHANG H, XU T, LI H S, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2017: 5908-5916.
|
25 |
ZHANG H, XU T, LI H S, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962.
doi: 10.1109/TPAMI.2018.2856256
|
26 |
GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks. Communications of the ACM, 2020, 63(11): 139-144.
doi: 10.1145/3422622
|
27 |
CHEN M, RADFORD A, CHILD R, et al. Generative pretraining from pixels[C]//Proceedings of the International Conference on Machine Learning (ICML). New York, USA: PMLR, 2020: 1691-1703.
|
28 |
ESSER P, ROMBACH R, OMMER B. Taming Transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 12868-12878.
|
29 |
JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure Transformers can make one strong GAN, and that can scale up[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2102.07074.
|
30 |
|
31 |
|
32 |
LIU S, FAN H Q, QIAN S S, et al. HiT: hierarchical Transformer with momentum contrast for video-text retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 11895-11905.
|
33 |
LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end Transformers with sparse attention for video captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 17928-17937.
|
34 |
GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2014: 580-587.
|
35 |
DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2017: 764-773.
|
36 |
ZHU X Z, SU W J, LU L W, et al. Deformable DETR: deformable Transformers for end-to-end object detection[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2010.04159.
|
37 |
DAI X Y, CHEN Y P, YANG J W, et al. Dynamic DETR: end-to-end object detection with dynamic attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 2968-2977.
|
38 |
YAO Z Y, AI J B, LI B X, et al. Efficient DETR: improving end-to-end object detector with dense prior[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2104.01318.
|
39 |
GAO P, ZHENG M H, WANG X G, et al. Fast convergence of DETR with spatially modulated co-attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 3601-3610.
|
40 |
ROH B, SHIN J, SHIN W, et al. Sparse DETR: efficient end-to-end object detection with learnable sparsity[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2111.14330.
|
41 |
|
42 |
LI F, ZENG A L, LIU S L, et al. Lite DETR: an interleaved multi-scale encoder for efficient DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 18558-18567.
|
43 |
ZHANG G J, LUO Z P, YU Y C, et al. Accelerating DETR convergence via semantic-aligned matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 939-948.
|
44 |
CAO X P, YUAN P, FENG B L, et al. CF-DETR: coarse-to-fine Transformers for end-to-end object detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2022: 185-193.
|
45 |
|
46 |
MENG D P, CHEN X K, FAN Z J, et al. Conditional DETR for fast training convergence[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 3631-3640.
|
47 |
ZHANG H, LI F, LIU S L, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[EB/OL]. [2023-09-15]. http://arxiv.org/abs/2203.03605.
|
48 |
DAI Z G, CAI B L, LIN Y G, et al. UP-DETR: unsupervised pre-training for object detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 1601-1610.
|
49 |
LI F, ZHANG H, LIU S L, et al. DN-DETR: accelerate DETR training by introducing query DeNoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 13609-13617.
|
50 |
CHEN Q, CHEN X K, WANG J, et al. Group DETR: fast DETR training with group-wise one-to-many assignment[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 6610-6619.
|