Review of DETR Object Detection Algorithm Based on Transformer

doi:10.19678/j.issn.1000-3428.0069312

Abstract

Convolutional Neural Networks (CNNs) are widely used in object detection and are well regarded in the research community for their accuracy and scalability. They have given rise to many notable models, including the Region-based Convolutional Neural Network (R-CNN) family (such as Fast R-CNN and Faster R-CNN) and the You Only Look Once (YOLO) series. Following the success of Transformers in natural language processing, researchers began applying them to computer vision, which led to visual backbone networks such as Vision Transformer (ViT) and Swin Transformer. In 2020, a Facebook research team introduced DEtection TRansformer (DETR), an end-to-end object detection algorithm based on Transformers and designed to reduce the prior knowledge and post-processing that object detection tasks require. Although DETR shows promise, it has limitations, including slow convergence, relatively low accuracy, and the unclear physical meaning of its object queries. These issues have prompted a wave of research aimed at improving the algorithm. This paper collects and summarizes the efforts to improve DETR and assesses their respective strengths and weaknesses. It also surveys state-of-the-art research and specialized application domains that employ DETR, and concludes with an outlook on the future role of DETR in computer vision.

Keywords: computer vision, object detection, DETR algorithm, Vision Transformer (ViT), image segmentation
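
The abstract describes DETR as end-to-end, with little need for post-processing. One way to picture this is the one-to-one assignment between predictions and ground-truth objects that DETR-style training relies on. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `box_l1`, `match_cost`, and `bipartite_match` and the `box_weight` value are hypothetical, the cost is simplified to a class-probability term plus an L1 box term, and a brute-force search stands in for the Hungarian algorithm.

```python
from itertools import permutations


def box_l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target, box_weight=5.0):
    """Cost of assigning one prediction to one ground-truth object.

    Simplified assumption: negative class probability plus a weighted
    L1 box distance. box_weight is an illustrative value.
    """
    class_cost = -pred["probs"][target["label"]]
    return class_cost + box_weight * box_l1(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Return sorted (pred_idx, target_idx) pairs with minimal total cost.

    Brute force over all one-to-one assignments; only practical for tiny
    inputs and requires len(preds) >= len(targets).
    """
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(perm)]
    return sorted(best_pairs)


# Three predictions (queries) and two ground-truth objects, made-up values.
preds = [
    {"probs": [0.1, 0.8], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.9, 0.05], "box": (0.2, 0.2, 0.1, 0.1)},
    {"probs": [0.3, 0.3], "box": (0.8, 0.8, 0.3, 0.3)},
]
targets = [
    {"label": 0, "box": (0.2, 0.2, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]

pairs = bipartite_match(preds, targets)
matched = {p for p, _ in pairs}
# Predictions left unmatched would be trained toward a "no object" class.
unmatched = [i for i in range(len(preds)) if i not in matched]
```

Because each ground-truth object is assigned to exactly one prediction, duplicate detections are penalized during training rather than filtered afterwards, which is why this family of detectors can do without a separate suppression step.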

LI Yiyang, LU Shenglian, WANG Jijie, CHEN Ming. Review of DETR Object Detection Algorithm Based on Transformer[J]. Computer Engineering, 2026, 52(4): 62-81.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069312

https://www.ecice06.com/EN/Y2026/V52/I4/62

Figures/Tables 17

Fig.1 Encoder-decoder structure of DETR

Fig.2 Attention scores increasing with epochs in different decoder layers

Fig.3 Attention structure of Deformable-DETR

Fig.4 Structure of Co-DETR

Fig.5 Detection flow of Efficient DETR

Fig.6 Architecture of DINO model

Fig.7 Comparison of hybrid matching methods on Deformable-DETR

Fig.8 RT-DETR architecture

Fig.9 Meta-DETR architecture

Fig.10 Encoder-decoder architecture of UP-DETR

Fig.11 Decoder architecture of Dynamic-DETR

Fig.12 Decoder architecture of MS-DETR

