
Computer Engineering ›› 2026, Vol. 52 ›› Issue (4): 62-81. doi: 10.19678/j.issn.1000-3428.0069312

• Frontier Perspectives and Reviews •

Review of DETR Object Detection Algorithm Based on Transformer

LI Yiyang1, LU Shenglian1, WANG Jijie2, CHEN Ming1   

  1. Guangxi Key Laboratory of Multi-Source Information Mining and Security, School of Computer Science and Engineering, Guangxi Normal University, Guilin 541004, Guangxi, China;
  2. Academic Affairs Office, Guangxi Normal University, Guilin 541004, Guangxi, China
  • Received: 2024-01-29; Revised: 2024-08-08; Published: 2024-12-11

基于Transformer的DETR目标检测算法综述

李沂杨1, 陆声链1, 王继杰2, 陈明1   

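  • About the authors: LI Yiyang (CCF student member), male, master's student, whose main research interests are artificial intelligence and computer vision; LU Shenglian (corresponding author), professor, Ph.D., E-mail: lsl@gxnu.edu.cn; WANG Jijie, professor, Ph.D., doctoral supervisor; CHEN Ming, engineer, master's degree.
  • Funding:
    National Natural Science Foundation of China (61662006); Director's Fund of the Guangxi Key Laboratory of Multi-Source Information Mining and Security (20-A-02-02).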

Abstract: Convolutional Neural Networks (CNNs) have long dominated object detection research and are widely recognized in the academic community for their accuracy and scalability. They have given rise to many representative models, including the Region-based Convolutional Neural Network (R-CNN) series (such as Fast R-CNN and Faster R-CNN) and the You Only Look Once (YOLO) series. Following the success of Transformers in natural language processing, researchers began applying them to computer vision, which led to visual backbone networks such as Vision Transformer (ViT) and Swin Transformer. In 2020, a Facebook research team introduced DEtection TRansformer (DETR), an end-to-end, Transformer-based object detection algorithm designed to reduce the prior knowledge and postprocessing that object detection requires. Although DETR shows promise, it suffers from slow convergence, relatively low accuracy, and object queries whose physical meaning is unclear. These shortcomings have prompted extensive follow-up research to improve the algorithm. This paper collates and summarizes the improvements proposed for DETR and analyzes their strengths and weaknesses. It also surveys state-of-the-art research and specialized application domains that use DETR, and concludes with an outlook on the future role of DETR in computer vision.

Key words: computer vision, object detection, DETR algorithm, Vision Transformer (ViT), image segmentation

