LI YiYang, LU ShengLian, WANG JiJie, CHEN Ming
Accepted: 2024-12-11
Convolutional Neural Networks (CNNs) have long dominated object detection, widely recognized in the research community for their accuracy and scalability, and the field has produced numerous notable models, including the R-CNN series (Fast R-CNN, Faster R-CNN, and others) and the YOLO series. Following the success of Transformers in natural language processing, researchers began to explore their application in computer vision, leading to visual backbone networks such as ViT and the Swin Transformer. In 2020, the Facebook research team introduced DETR, an end-to-end, Transformer-based object detection algorithm designed to minimize the need for hand-crafted priors and post-processing in detection tasks. Despite its promise, DETR has notable shortcomings, including slow convergence, reduced accuracy, and the unclear physical meaning of its object queries. These issues have spurred a wave of research aimed at refining and enhancing the algorithm. This paper collates, scrutinizes, and synthesizes the various efforts directed towards improving DETR, assessing their respective merits and drawbacks. It further offers a comprehensive overview of state-of-the-art research and specialized application domains that employ DETR, and concludes with a prospective analysis of DETR’s future role in computer vision.