[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[J/OL]. Communications of the ACM, 2017: 84-90. http://dx.doi.org/10.1145/3065386. DOI:10.1145/3065386.
[2] REDMON J, DIVVALA S, GIRSHICK R, et al. You Only Look Once: Unified, Real-Time Object Detection[C/OL]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. http://dx.doi.org/10.1109/cvpr.2016.91. DOI:10.1109/cvpr.2016.91.
[3] LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single Shot MultiBox Detector[M/OL]//Computer Vision – ECCV 2016, Lecture Notes in Computer Science. 2016: 21-37. http://dx.doi.org/10.1007/978-3-319-46448-0_2. DOI:10.1007/978-3-319-46448-0_2.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All you Need[C]//Advances in Neural Information Processing Systems. 2017.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[J]. arXiv preprint arXiv:2010.11929, 2020.
[6] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00986. DOI:10.1109/iccv48922.2021.00986.
[7] CARION N, MASSA F, SYNNAEVE G, et al. End-to-End Object Detection with Transformers[M/OL]//Computer Vision – ECCV 2020, Lecture Notes in Computer Science. 2020: 213-229. http://dx.doi.org/10.1007/978-3-030-58452-8_13. DOI:10.1007/978-3-030-58452-8_13.
[8] HE K, ZHANG X, REN S, et al. Deep Residual Learning for Image Recognition[C/OL]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. http://dx.doi.org/10.1109/cvpr.2016.90. DOI:10.1109/cvpr.2016.90.
[9] LIU Y, WANG Y, WANG S, et al. CBNet: A Novel Composite Backbone Network Architecture for Object Detection[J/OL]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 11653-11660. http://dx.doi.org/10.1609/aaai.v34i07.6834. DOI:10.1609/aaai.v34i07.6834.
[10] SUN Z, CAO S, YANG Y, et al. Rethinking Transformer-based Set Prediction for Object Detection[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00359. DOI:10.1109/iccv48922.2021.00359.
[11] GAO P, ZHENG M, WANG X, et al. Fast Convergence of DETR with Spatially Modulated Co-Attention[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00360. DOI:10.1109/iccv48922.2021.00360.
[12] YE M, KE L, LI S, et al. Cascade-DETR: Delving into High-Quality Universal Object Detection[C/OL]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. DOI:10.1109/iccv51070.2023.00617.
[13] ROH B, SHIN J, SHIN W C, et al. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity[J]. arXiv preprint arXiv:2111.14330, 2021.
[14] ZHENG D, DONG W, HU H, et al. Less is More: Focus Attention for Efficient DETR[C/OL]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. DOI:10.1109/iccv51070.2023.00614.
[15] 曹健, 陈怡梅, 李海生, 等. 基于深度学习的道路小目标检测综述[J]. 计算机工程, 2023: 17.
[16] 董刚, 谢维成, 黄小龙, 等. 深度学习小目标检测算法综述[J]. 计算机工程与应用, 2023, 59(11): 16-27.
[17] ZHANG J, HUANG J, LUO Z, et al. DA-DETR: Domain Adaptive Detection Transformer with Information Fusion[J]. arXiv preprint arXiv:2103.17084, 2021.
[18] WANG T, YUAN L, CHEN Y, et al. PnP-DETR: Towards Efficient Visual Analysis with Transformers[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00462. DOI:10.1109/iccv48922.2021.00462.
[19] ZHANG C, LIU L, ZANG X, et al. DETR++: Taming Your Multi-Scale Detection Transformer[J]. arXiv preprint arXiv:2206.02977, 2022.
[20] TAN M, PANG R, LE Q V. EfficientDet: Scalable and Efficient Object Detection[C/OL]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. http://dx.doi.org/10.1109/cvpr42600.2020.01079. DOI:10.1109/cvpr42600.2020.01079.
[21] ZONG Z, SONG G, LIU Y. DETRs with Collaborative Hybrid Assignments Training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 6748-6758.
[22] YAO Z, AI J, LI B, et al. Efficient DETR: Improving End-to-End Object Detector with Dense Prior[J]. arXiv preprint arXiv:2104.01318, 2021.
[23] MENG D, CHEN X, FAN Z, et al. Conditional DETR for Fast Training Convergence[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 3651-3660.
[24] CHEN X, WEI F, ZENG G, et al. Conditional DETR V2: Efficient Detection Transformer with Box Queries[J]. arXiv preprint arXiv:2207.08914, 2022.
[25] WANG Y, ZHANG X, YANG T, et al. Anchor DETR: Query Design for Transformer-Based Detector[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(3): 2567-2575.
[26] LIU S, LI F, ZHANG H, et al. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022.
[27] LIU Y, ZHANG Y, WANG Y, et al. SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency[J]. arXiv preprint arXiv:2211.02006, 2022.
[28] LI F, ZHANG H, LIU S, et al. DN-DETR: Accelerate DETR Training by Introducing Query Denoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 13619-13627.
[29] ZHANG H, LI F, LIU S, et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[J]. arXiv preprint arXiv:2203.03605, 2022.
[30] CHEN Q, CHEN X, WANG J, et al. Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment[J]. arXiv preprint arXiv:2207.13085, 2022.
[31] JIA D, YUAN Y, HE H, et al. DETRs with Hybrid Matching[C/OL]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. DOI:10.1109/cvpr52729.2023.01887.
[32] 潘晓英, 贾凝心, 穆元震, 等. 小目标检测研究综述[J]. 中国图象图形学报, 2023, 28(09): 2587-2615.
[33] 陈洛轩, 林成创, 郑招良, 等. Transformer在计算机视觉场景下的研究综述[J]. 计算机科学, 2023: 29.
[34] LI F, ZENG A, LIU S, et al. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 18558-18567.
[35] ZHAO Y, LV W, XU S, et al. DETRs Beat YOLOs on Real-Time Object Detection[J]. arXiv preprint arXiv:2304.08069, 2023.
[36] ZHANG G, LUO Z, CUI K, et al. Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation[J]. arXiv preprint arXiv:2103.11731, 2021.
[37] BULAT A, GUERRERO R, MARTINEZ B, et al. FS-DETR: Few-Shot DEtection TRansformer with Prompting and Without Re-Training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 11793-11802.
[38] RADFORD A, KIM J, HALLACY C, et al. Learning Transferable Visual Models From Natural Language Supervision[J]. arXiv preprint arXiv:2103.00020, 2021.
[39] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving Language Understanding by Generative Pre-Training[R]. OpenAI, 2018.
[40] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C/OL]//Proceedings of the 2019 Conference of the North. 2019. http://dx.doi.org/10.18653/v1/n19-1423. DOI:10.18653/v1/n19-1423.
[41] DAI Z, CAI B, LIN Y, et al. Unsupervised Pre-Training for Detection Transformers[J/OL]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022: 1-11. http://dx.doi.org/10.1109/tpami.2022.3216514. DOI:10.1109/tpami.2022.3216514.
[42] CARON M, MISRA I, MAIRAL J, et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments[C]//Advances in Neural Information Processing Systems. 2020.
[43] CHEN Z, HUANG G, LI W, et al. Siamese DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 15722-15731.
[44] LIU S, HUANG S, LI F, et al. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 37(2): 1728-1736.
[45] KAMATH A, SINGH M, LECUN Y, et al. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00180. DOI:10.1109/iccv48922.2021.00180.
[46] SHI F, GAO R, HUANG W, et al. Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[47] ZANG Y, LI W, ZHOU K, et al. Open-Vocabulary DETR with Conditional Matching[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 106-122.
[48] WANG J, SUN A, ZHANG H, et al. MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction[J]. arXiv preprint arXiv:2305.18969, 2023.
[49] 周丽娟, 毛嘉宁. 视觉Transformer识别任务研究综述[J]. 中国图象图形学报, 2023, 28(10): 2969-3003.
[50] 李清格, 杨小冈, 卢瑞涛, 等. 计算机视觉中的Transformer发展综述[J]. 小型微型计算机系统, 2022, 44(04): 850-861.