[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet Classification with Deep Convolutional Neural Networks[J/OL]. Communications of the ACM, 2017: 84-90. http://dx.doi.org/10.1145/3065386. DOI:10.1145/3065386.
[2] REDMON J, DIVVALA S, GIRSHICK R, et al. You Only Look Once: Unified, Real-Time Object Detection[C/OL]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. http://dx.doi.org/10.1109/cvpr.2016.91. DOI:10.1109/cvpr.2016.91.
[3] LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single Shot MultiBox Detector[M/OL]//Computer Vision – ECCV 2016, Lecture Notes in Computer Science. 2016: 21-37. http://dx.doi.org/10.1007/978-3-319-46448-0_2. DOI:10.1007/978-3-319-46448-0_2.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All you Need[C]//Advances in Neural Information Processing Systems. 2017.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[J]. arXiv preprint arXiv:2010.11929, 2020.
[6] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00986. DOI:10.1109/iccv48922.2021.00986.
[7] CARION N, MASSA F, SYNNAEVE G, et al. End-to-End Object Detection with Transformers[M/OL]//Computer Vision – ECCV 2020, Lecture Notes in Computer Science. 2020: 213-229. http://dx.doi.org/10.1007/978-3-030-58452-8_13. DOI:10.1007/978-3-030-58452-8_13.
[8] HE K, ZHANG X, REN S, et al. Deep Residual Learning for Image Recognition[C/OL]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. http://dx.doi.org/10.1109/cvpr.2016.90. DOI:10.1109/cvpr.2016.90.
[9] LIU Y, WANG Y, WANG S, et al. CBNet: A Novel Composite Backbone Network Architecture for Object Detection[J/OL]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 11653-11660. http://dx.doi.org/10.1609/aaai.v34i07.6834. DOI:10.1609/aaai.v34i07.6834.
[10] SUN Z, CAO S, YANG Y, et al. Rethinking Transformer-based Set Prediction for Object Detection[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00359. DOI:10.1109/iccv48922.2021.00359.
[11] GAO P, ZHENG M, WANG X, et al. Fast Convergence of DETR with Spatially Modulated Co-Attention[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00360. DOI:10.1109/iccv48922.2021.00360.
[12] YE M, KE L, LI S, et al. Cascade-DETR: Delving into High-Quality Universal Object Detection[C/OL]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. DOI:10.1109/iccv51070.2023.00617.
[13] ROH B, SHIN J, SHIN W C, et al. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity[J]. arXiv preprint arXiv:2111.14330, 2021.
[14] ZHENG D, DONG W, HU H, et al. Less is More: Focus Attention for Efficient DETR[C/OL]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023. DOI:10.1109/iccv51070.2023.00614.
[15] 曹健, 陈怡梅, 李海生, 等. 基于深度学习的道路小目标检测综述[J]. 计算机工程, 2023: 17.
[16] 董刚, 谢维成, 黄小龙, 等. 深度学习小目标检测算法综述[J]. 计算机工程与应用, 2023, 59(11): 16-27.
[17] ZHANG J, HUANG J, LUO Z, et al. DA-DETR: Domain Adaptive Detection Transformer with Information Fusion[J]. arXiv preprint arXiv:2103.17084, 2021.
[18] WANG T, YUAN L, CHEN Y, et al. PnP-DETR: Towards Efficient Visual Analysis with Transformers[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00462. DOI:10.1109/iccv48922.2021.00462.
[19] ZHANG C, LIU L, ZANG X, et al. DETR++: Taming Your Multi-Scale Detection Transformer[J]. arXiv preprint arXiv:2206.02977, 2022.
[20] TAN M, PANG R, LE Q V. EfficientDet: Scalable and Efficient Object Detection[C/OL]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. http://dx.doi.org/10.1109/cvpr42600.2020.01079. DOI:10.1109/cvpr42600.2020.01079.
[21] ZONG Z, SONG G, LIU Y. DETRs with Collaborative Hybrid Assignments Training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 6748-6758.
[22] YAO Z, AI J, LI B, et al. Efficient DETR: Improving End-to-End Object Detector with Dense Prior[J]. arXiv preprint arXiv:2104.01318, 2021.
[23] MENG D, CHEN X, FAN Z, et al. Conditional DETR for Fast Training Convergence[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 3651-3660.
[24] CHEN X, WEI F, ZENG G, et al. Conditional DETR V2: Efficient Detection Transformer with Box Queries[J]. arXiv preprint arXiv:2207.08914, 2022.
[25] WANG Y, ZHANG X, YANG T, et al. Anchor DETR: Query Design for Transformer-Based Detector[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(3): 2567-2575.
[26] LIU S, LI F, ZHANG H, et al. DAB-DETR: Dynamic Anchor Boxes Are Better Queries for DETR[J]. arXiv preprint arXiv:2201.12329, 2022.
[27] LIU Y, ZHANG Y, WANG Y, et al. SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency[J]. arXiv preprint arXiv:2211.02006, 2022.
[28] LI F, ZHANG H, LIU S, et al. DN-DETR: Accelerate DETR Training by Introducing Query Denoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 13619-13627.
[29] ZHANG H, LI F, LIU S, et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection[J]. arXiv preprint arXiv:2203.03605, 2022.
[30] CHEN Q, CHEN X, WANG J, et al. Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment[J]. arXiv preprint arXiv:2207.13085, 2022.
[31] JIA D, YUAN Y, HE H, et al. DETRs with Hybrid Matching[C/OL]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. DOI:10.1109/cvpr52729.2023.01887.
[32] 潘晓英, 贾凝心, 穆元震, 等. 小目标检测研究综述[J]. 中国图象图形学报, 2023, 28(09): 2587-2615.
[33] 陈洛轩, 林成创, 郑招良, 等. Transformer在计算机视觉场景下的研究综述[J]. 计算机科学, 2023: 29.
[34] LI F, ZENG A, LIU S, et al. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 18558-18567.
[35] ZHAO Y, LV W, XU S, et al. DETRs Beat YOLOs on Real-Time Object Detection[J]. arXiv preprint arXiv:2304.08069, 2023.
[36] ZHANG G, LUO Z, CUI K, et al. Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation[J]. arXiv preprint arXiv:2103.11731, 2021.
[37] BULAT A, GUERRERO R, MARTINEZ B, et al. FS-DETR: Few-Shot DEtection TRansformer with Prompting and Without Re-Training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 11793-11802.
[38] RADFORD A, KIM J, HALLACY C, et al. Learning Transferable Visual Models From Natural Language Supervision[J]. arXiv preprint arXiv:2103.00020, 2021.
[39] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving Language Understanding by Generative Pre-Training[R]. OpenAI, 2018.
[40] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C/OL]//Proceedings of the 2019 Conference of the North. 2019. http://dx.doi.org/10.18653/v1/n19-1423. DOI:10.18653/v1/n19-1423.
[41] DAI Z, CAI B, LIN Y, et al. Unsupervised Pre-Training for Detection Transformers[J/OL]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022: 1-11. http://dx.doi.org/10.1109/tpami.2022.3216514. DOI:10.1109/tpami.2022.3216514.
[42] CARON M, MISRA I, MAIRAL J, et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments[C]//Advances in Neural Information Processing Systems. 2020.
[43] CHEN Z, HUANG G, LI W, et al. Siamese DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 15722-15731.
[44] LIU S, HUANG S, LI F, et al. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 37(2): 1728-1736.
[45] KAMATH A, SINGH M, LECUN Y, et al. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding[C/OL]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. http://dx.doi.org/10.1109/iccv48922.2021.00180. DOI:10.1109/iccv48922.2021.00180.
[46] SHI F, GAO R, HUANG W, et al. Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[47] ZANG Y, LI W, ZHOU K, et al. Open-Vocabulary DETR with Conditional Matching[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 106-122.
[48] WANG J, SUN A, ZHANG H, et al. MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction[J]. arXiv preprint arXiv:2305.18969, 2023.
[49] 周丽娟, 毛嘉宁. 视觉Transformer识别任务研究综述[J]. 中国图象图形学报, 2023, 28(10): 2969-3003.
[50] 李清格, 杨小冈, 卢瑞涛, 等. 计算机视觉中的Transformer发展综述[J]. 小型微型计算机系统, 2022, 44(04): 850-861.