[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90. [2] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2016: 779-788. [3] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of ECCV’16. Berlin, Germany: Springer International Publishing, 2016: 21-37. [4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL].[2024-01-14]. https://arxiv.org/abs/1706.03762. [5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL].[2024-01-14]. https://arxiv.org/abs/2010.11929. [6] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 9992-10002. [7] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]//Proceedings of ECCV’20. Berlin, Germany: Springer International Publishing, 2020: 213-229. [8] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2016: 770-778. [9] LIU Y D, WANG Y T, WANG S W, et al. CBNet: a novel composite backbone network architecture for object detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2020: 11653-11660. [10] SUN Z Q, CAO S C, YANG Y M, et al. Rethinking Transformer-based set prediction for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 3591-3600. [11] GAO P, ZHENG M H, WANG X G, et al. Fast convergence of DETR with spatially modulated co-attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 3601-3610. [12] YE M Q, KE L, LI S Y, et al. Cascade-DETR: delving into high-quality universal object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 6681-6691. [13] ROH B, SHIN J, SHIN W C, et al. Sparse DETR: efficient end-to-end object detection with learnable sparsity[EB/OL].[2024-01-14]. https://arxiv.org/abs/2111.14330. [14] ZHENG D H, DONG W H, HU H L, et al. Less is more: focus attention for efficient DETR[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 6651-6660. [15] 王国明, 贾代旺. 基于YOLOv8的小目标检测模型的优化[J]. 计算机工程, 2025, 51(12): 294-303. WANG G M, JIA D W. Optimization of small object detection model based on YOLOv8[J]. Computer Engineering, 2025, 51(12): 294-303. (in Chinese) [16] 董刚, 谢维成, 黄小龙, 等. 深度学习小目标检测算法综述[J]. 计算机工程与应用, 2023, 59(11): 16-27. DONG G, XIE W C, HUANG X L, et al. Review of small object detection algorithms based on deep learning[J]. Computer Engineering and Applications, 2023, 59(11): 16-27. (in Chinese) [17] ZHANG J, HUANG J, LUO Z, et al. DA-DETR: domain adaptive detection Transformer with information fusion[EB/OL].[2024-01-14]. https://arxiv.org/abs/2103.17084. [18] WANG T, YUAN L, CHEN Y P, et al. PnP-DETR: towards efficient visual analysis with Transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 4641-4650. [19] ZHANG C, LIU L, ZANG X, et al. DETR++: taming your multi-scale detection Transformer[EB/OL].[2024-01-14]. https://arxiv.org/abs/2206.02977. [20] TAN M X, PANG R M, LE Q V. EfficientDet: scalable and efficient object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 1-9. [21] ZONG Z F, SONG G L, LIU Y. DETRs with collaborative hybrid assignments training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 6725-6735. [22] YAO Z, AI J, LI B, et al. Efficient DETR: improving end-to-end object detector with dense prior[EB/OL].[2024-01-14]. https://arxiv.org/abs/2104.01318. [23] MENG D P, CHEN X K, FAN Z J, et al. Conditional DETR for fast training convergence[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 3631-3640. [24] CHEN X, WEI F, ZENG G, et al. Conditional DETR V2: efficient detection Transformer with box queries[EB/OL].[2024-01-14]. https://arxiv.org/abs/2207.08914. [25] WANG Y M, ZHANG X Y, YANG T, et al. Anchor DETR: query design for Transformer-based detector[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2022: 2567-2575. [26] LIU S, LI F, ZHANG H, et al. DAB-DETR: dynamic anchor boxes are better queries for DETR[EB/OL].[2024-01-14]. https://arxiv.org/abs/2201.12329. [27] LIU Y, ZHANG Y, WANG Y X, et al. SAP-DETR: bridging the gap between salient points and queries-based Transformer detector for fast model convergency[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 15539-15547. [28] LI F, ZHANG H, LIU S L, et al. DN-DETR: accelerate DETR training by introducing query DeNoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 13609-13617. [29] ZHANG H, LI F, LIU S, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[EB/OL].[2024-01-14]. https://arxiv.org/abs/2203.03605. [30] CHEN Q, CHEN X K, WANG J, et al. Group DETR: fast DETR training with group-wise one-to-many assignment[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 6610-6619. [31] JIA D, YUAN Y H, HE H D, et al. DETRs with hybrid matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 19702-19712. [32] 潘晓英, 贾凝心, 穆元震, 等. 小目标检测研究综述[J]. 中国图象图形学报, 2023, 28(9): 2587-2615. PAN X Y, JIA N X, MU Y Z, et al. Survey of small object detection[J]. Journal of Image and Graphics, 2023, 28(9): 2587-2615. (in Chinese) [33] 王福军, 王星, 王柯迪. 基于双域查询增强Transformer的遥感图像旋转小目标检测[J]. 吉林大学学报(理学版), 2025, 63(5): 1418-1426. WANG F J, WANG X, WANG K D. Rotated small object detection of remote sensing images based on dual-domain query enhanced Transformer[J]. Journal of Jilin University (Science Edition), 2025, 63(5): 1418-1426. (in Chinese) [34] LI F, ZENG A L, LIU S L, et al. Lite DETR: an interleaved multi-scale encoder for efficient DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 18558-18567. [35] ZHAO Y, LÜ W, XU S, et al. DETRs beat YOLOs on real-time object detection[EB/OL].[2024-01-14]. https://arxiv.org/abs/2304.08069. [36] ZHANG G, LUO Z, CUI K, et al. Meta-DETR: image-level few-shot object detection with inter-class correlation exploitation[EB/OL].[2024-01-14]. https://arxiv.org/abs/2103.11731. [37] BULAT A, GUERRERO R, MARTINEZ B, et al. FS-DETR: few-shot detection Transformer with prompting and without re-training[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2023: 11759-11768. [38] RADFORD A, KIM J, HALLACY C, et al. Learning Transferable visual models from natural language supervision[EB/OL].[2024-01-14]. https://arxiv.org/abs/2103.00020. [39] RADFORD A, NARASIMHAN K. Improving language understanding by generative pre-training[EB/OL].[2024-01-14]. https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035. [40] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[EB/OL].[2024-01-14]. https://arxiv.org/abs/1810.04805. [41] DAI Z G, CAI B L, LIN Y G, et al. Unsupervised pre-training for detection Transformers[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12772-12782. [42] CARON M, MISRA I, MAIRAL J, et al. Unsupervised learning of visual features by contrasting cluster assignments[EB/OL].[2024-01-14]. https://arxiv.org/abs/2006.09882. [43] CHEN Z R, HUANG G S, LI W, et al. Siamese DETR[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2023: 15722-15731. [44] LIU S L, HUANG S J, LI F, et al. DQ-DETR: dual query detection Transformer for phrase extraction and grounding[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2023: 1728-1736. [45] KAMATH A, SINGH M, LECUN Y, et al. MDETR—modulated detection for end-to-end multi-modal understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2021: 1760-1770. [46] SHI F Y, GAO R P, HUANG W L, et al. Dynamic MDETR: a dynamic multimodal Transformer decoder for visual grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(2): 1181-1198. [47] ZANG Y H, LI W, ZHOU K Y, et al. Open-vocabulary DETR with conditional matching[C]//Proceedings of the European Conference on Computer Vision.Berlin, Germany: Springer Nature Switzerland, 2022: 106-122. [48] WANG J, SUN A, ZHANG H, et al. MS-DETR: natural language video localization with sampling moment-moment interaction[EB/OL].[2024-01-14]. https://arxiv.org/abs/2305.18969. [49] 周丽娟, 毛嘉宁. 视觉Transformer识别任务研究综述[J]. 中国图象图形学报, 2023, 28(10): 2969-3003. ZHOU L J, MAO J N. Vision Transformer-based recognition tasks: a critical review[J]. Journal of Image and Graphics, 2023, 28(10): 2969-3003. (in Chinese) [50] 王杨, 宋世佳, 王鹤琴, 等. 基于改进Vision Transformer的局部光照一致性估计[J]. 计算机工程, 2025, 51(2): 312-321. WANG Y, SONG S J, WANG H Q, et al. Estimation of local illumination consistency based on improved Vision Transformer[J]. Computer Engineering, 2025, 51(2): 312-321. (in Chinese) |