[1] Redmon J, Divvala S, Girshick R, et al. You only
look once: Unified, real-time object
detection[C]//Proceedings of the IEEE
conference on computer vision and pattern
recognition. 2016: 779-788.
[2] Liu W, Anguelov D, Erhan D, et al. SSD: Single
shot multibox detector[C]//Computer
Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14,
2016, Proceedings, Part I 14. Springer
International Publishing, 2016: 21-37.
[3] Ren S, He K, Girshick R, et al. Faster R-CNN:
Towards real-time object detection with region
proposal networks[J]. Advances in neural
information processing systems, 2015, 28.
[4] He K, Gkioxari G, Dollár P, et al. Mask
R-CNN[C]//Proceedings of the IEEE international
conference on computer vision. 2017:
2961-2969.
[5] Redmon J, Farhadi A. YOLOv3: An incremental
improvement[J]. arXiv preprint
arXiv:1804.02767, 2018.
[6] Carion N, Massa F, Synnaeve G, et al.
End-to-end object detection with
transformers[C]//European conference on
computer vision. Cham: Springer International
Publishing, 2020: 213-229.
[7] Everingham M, Van Gool L, Williams C K I, et al.
The PASCAL visual object classes (VOC) challenge[J].
International journal of computer vision, 2010,
88: 303-338.
[8] Liu L, Ouyang W, Wang X, et al. Deep learning
for generic object detection: A survey[J].
International journal of computer vision, 2020,
128: 261-318.
[9] Ulku I, Akagündüz E. A survey on deep
learning-based architectures for semantic
segmentation on 2d images[J]. Applied Artificial Intelligence, 2022, 36(1): 2032924.
[10] Zaidi S S A, Ansari M S, Aslam A, et al. A survey
of modern deep learning based object detection
models[J]. Digital Signal Processing, 2022, 126:
103514.
[11] Arkin E, Yadikar N, Muhtar Y, et al. A survey of
object detection based on CNN and
Transformer[C]//2021 IEEE 2nd international
conference on pattern recognition and machine
learning (PRML). IEEE, 2021: 99-108.
[12] Khan S, Naseer M, Hayat M, et al. Transformers
in vision: A survey[J]. ACM computing surveys
(CSUR), 2022, 54(10s): 1-41.
[13] Arkin E, Yadikar N, Xu X, et al. A survey: Object
detection methods from CNN to Transformer[J].
Multimedia Tools and Applications, 2023, 82(14):
21353-21383.
[14] Enzweiler M, Gavrila D M. Monocular pedestrian
detection: Survey and experiments[J]. IEEE
transactions on pattern analysis and machine
intelligence, 2008, 31(12): 2179-2195.
[15] Cheng G, Han J. A survey on object detection in
optical remote sensing images[J]. ISPRS journal
of photogrammetry and remote sensing, 2016,
117: 11-28.
[16] Chaudhari S, Mithal V, Polatkan G, et al. An
attentive survey of attention models[J]. ACM
Transactions on Intelligent Systems and
Technology (TIST), 2021, 12(5): 1-32.
Li Qingge, Yang Xiaogang, Lu Ruitao, et al. Overview of
Transformer development in computer vision[J]. Journal
of Chinese Computer Systems, 2023, 44(04): 850-861.
DOI: 10.20009/j.cnki.21-1106/TP.2022-0504.
[18] Tian Yonglin, Wang Yutong, Wang Jiangong, et al. Key
issues in visual Transformer research: current status and
prospects[J]. Acta Automatica Sinica, 2022, 48(04):
957-979. DOI: 10.16383/j.aas.c220027.
[19] Li Jian, Du Jianqiang, Zhu Yanchen, et al. Overview of
Transformer-based object detection algorithms[J].
Computer Engineering and Applications, 2023, 59(10):
48-64.
[20] Liu Yujing. Overview of Transformer-based object
detection research[J]. Computer Era, 2023(05): 6-10.
DOI: 10.16644/j.cnki.cn33-1094/tp.2023.05.002.
[21] Fan Rong, Ma Xiaolu. Improved DETR algorithm for
crowded pedestrian detection[J]. Computer Engineering
and Applications, 2023, 59(19): 159-165.
[22] Li Xiaojun, Liu Ying. Overview and prospects of DETR
object detection algorithm research[J]. Microcontrollers
& Embedded Systems, 2023, 23(05): 40-42.
[23] Zhang Zhizheng. Research on attention mechanisms in
neural networks[D]. University of Science and
Technology of China, 2021. DOI:
10.27517/d.cnki.gzkju.2021.000623.
[24] Chang Yue. Research on multimodal scene classification
algorithms based on the self-attention mechanism[D].
Nanjing University of Posts and Telecommunications,
2022. DOI: 10.27251/d.cnki.gnjdc.2022.000645.
[25] Wang Keping, Zhang Zijiao, Yang Yi, et al. Non-uniform
dehazing algorithm based on dual-attention convolution
and Transformer fusion[J/OL]. Journal of Beijing
University of Posts and Telecommunications, 1-8
[2023-12-19].
[26] Wang Keping, Zhang Zijiao, Yang Yi, et al. Non-uniform
dehazing algorithm based on dual-attention convolution
and Transformer fusion[J/OL]. Journal of Beijing
University of Posts and Telecommunications, 1-8
[2023-12-19].
[27] Wang Peisen. Research on deep learning methods for
image classification based on attention mechanisms[D].
University of Science and Technology of China, 2018.
[28] Zhu Zhangli, Rao Yuan, Wu Yuan, et al. Research
progress of attention mechanisms in deep learning[J].
Journal of Chinese Information Processing, 2019, 33(06):
1-11.
[29] Zhang Chenjia, Zhu Lei, Yu Lu. Overview of attention
mechanisms in convolutional neural networks[J].
Computer Engineering and Applications, 2021, 57(20):
64-72.
[30] Ren Huan, Wang Xuguang. Overview of attention
mechanisms[J]. Journal of Computer Applications, 2021,
41(S1): 1-6.
[31] Liu Wenting, Lu Xinming. Research progress of
Transformer in computer vision[J]. Computer
Engineering and Applications, 2022, 58(06): 1-16.
[32] Everingham M, Van Gool L, Williams C K I, et al. The
PASCAL visual object classes (VOC) challenge[J].
International Journal of Computer Vision, 2010, 88(2):
303-338.
[33] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO:
Common objects in context[C]//European conference on
computer vision. Springer, 2014: 740-755.
[34] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale
hierarchical image database[C]//2009 IEEE conference on
computer vision and pattern recognition. IEEE, 2009:
248-255.
[35] He K, Zhang X, Ren S, et al. Deep residual learning for
image recognition[C]//Proceedings of the IEEE conference
on computer vision and pattern recognition. 2016:
770-778.
[36] Zheng S, Lu J, Zhao H, et al. Rethinking semantic
segmentation from a sequence-to-sequence perspective
with transformers[C]//Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition.
2021: 6881-6890.
[37] Xie E, Wang W, Yu Z, et al. SegFormer: Simple and
efficient design for semantic segmentation with
transformers[J]. Advances in Neural Information
Processing Systems, 2021, 34: 12077-12090.
[38] Reed S, Akata Z, Yan X, et al. Generative adversarial text
to image synthesis[C]//International conference on
machine learning. PMLR, 2016: 1060-1069.
[39] Zhang H, Xu T, Li H, et al. StackGAN: Text to
photo-realistic image synthesis with stacked generative
adversarial networks[C]//Proceedings of the IEEE
international conference on computer vision. 2017:
5907-5915.
[40] Zhang H, Xu T, Li H, et al. StackGAN++: Realistic image
synthesis with stacked generative adversarial networks[J].
IEEE transactions on pattern analysis and machine
intelligence, 2018, 41(8): 1947-1962.
[41] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative
adversarial networks[J]. Communications of the ACM,
2020, 63(11): 139-144.
[42] Chen M, Radford A, Child R, et al. Generative pretraining
from pixels[C]//International conference on machine
learning. PMLR, 2020: 1691-1703.
[43] Esser P, Rombach R, Ommer B. Taming transformers for
high-resolution image synthesis[C]//Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition. 2021: 12873-12883.
[44] Jiang Y, Chang S, Wang Z. TransGAN: Two transformers
can make one strong GAN[J]. arXiv preprint
arXiv:2102.07074, 2021.
[45] Chen M, Radford A, Child R, et al. Generative pretraining
from pixels[C]//International conference on machine
learning. PMLR, 2020: 1691-1703.
[46] Jiang Y, Chang S, Wang Z. TransGAN: Two pure
transformers can make one strong GAN, and that can scale
up[J]. Advances in Neural Information Processing Systems,
2021, 34: 14745-14758.
[47] Ding M, Yang Z, Hong W, et al. CogView: Mastering
text-to-image generation via transformers[J]. Advances in
Neural Information Processing Systems, 2021, 34:
19822-19835.
[48] Van Den Oord A, Vinyals O. Neural discrete representation
learning[J]. Advances in neural information processing
systems, 2017, 30.
[49] Liu S, Fan H, Qian S, et al. HiT: Hierarchical transformer
with momentum contrast for video-text
retrieval[C]//Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2021: 11915-11925.
[50] Lin K, Li L, Lin C C, et al. SwinBERT: End-to-end
transformers with sparse attention for video
captioning[C]//Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2022:
17949-17958.
[51] Carion N, Massa F, Synnaeve G, et al. End-to-end object
detection with transformers[C]//European conference on
computer vision. Cham: Springer International Publishing,
2020: 213-229.
[52] Girshick R, Donahue J, Darrell T, et al. Rich feature
hierarchies for accurate object detection and semantic
segmentation[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2014: 580-587.
[53] Dai J, Qi H, Xiong Y, et al. Deformable convolutional
networks[C]//Proceedings of the IEEE international
conference on computer vision. 2017: 764-773.
[54] Zhu X, Su W, Lu L, et al. Deformable DETR: Deformable
transformers for end-to-end object detection[J]. arXiv
preprint arXiv:2010.04159, 2020.
[55] Dai X, Chen Y, Yang J, et al. Dynamic DETR: End-to-end
object detection with dynamic attention[C]//Proceedings of
the IEEE/CVF International Conference on Computer
Vision. 2021: 2988-2997.
[56] Yao Z, Ai J, Li B, et al. Efficient DETR: Improving
end-to-end object detector with dense prior[J]. arXiv
preprint arXiv:2104.01318, 2021.
[57] Gao P, Zheng M, Wang X, et al. Fast convergence of DETR
with spatially modulated co-attention[C]//Proceedings of
the IEEE/CVF international conference on computer
vision. 2021: 3621-3630.
[58] Roh B, Shin J W, Shin W, et al. Sparse DETR: Efficient
end-to-end object detection with learnable sparsity[J].
arXiv preprint arXiv:2111.14330, 2021.
[59] Lv W, Xu S, Zhao Y, et al. DETRs beat YOLOs on real-time
object detection[J]. arXiv preprint arXiv:2304.08069,
2023.
[60] Li F, Zeng A, Liu S, et al. Lite DETR: An interleaved
multi-scale encoder for efficient DETR[C]//Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2023: 18558-18567.
[61] Zhang G, Luo Z, Yu Y, et al. Accelerating DETR
convergence via semantic-aligned
matching[C]//Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2022: 949-958.
[62] Cao X, Yuan P, Feng B, et al. CF-DETR: Coarse-to-fine
transformers for end-to-end object
detection[C]//Proceedings of the AAAI Conference on
Artificial Intelligence. 2022, 36(1): 185-193.
[63] Liu S, Li F, Zhang H, et al. DAB-DETR: Dynamic anchor
boxes are better queries for DETR[J]. arXiv preprint
arXiv:2201.12329, 2022.
[64] Meng D, Chen X, Fan Z, et al. Conditional DETR for fast
training convergence[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021:
3651-3660.
[65] Zhang H, Li F, Liu S, et al. DINO: DETR with improved
denoising anchor boxes for end-to-end object detection[J].
arXiv preprint arXiv:2203.03605, 2022.
[66] Dai Z, Cai B, Lin Y, et al. UP-DETR: Unsupervised
pre-training for object detection with
transformers[C]//Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition. 2021:
1601-1610.
[67] Li F, Zhang H, Liu S, et al. DN-DETR: Accelerate DETR
training by introducing query denoising[C]//Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2022: 13619-13627.
[68] Chen Q, Chen X, Wang J, et al. Group DETR: Fast DETR
training with group-wise one-to-many
assignment[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023:
6633-6642.