[1] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C]//Proceedings of the International Conference on Learning Representations. Virtual Conference: ICLR, 2021.
[2] Wang W, Xie E, Li X, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 548-558.
[3] Liu X, Peng H, Zheng N, et al. EfficientViT: Memory-Efficient Vision Transformer with Cascaded Group Attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14420-14430.
[4] Vasu P K A, Gabriel J, Zhu J, et al. FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 5762-5772.
[5] Fan Q, Huang H, Guan J, et al. Rethinking Local Perception in Lightweight Vision Transformer[J]. arXiv preprint arXiv:2303.17803, 2023.
[6] Setyawan N, Kurniawan G W, Sun C C, et al. ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding[J]. arXiv preprint arXiv:2403.15004, 2024.
[7] Liu Z, Lin Y T, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 9992-10002.
[8] Dong X, Bao J, Chen D, et al. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12114-12124.
[9] Bai S W, Wang M Y, Hu J, et al. Multi-Region Attention Network for Fine-Grained Image Classification[J]. Computer Engineering, 2024, 50(1): 271-278. (in Chinese)
[10] Han D C, Pan X R, Han Y Z, et al. FLatten Transformer: Vision Transformer Using Focused Linear Attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 5938-5948.
[11] Han D, Ye T, Han Y, et al. Agent Attention: On the Integration of Softmax and Linear Attention[J]. arXiv preprint arXiv:2312.08874, 2023.
[12] Shaker A, Maaz M, Rasheed H, et al. SwiftFormer: Efficient Additive Attention for Transformer-Based Real-Time Mobile Vision Applications[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 17379-17390.
[13] Yao T, Li Y H, Pan Y W, et al. HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46: 6431-6442.
[14] Wang W, Xie E, Li X, et al. PVT v2: Improved Baselines with Pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8: 415-424.
[15] Ding X, Zhang Y, Ge Y, et al. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5513-5524.
[16] Woo S, Debnath S, Hu R, et al. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 16133-16142.
[17] Yu W, Si C, Zhou P, et al. MetaFormer Baselines for Vision[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(2): 896-912.
[18] Rao Y, Zhao W, Zhu Z, et al. GFNet: Global Filter Networks for Visual Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10960-10973.
[19] Deng J, Dong W, Socher R, et al. ImageNet: A Large-Scale Hierarchical Image Database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248-255.
[20] Krizhevsky A, Hinton G. Learning Multiple Layers of Features from Tiny Images[R]. Toronto: University of Toronto, 2009.
[21] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[22] Loshchilov I, Hutter F. Decoupled Weight Decay Regularization[J]. arXiv preprint arXiv:1711.05101, 2017.
[23] Loshchilov I, Hutter F. SGDR: Stochastic Gradient Descent with Warm Restarts[J]. arXiv preprint arXiv:1608.03983, 2016.
[24] He K, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
[25] Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980-2988.
[26] Setyawan N, Sun C C, Hsu M H, et al. MicroViT: A Vision Transformer with Low Complexity Self-Attention for Edge Device[C]//Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2025: 1-5.
[27] Li Y, Zhang K, Cao J, et al. LocalViT: Bringing Locality to Vision Transformers[J]. arXiv preprint arXiv:2104.05707, 2021.
[28] Yun S, Ro Y. SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5756-5767.
[29] Zhang J, Li X, Wang Y, et al. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm[J]. International Journal of Computer Vision, 2024, 132(9): 3509-3536.
[30] Vasu P K A, Gabriel J, Zhu J, et al. MobileOne: An Improved One Millisecond Mobile Backbone[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 7907-7917.
[31] Wang A, Chen H, Lin Z, et al. LSNet: See Large, Focus Small[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2025: 9718-9729.
[32] Wang A, Chen H, Lin Z, et al. RepViT: Revisiting Mobile CNN from ViT Perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 15909-15920.
[33] Zhu L, Wang X, Ke Z, et al. BiFormer: Vision Transformer with Bi-Level Routing Attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 10323-10333.
[34] Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[35] Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510-4520.
[36] Howard A, Sandler M, Chu G, et al. Searching for MobileNetV3[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 1314-1324.
[37] Cui X Y, Fan R L, Jin L Z, et al. Transformer for Image Classification with Dual Knowledge Distillation and Multi-Scale Feature Learning[J]. Computer Engineering and Applications, 2024. DOI: 10.3778/j.issn.1002-8331.2503-0119. (in Chinese)