[1] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C]//Proceedings of the International Conference on Learning Representations. Virtual Conference: ICLR, 2021.
[2] Wang W, Xie E, Li X, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 548-558.
[3] Liu X, Peng H, Zheng N, et al. EfficientViT: Memory-Efficient Vision Transformer with Cascaded Group Attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14420-14430.
[4] Vasu P K A, Gabriel J, Zhu J, et al. FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 5762-5772.
[5] Fan Q, Huang H, Guan J, et al. Rethinking Local Perception in Lightweight Vision Transformer[J]. arXiv preprint arXiv:2303.17803, 2023.
[6] Setyawan N, Kurniawan G W, Sun C C, et al. ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding[J]. arXiv preprint arXiv:2403.15004, 2024.
[7] Liu Z, Lin Y T, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 9992-10002.
[8] Dong X, Bao J, Chen D, et al. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12114-12124.
[9] Bai S W, Wang M Y, Hu J, et al. Multi-Region Attention Network for Fine-Grained Image Classification[J]. Computer Engineering, 2024, 50(1): 271-278. (in Chinese)
[10] Han D C, Pan X R, Han Y Z, et al. FLatten Transformer: Vision Transformer Using Focused Linear Attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 5938-5948.
[11] Han D, Ye T, Han Y, et al. Agent Attention: On the Integration of Softmax and Linear Attention[J]. arXiv preprint arXiv:2312.08874, 2023.
[12] Shaker A, Maaz M, Rasheed H, et al. SwiftFormer: Efficient Additive Attention for Transformer-Based Real-Time Mobile Vision Applications[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 17379-17390.
[13] Yao T, Li Y H, Pan Y W, et al. HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46: 6431-6442.
[14] Wang W, Xie E, Li X, et al. PVT v2: Improved Baselines with Pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8: 415-424.
[15] Ding X, Zhang Y, Ge Y, et al. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5513-5524.
[16] Woo S, Debnath S, Hu R, et al. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 16133-16142.
[17] Yu W, Si C, Zhou P, et al. MetaFormer Baselines for Vision[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(2): 896-912.
[18] Rao Y, Zhao W, Zhu Z, et al. GFNet: Global Filter Networks for Visual Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10960-10973.
[19] Deng J, Dong W, Socher R, et al. ImageNet: A Large-Scale Hierarchical Image Database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248-255.
[20] Krizhevsky A, Hinton G. Learning Multiple Layers of Features from Tiny Images[R]. Toronto: University of Toronto, 2009.
[21] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[22] Loshchilov I, Hutter F. Decoupled Weight Decay Regularization[J]. arXiv preprint arXiv:1711.05101, 2017.
[23] Loshchilov I, Hutter F. SGDR: Stochastic Gradient Descent with Warm Restarts[J]. arXiv preprint arXiv:1608.03983, 2016.
[24] He K, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
[25] Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2980-2988.
[26] Setyawan N, Sun C C, Hsu M H, et al. MicroViT: A Vision Transformer with Low Complexity Self-Attention for Edge Device[C]//Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2025: 1-5.
[27] Li Y, Zhang K, Cao J, et al. LocalViT: Bringing Locality to Vision Transformers[J]. arXiv preprint arXiv:2104.05707, 2021.
[28] Yun S, Ro Y. SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5756-5767.
[29] Zhang J, Li X, Wang Y, et al. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm[J]. International Journal of Computer Vision, 2024, 132(9): 3509-3536.
[30] Vasu P K A, Gabriel J, Zhu J, et al. MobileOne: An Improved One Millisecond Mobile Backbone[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 7907-7917.
[31] Wang A, Chen H, Lin Z, et al. LSNet: See Large, Focus Small[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2025: 9718-9729.
[32] Wang A, Chen H, Lin Z, et al. RepViT: Revisiting Mobile CNN from ViT Perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 15909-15920.
[33] Zhu L, Wang X, Ke Z, et al. BiFormer: Vision Transformer with Bi-Level Routing Attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 10323-10333.
[34] Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[35] Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510-4520.
[36] Howard A, Sandler M, Chu G, et al. Searching for MobileNetV3[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 1314-1324.
[37] Cui X Y, Fan R L, Jin L Z, et al. Transformer for Image Classification with Dual Knowledge Distillation and Multi-Scale Feature Learning[J]. Computer Engineering and Applications, 2024. DOI: 10.3778/j.issn.1002-8331.2503-0119. (in Chinese)