
Computer Engineering

   

Video ViT Adapter for Action Recognition

  

  • Published: 2025-10-16


Abstract: Video understanding tasks face two major challenges: insufficient computational resources and the scarcity of video datasets. Current video models are massive and computationally intensive, relying on expensive hardware and lengthy training periods, while the limited scale of video datasets restricts how well models can train and generalize. To address these problems, an efficient transfer learning method, the adapter training strategy, is introduced. By freezing all the weights of the pre-trained Vision Transformer (ViT) model and fine-tuning only the parameters in the adapter, resource consumption can be significantly reduced while fully retaining the representational advantages of the pre-trained model. Based on this strategy, a hierarchical adapter and a ViT backbone network are designed to jointly construct the Video ViT Adapter (VVA) model. The hierarchical adapter employs three spatiotemporal convolutions of different dimensions, which helps balance local detail against global spatiotemporal context. Additionally, the Contrastive Language-Image Pre-training (CLIP) model, which possesses strong cross-modal learning capabilities, is adopted as the pre-trained model; it provides the VVA model with rich feature representations and facilitates effective fusion across data modalities. With only 9.50M trainable parameters, VVA achieves strong results on three standard action recognition benchmarks: 79.32% accuracy on Kinetics-400, 97.77% on UCF101, and 81.78% on HMDB51. These results demonstrate that the efficiency and convenience of the adapter strategy can effectively address both challenges.
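As a rough illustration of the adapter training strategy described above, the PyTorch sketch below freezes a stand-in backbone and updates only a small hierarchical adapter whose three depthwise 3-D convolutions operate at different spatiotemporal extents. The kernel sizes, bottleneck width, and module layout here are assumptions for illustration, not the paper's exact VVA configuration.

```python
import torch
import torch.nn as nn


class HierarchicalAdapter(nn.Module):
    """Hypothetical sketch of a hierarchical adapter: three depthwise 3-D
    convolutions with different kernel sizes capture spatiotemporal
    relations at local, mid-range, and wider scales."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project into a small bottleneck
        # Three spatiotemporal convolutions over (T, H, W); the kernel
        # sizes below are assumptions, not taken from the paper.
        self.convs = nn.ModuleList([
            nn.Conv3d(bottleneck, bottleneck, k,
                      padding=tuple(x // 2 for x in k), groups=bottleneck)
            for k in [(1, 3, 3), (3, 3, 3), (3, 5, 5)]
        ])
        self.up = nn.Linear(bottleneck, dim)  # project back to the ViT width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (B, T*H*W, dim) token sequence from a frozen ViT block
        b, n, d = x.shape
        z = self.act(self.down(x))
        z = z.transpose(1, 2).reshape(b, -1, t, h, w)  # to (B, C, T, H, W)
        z = sum(conv(z) for conv in self.convs)        # fuse the three scales
        z = z.flatten(2).transpose(1, 2)               # back to (B, N, C)
        return x + self.up(z)                          # residual connection


if __name__ == "__main__":
    dim, t, h, w = 768, 8, 14, 14
    backbone = nn.Linear(dim, dim)  # stand-in for one frozen CLIP ViT block
    adapter = HierarchicalAdapter(dim)
    for p in backbone.parameters():
        p.requires_grad = False      # freeze all pre-trained weights
    x = torch.randn(2, t * h * w, dim)
    y = adapter(backbone(x), t, h, w)
    trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
    print(y.shape, f"trainable adapter params: {trainable}")
```

Because the adapter ends in a residual connection, it can start close to an identity mapping, so the frozen backbone's representations are preserved while only the small set of adapter parameters is optimized.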
