[1] Li X, Zhu Y, Wang L. ZeroI2V: Zero-cost adaptation of pre-trained transformers from image to video[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 425-443.
[2] Yang T, Zhu Y, Xie Y, et al. AIM: Adapting image models for efficient video action recognition[EB/OL]. [2023-03-24]. https://doi.org/10.48550/arXiv.2302.03024.
[3] Floridi L, Chiriatti M. GPT-3: Its nature, scope, limits, and consequences[J]. Minds and Machines, 2020, 30(4): 681-694.
[4] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[5] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[6] Pan J, Lin Z, Zhu X, et al. ST-Adapter: Parameter-efficient image-to-video transfer learning[J]. Advances in Neural Information Processing Systems, 2022, 35: 26462-26477.
[7] 龚安, 赵宗泽, 张贵临. 多模态交叉注意力融合的视频动作识别[J]. 信息技术, 2025, (06): 70-75+80. DOI: 10.13274/j.cnki.hdzj.2025.06.012.
Gong A, Zhao Z, Zhang G. Multimodal cross-attention fusion for video action recognition[J]. Information Technology, 2025, (06): 70-75+80. DOI: 10.13274/j.cnki.hdzj.2025.06.012.
[8] Jia M, Tang L, Chen B C, et al. Visual prompt tuning[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 709-727.
[9] Hu E J, Shen Y, Wallis P, et al. LoRA: Low-rank adaptation of large language models[C]//International Conference on Learning Representations (ICLR). 2022.
[10] Zhou K, Yang J, Loy C C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130(9): 2337-2348.
[11] Chen S, Ge C, Tong Z, et al. AdaptFormer: Adapting vision transformers for scalable visual recognition[J]. Advances in Neural Information Processing Systems, 2022, 35: 16664-16678.
[12] Li K, Wang Y, Gao P, et al. UniFormer: Unified transformer for efficient spatiotemporal representation learning[EB/OL]. [2022-01-12]. https://doi.org/10.48550/arXiv.2201.04676.
[13] Ju C, Han T, Zheng K, et al. Prompting visual-language models for efficient video understanding[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 105-124.
[14] Wang Q, Hu Q, Gao Z, et al. AMS-Net: Modeling adaptive multi-granularity spatio-temporal cues for video action recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[15] Mou Y, Jiang X, Xu K, et al. Compressed video action recognition with dual-stream and dual-modal transformer[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(5): 3299-3312.
[16] Soufleri E, Ravikumar D, Roy K. Advancing compressed video action recognition through progressive knowledge distillation[EB/OL]. 2024. https://doi.org/10.48550/arXiv.2407.02713.
[17] Zhou J, Ming Y. Frequency-temporal feature integration for compressed video action recognition[C]//Proceedings of the British Machine Vision Conference (BMVC). [S.l.]: BMVA Press, 2025.
[18] Wang J, Chen D, Luo C, et al. OmniVid: A generative framework for universal video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 18209-18220.
[19] Wang R, Chen D, Wu Z, et al. BEVT: BERT pretraining of video transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 14733-14743.
[20] Pfeiffer J, Kamath A, Rücklé A, et al. AdapterFusion: Non-destructive task composition for transfer learning[EB/OL]. [2023-03-24]. https://doi.org/10.48550/arXiv.2005.00247.
[21] Liu Z, Wang L, Wu W, et al. TAM: Temporal adaptive module for video recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 13708-13718.
[22] 张祖习, 张战成, 胡伏原. 局部与长程时序互补建模的视频动作识别[J/OL]. 计算机应用, 1-10 [2025-09-16]. https://link.cnki.net/urlid/51.1307.TP.20250725.1050.004.
Zhang Z, Zhang Z, Hu F. Video action recognition via local and long-range temporal complementary modeling[J/OL]. Journal of Computer Applications, 1-10 [2025-09-16]. https://link.cnki.net/urlid/51.1307.TP.20250725.1050.004.
[23] Kay W, Carreira J, Simonyan K, et al. The Kinetics human action video dataset[EB/OL]. [2017-05-19]. https://doi.org/10.48550/arXiv.1705.06950.
[24] Kuehne H, Jhuang H, Garrote E, et al. HMDB: A large video database for human motion recognition[C]//Proceedings of the International Conference on Computer Vision. Washington D.C., USA: IEEE Computer Society, 2011: 2556-2563.
[25] Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild[EB/OL]. 2012. https://doi.org/10.48550/arXiv.1212.0402.
[26] Li X, Li S, Ma M. Interactive and Balanced Multimodal Learning via Cross Attention and Gradient Modulation for Compressed Video Action Recognition[C]//ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025: 1-5.
[27] Wang M, Xing J, Mei J, et al. ActionCLIP: Adapting language-image pretrained models for video action recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[28] 王晓路, 汶建荣. 基于运动-时间感知的人体动作识别方法[J]. 计算机工程, 2025, 51(1): 216-224.
Wang X, Wen J. Human action recognition method based on action-time perception[J]. Computer Engineering, 2025, 51(1): 216-224.
[29] Liu Z, Ning J, Cao Y, et al. Video Swin Transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3202-3211.
[30] Yan S, Xiong X, Arnab A, et al. Multiview transformers for video recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3333-3343.
[31] Tong Z, Song Y, Wang J, et al. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training[J]. Advances in Neural Information Processing Systems, 2022, 35.
[32] Wasim S T, Naseer M, Khan S, et al. Vita-CLIP: Video and text adaptive CLIP via multimodal prompting[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 23034-23044.
[33] Lin Z, Geng S, Zhang R, et al. Frozen CLIP models are efficient video learners[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 388-404.
[34] Chang C, Lu T, Yao F. MST-Adapter: Multi-Scaled Spatio-Temporal Adapter for Parameter-Efficient Image-to-Video Transfer Learning[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2024.