
Computer Engineering

   

Video ViT Adapter for Action Recognition

  

  • Published: 2025-10-16


Abstract: Video understanding tasks face two major challenges: insufficient computational resources and the scarcity of video datasets. Current video models are massive and computationally intensive, relying on expensive hardware and lengthy training periods, while the limited scale of video datasets restricts how well models can train and generalize. To address these problems, an efficient transfer learning method, the adapter training strategy, is introduced. By freezing all the weights of the pre-trained Vision Transformer (ViT) model and fine-tuning only the parameters in the adapter, resource consumption can be significantly reduced while fully retaining the representational advantages of the pre-trained model. Based on this strategy, a hierarchical adapter and a ViT backbone network are designed to jointly construct the Video ViT Adapter (VVA) model. The hierarchical adapter employs three spatiotemporal convolutions of different dimensions, which helps balance local detail against global spatiotemporal context. Additionally, the Contrastive Language-Image Pre-training (CLIP) model, which possesses strong cross-modal learning capabilities, is adopted as the pre-trained model; it provides the VVA model with rich feature representations and facilitates effective fusion across data modalities. With only 9.50M trainable parameters, VVA achieves strong results on three standard action recognition benchmarks: 79.32% accuracy on Kinetics-400, 97.77% on UCF101, and 81.78% on HMDB51. These results demonstrate that the efficiency and convenience of the adapter strategy can effectively address both challenges.
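As a rough illustration of the adapter training strategy described above, the PyTorch sketch below freezes a stand-in backbone and updates only a small hierarchical adapter whose three depthwise 3-D convolutions operate at different spatiotemporal extents. The kernel sizes, bottleneck width, and module layout here are assumptions for illustration, not the paper's exact VVA configuration.

```python
import torch
import torch.nn as nn


class HierarchicalAdapter(nn.Module):
    """Hypothetical sketch of a hierarchical adapter: three depthwise 3-D
    convolutions with different kernel sizes capture spatiotemporal
    relations at local, mid-range, and wider scales."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project into a small bottleneck
        # Three spatiotemporal convolutions over (T, H, W); the kernel
        # sizes below are assumptions, not taken from the paper.
        self.convs = nn.ModuleList([
            nn.Conv3d(bottleneck, bottleneck, k,
                      padding=tuple(x // 2 for x in k), groups=bottleneck)
            for k in [(1, 3, 3), (3, 3, 3), (3, 5, 5)]
        ])
        self.up = nn.Linear(bottleneck, dim)  # project back to the ViT width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (B, T*H*W, dim) token sequence from a frozen ViT block
        b, n, d = x.shape
        z = self.act(self.down(x))
        z = z.transpose(1, 2).reshape(b, -1, t, h, w)  # to (B, C, T, H, W)
        z = sum(conv(z) for conv in self.convs)        # fuse the three scales
        z = z.flatten(2).transpose(1, 2)               # back to (B, N, C)
        return x + self.up(z)                          # residual connection


if __name__ == "__main__":
    dim, t, h, w = 768, 8, 14, 14
    backbone = nn.Linear(dim, dim)  # stand-in for one frozen CLIP ViT block
    adapter = HierarchicalAdapter(dim)
    for p in backbone.parameters():
        p.requires_grad = False      # freeze all pre-trained weights
    x = torch.randn(2, t * h * w, dim)
    y = adapter(backbone(x), t, h, w)
    trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
    print(y.shape, f"trainable adapter params: {trainable}")
```

Because the adapter ends in a residual connection, it can start close to an identity mapping, so the frozen backbone's representations are preserved while only the small set of adapter parameters is optimized.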
