
Computer Engineering (计算机工程)


Micro-Gesture Recognition Method Based on Multi-modal Collaborative Enhancement

  • Published: 2025-08-27

Abstract: Micro-gestures are subtle, unconscious movements driven by inner emotions; because they reveal an individual's hidden affective state, they are of significant value in affective computing. They are transient in time and, in space, small in amplitude with ambiguous boundaries, making them a typical fine-grained behavior from which traditional methods struggle to extract effective features. To address this, this paper proposes a micro-gesture recognition method based on multi-modal collaborative enhancement, which builds video, skeleton, and text into a triplet of complementary representations. The framework moves beyond conventional vision-language models by introducing the skeleton modality as a kinematic prior to bridge the visual and semantic gaps and, combined with visual context and semantic guidance, constructs a multi-source complementary feature representation. Two collaborative modules are designed at different levels. The Video-Pose Collaborative Module (VPCM) fuses fine-grained video details with the global motion information of the skeleton and applies a cross-temporal attention mechanism to expand the feature representation and strengthen temporal modeling. The Text-Pose Collaborative Module (TPCM) introduces semantic priors from the text modality and adopts a Top-K fusion strategy to reinforce the semantic relevance of skeleton features, improving the capture of fine-grained cues. To further optimize multi-modal fusion, a two-stage training strategy is proposed: unimodal encoders are pre-trained first, and collaborative learning is then performed through lightweight adapters and the collaborative modules, which effectively improves model accuracy. Experiments on a mainstream micro-gesture dataset show that the proposed model reaches 70.40% recognition accuracy, surpassing current state-of-the-art methods.
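
To make the VPCM idea concrete, here is a minimal PyTorch sketch of a cross-temporal attention block in which skeleton features act as queries over per-frame video features, so global motion cues can attend to fine-grained appearance details. The abstract does not specify the paper's actual architecture; the module name `VideoPoseCollab`, the feature dimensions, and the residual/FFN layout below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VideoPoseCollab(nn.Module):
    """Hypothetical sketch of a video-pose collaborative block (VPCM-style).

    Skeleton embeddings query per-frame video embeddings via cross-attention
    along the time axis. Dimensions and layout are assumptions, not the
    paper's configuration.
    """
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, pose_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # pose_feats:  (B, T_p, D) per-timestep skeleton embeddings
        # video_feats: (B, T_v, D) per-frame video embeddings
        q = self.norm_q(pose_feats)
        kv = self.norm_kv(video_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # pose queries attend over video time steps
        fused = pose_feats + fused             # residual connection
        return fused + self.ffn(fused)         # position-wise refinement


# Toy usage: batch of 8 clips, 16 skeleton steps, 32 video frames, 256-d features.
vpcm = VideoPoseCollab()
out = vpcm(torch.randn(8, 16, 256), torch.randn(8, 32, 256))
print(out.shape)  # torch.Size([8, 16, 256])
```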
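Likewise, the Top-K fusion in the TPCM can be illustrated with a small sketch: a pooled pose feature is matched against text (prompt) embeddings, only the k most similar prompts are kept, and their softmax-weighted mixture is injected back into the pose feature. The function name `topk_text_fusion`, the temperature value, and the residual injection are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def topk_text_fusion(pose_feat: torch.Tensor, text_feats: torch.Tensor,
                     k: int = 5, temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical Top-K text-pose fusion (TPCM-style sketch).

    pose_feat:  (B, D) pooled skeleton representation
    text_feats: (N, D) embeddings of N class/label prompts, N >= k
    """
    p = F.normalize(pose_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = p @ t.t()                                     # (B, N) cosine similarities
    top_val, top_idx = sim.topk(k, dim=-1)              # keep the k best-matching prompts
    weights = F.softmax(top_val / temperature, dim=-1)  # (B, k) fusion weights
    selected = t[top_idx]                               # (B, k, D) selected prompt embeddings
    text_ctx = (weights.unsqueeze(-1) * selected).sum(dim=1)  # (B, D) semantic context
    return pose_feat + text_ctx                         # residual semantic injection


# Toy usage: 4 samples, 32 label prompts, 256-d features.
fused = topk_text_fusion(torch.randn(4, 256), torch.randn(32, 256), k=5)
print(fused.shape)  # torch.Size([4, 256])
```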
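Finally, the two-stage strategy (pre-train unimodal encoders, then train only lightweight adapters and the collaborative modules) is commonly implemented by freezing the backbone and re-enabling a small parameter subset. The bottleneck `Adapter` design and the name-based parameter filter below are hypothetical; the abstract does not describe the paper's adapter internals.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the kind of lightweight module stage two
    might train while the pre-trained encoder stays frozen (assumption)."""
    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


def enter_stage_two(model: nn.Module) -> None:
    """Stage 2 setup: freeze everything, then re-enable adapter and
    collaborative-module parameters by name (naming is an assumption)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if "adapter" in name or "vpcm" in name or "tpcm" in name:
            p.requires_grad = True
```

Only the re-enabled parameters would then be passed to the optimizer, keeping the trainable footprint small relative to the frozen encoders.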