
Computer Engineering (计算机工程)


Micro-Gesture Recognition Method Based on Multi-modal Collaborative Enhancement

  • Published: 2025-08-27

Abstract: Micro-gestures are subtle, unconscious movements driven by inner emotions; because they reveal an individual's hidden affective state, they are of significant value in affective computing. They are transient in time and, in space, small in amplitude with ambiguous boundaries, making them a typical fine-grained behavior from which traditional methods struggle to extract effective features. To address this, this paper proposes a micro-gesture recognition method based on multi-modal collaborative enhancement, which builds video, skeleton, and text into a triplet of complementary representations. The framework moves beyond conventional vision-language models by introducing the skeleton modality as a kinematic prior to bridge the visual and semantic gaps and, combined with visual context and semantic guidance, constructs a multi-source complementary feature representation. Two collaborative modules are designed at different levels. The Video-Pose Collaborative Module (VPCM) fuses fine-grained video details with the global motion information of the skeleton and applies a cross-temporal attention mechanism to expand the feature representation and strengthen temporal modeling. The Text-Pose Collaborative Module (TPCM) introduces semantic priors from the text modality and adopts a Top-K fusion strategy to reinforce the semantic relevance of skeleton features, improving the capture of fine-grained cues. To further optimize multi-modal fusion, a two-stage training strategy is proposed: unimodal encoders are pre-trained first, and collaborative learning is then performed through lightweight adapters and the collaborative modules, which effectively improves model accuracy. Experiments on a mainstream micro-gesture dataset show that the proposed model reaches 70.40% recognition accuracy, surpassing current state-of-the-art methods.
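
To make the VPCM idea concrete, here is a minimal PyTorch sketch of a cross-temporal attention block in which skeleton features act as queries over per-frame video features, so global motion cues can attend to fine-grained appearance details. The abstract does not specify the paper's actual architecture; the module name `VideoPoseCollab`, the feature dimensions, and the residual/FFN layout below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VideoPoseCollab(nn.Module):
    """Hypothetical sketch of a video-pose collaborative block (VPCM-style).

    Skeleton embeddings query per-frame video embeddings via cross-attention
    along the time axis. Dimensions and layout are assumptions, not the
    paper's configuration.
    """
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, pose_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # pose_feats:  (B, T_p, D) per-timestep skeleton embeddings
        # video_feats: (B, T_v, D) per-frame video embeddings
        q = self.norm_q(pose_feats)
        kv = self.norm_kv(video_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # pose queries attend over video time steps
        fused = pose_feats + fused             # residual connection
        return fused + self.ffn(fused)         # position-wise refinement


# Toy usage: batch of 8 clips, 16 skeleton steps, 32 video frames, 256-d features.
vpcm = VideoPoseCollab()
out = vpcm(torch.randn(8, 16, 256), torch.randn(8, 32, 256))
print(out.shape)  # torch.Size([8, 16, 256])
```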
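Likewise, the Top-K fusion in the TPCM can be illustrated with a small sketch: a pooled pose feature is matched against text (prompt) embeddings, only the k most similar prompts are kept, and their softmax-weighted mixture is injected back into the pose feature. The function name `topk_text_fusion`, the temperature value, and the residual injection are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def topk_text_fusion(pose_feat: torch.Tensor, text_feats: torch.Tensor,
                     k: int = 5, temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical Top-K text-pose fusion (TPCM-style sketch).

    pose_feat:  (B, D) pooled skeleton representation
    text_feats: (N, D) embeddings of N class/label prompts, N >= k
    """
    p = F.normalize(pose_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = p @ t.t()                                     # (B, N) cosine similarities
    top_val, top_idx = sim.topk(k, dim=-1)              # keep the k best-matching prompts
    weights = F.softmax(top_val / temperature, dim=-1)  # (B, k) fusion weights
    selected = t[top_idx]                               # (B, k, D) selected prompt embeddings
    text_ctx = (weights.unsqueeze(-1) * selected).sum(dim=1)  # (B, D) semantic context
    return pose_feat + text_ctx                         # residual semantic injection


# Toy usage: 4 samples, 32 label prompts, 256-d features.
fused = topk_text_fusion(torch.randn(4, 256), torch.randn(32, 256), k=5)
print(fused.shape)  # torch.Size([4, 256])
```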
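Finally, the two-stage strategy (pre-train unimodal encoders, then train only lightweight adapters and the collaborative modules) is commonly implemented by freezing the backbone and re-enabling a small parameter subset. The bottleneck `Adapter` design and the name-based parameter filter below are hypothetical; the abstract does not describe the paper's adapter internals.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the kind of lightweight module stage two
    might train while the pre-trained encoder stays frozen (assumption)."""
    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


def enter_stage_two(model: nn.Module) -> None:
    """Stage 2 setup: freeze everything, then re-enable adapter and
    collaborative-module parameters by name (naming is an assumption)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if "adapter" in name or "vpcm" in name or "tpcm" in name:
            p.requires_grad = True
```

Only the re-enabled parameters would then be passed to the optimizer, keeping the trainable footprint small relative to the frozen encoders.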