Zero-Shot Skeleton Action Recognition via Dual Discriminators and Spatiotemporal Self-Calibration

doi:10.19678/j.issn.1000-3428.0252717

Abstract

Zero-shot skeleton-based action recognition uses textual label descriptions together with skeleton action sequences to distinguish actions from seen and unseen categories. Existing methods are often limited by the low quality of the visual features they generate, which prevents accurate alignment with the semantics and leads to poor performance on similar actions. To address this problem, this paper proposes a method based on dual discriminators and spatiotemporal self-calibration (DD-STSC) to explore visual-semantic alignment. The method combines variational autoencoders with generative adversarial networks, training discriminators and generators adversarially to mine the differences among features, while disentanglement better separates useful information from useless information, further improving the quality of the generated samples. In addition, an action self-calibration module (ASCM) is introduced; by learning from the skeleton data along the spatial and temporal dimensions, it captures the key motion information more effectively and thereby improves classification accuracy. Experiments on the public datasets NTU60, NTU120, and PKU51 show that the proposed method outperforms existing mainstream methods.
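As a rough illustration of what visual-semantic alignment buys at inference time, the sketch below classifies a skeleton feature by picking the unseen label whose text embedding is most similar to it in a shared space. It is not the DD-STSC method itself: the function names (`cosine`, `zero_shot_predict`), the label names, and the three-dimensional vectors are all hypothetical, and real systems would obtain both embeddings from trained skeleton and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_predict(skeleton_embedding, label_embeddings):
    """Return the label whose text embedding best matches the skeleton embedding."""
    return max(
        label_embeddings,
        key=lambda name: cosine(skeleton_embedding, label_embeddings[name]),
    )


# Hypothetical text embeddings for three unseen action labels (toy values).
unseen_labels = {
    "drink water": [0.9, 0.1, 0.0],
    "brush teeth": [0.1, 0.9, 0.1],
    "wave hand": [0.0, 0.2, 0.9],
}

# Hypothetical skeleton feature already projected into the same space.
query = [0.2, 0.8, 0.2]
prediction = zero_shot_predict(query, unseen_labels)
print(prediction)
```

The better the generated or projected visual features line up with the label embeddings, the more reliably this nearest-label rule separates similar actions, which is the failure mode the abstract attributes to low-quality feature generation.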

WANG Zeyu, JI Genlin, ZHU Wei. Zero-Shot Skeleton Action Recognition via Dual Discriminators and Spatiotemporal Self-Calibration[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252717.

