Dual-Scale Text Prompt with Learnable Word Vectors for Hand Mesh Reconstruction

doi:10.19678/j.issn.1000-3428.0260301

Abstract

Hand mesh reconstruction from RGB images has attracted wide attention in recent years. Existing methods mainly improve reconstruction accuracy by stacking complex visual modules, which raises computational cost and makes real-time use difficult. To address this problem, this paper introduces natural language information during training, injecting high-level prior knowledge into the network to strengthen its visual feature representation. Because the text branch provides supervision only during training, it adds no parameters to the main network, so real-time performance is preserved. To enhance the visual representation more effectively, a dual-scale text generation module is proposed that describes hand features at both the global and the local level. The global text prompt models the overall hand pose from the bending degree of each finger, while the local text prompt describes local hand features from the spatial position of each joint. Contrastive learning then constrains the multi-scale text features and the image features to be consistent in a shared semantic space. Because the CLIP model is sensitive to how text is phrased, hand-crafted prompts usually require extensive tuning and still cannot be guaranteed to match the image features well. This paper therefore combines fixed text prompts with learnable word vectors: the fixed prompts summarize the main semantic information, and the learnable word vectors adaptively fine-tune the prompts so that the text descriptions better suit the hand mesh reconstruction task. Experimental results show that, compared with existing real-time methods, the proposed method achieves excellent reconstruction accuracy while remaining real-time. On the FreiHAND dataset, PA-MPJPE and PA-MPVPE reach 5.5 mm and 5.8 mm, respectively; on the DexYCB dataset, they reach 5.4 mm and 5.2 mm, respectively. The inference speed reaches 68 fps. Ablation studies show that the dual-scale text prompts play a key role in hand mesh reconstruction.
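The two training-time ideas in the abstract, combining a fixed prompt with learnable word vectors and aligning text and image features by contrastive learning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`build_prompt`, `symmetric_info_nce`), the choice to place the learnable vectors before the fixed tokens, the temperature of 0.07, and the symmetric CLIP-style form of the loss are all assumptions made for illustration. Random arrays stand in for the outputs of the image and text encoders.

```python
import numpy as np


def build_prompt(fixed_tokens, learnable_ctx):
    """Combine a fixed prompt with learnable word vectors.

    fixed_tokens:  (n_fixed, d) embeddings of the hand-written prompt (frozen).
    learnable_ctx: (n_ctx, d) free parameters an optimizer would update.
    Placing the learnable vectors first is an assumption of this sketch.
    """
    return np.concatenate([learnable_ctx, fixed_tokens], axis=0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    Row i of image_feats is assumed to match row i of text_feats; every
    other row in the batch acts as a negative.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16
    # Hypothetical sizes: 6 fixed prompt tokens, 4 learnable word vectors.
    prompt = build_prompt(rng.normal(size=(6, d)), rng.normal(size=(4, d)))
    print("prompt shape:", prompt.shape)

    # Stand-ins for encoder outputs: one image feature plus a global and a
    # local text feature per sample in a batch of 8.
    image = rng.normal(size=(8, d))
    global_text = rng.normal(size=(8, d))
    local_text = rng.normal(size=(8, d))
    total = symmetric_info_nce(image, global_text) + symmetric_info_nce(image, local_text)
    print("dual-scale contrastive loss:", float(total))
```

Summing one loss term for the global prompt and one for the local prompt is only one plausible way to combine the two scales. Because all of this would run only during training, it is consistent with the abstract's point that the text branch adds no inference cost.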

CAO Qi, LI Shaodong, LU Shuaiyan, ZHANG Zhehao, YANG Guokai. Dual-Scale Text Prompt with Learnable Word Vectors for Hand Mesh Reconstruction[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260301.

