
Computer Engineering


Dual-Scale Text Prompt with Learnable Word Vectors for Hand Mesh Reconstruction

  • Published: 2026-05-15

Abstract: In recent years, hand mesh reconstruction from RGB images has attracted extensive attention. Existing methods mainly improve reconstruction accuracy by stacking complex visual modules, which incurs high computational cost and makes real-time use difficult. To address this issue, this paper introduces natural language information during training, injecting high-level prior knowledge into the network to strengthen its visual feature representation. Because the text branch takes part only in training-time supervision, it adds no parameters to the main network, so real-time performance is preserved. To enhance the visual representation more effectively, a dual-scale text generation module is proposed that describes hand features at both the global and the local level. The global text prompt models the overall hand pose from the bending degree of each finger, while the local text prompt describes local hand features from the spatial positions of the individual joints. Contrastive learning is then used to enforce consistency between the multi-scale text features and the image features in a shared semantic space. Because the CLIP model is sensitive to the wording of its text input, hand-crafted prompts usually require extensive tuning and still cannot be guaranteed to match the image features well. This paper therefore combines fixed text prompts with learnable word vectors: the fixed prompts summarize the main semantic information, and the learnable word vectors adaptively fine-tune the prompts, making the text descriptions better suited to the hand mesh reconstruction task. Experimental results show that, compared with existing real-time methods, the proposed method achieves high reconstruction accuracy while remaining real-time. On the FreiHAND dataset, PA-MPJPE and PA-MPVPE reach 5.5 mm and 5.8 mm, respectively; on the DexYCB dataset, they reach 5.4 mm and 5.2 mm, respectively. The inference speed reaches 68 fps. Ablation studies further show that the dual-scale text prompts play a key role in hand mesh reconstruction.
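The abstract does not specify the form of the contrastive objective, so the following is only a minimal sketch of how image features might be aligned with global and local text features. It assumes a CLIP-style symmetric InfoNCE loss over a batch of paired features. The function names, the temperature of 0.07, and the weights `w_global` and `w_local` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (guarding against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single row of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (assumed form, not from the paper).

    image_feats[i] and text_feats[i] are treated as a positive pair;
    every other combination in the batch is treated as a negative.
    """
    if len(image_feats) != len(text_feats):
        raise ValueError("image and text batches must have the same size")
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    # Cosine similarity of every image/text pair, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should select its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should select its own row.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


def dual_scale_loss(image_feats, global_text_feats, local_text_feats,
                    w_global=1.0, w_local=1.0):
    """Weighted sum of global and local alignment terms (hypothetical weights)."""
    return (
        w_global * symmetric_contrastive_loss(image_feats, global_text_feats)
        + w_local * symmetric_contrastive_loss(image_feats, local_text_feats)
    )
```

In this sketch the loss is close to zero when each image feature points in the same direction as its paired text feature, and it grows when the pairing is shuffled. Since the text branch is described as training-only, a term like this would be dropped at inference time.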