[1] 刘健, 尤晨晨, 曹金明, 等. 人手抓取物体的三维数据集的建立及应用[J]. 计算机应用, 2024, 44(1): 278-284.
[2] 马六. 基于OpenCV的手势与遥感图像交互处理系统[J]. 物联网技术, 2025, 15(23): 65-68. DOI: 10.16667/j.issn.2095-1302.2025.23.014.
[3] 陈征, 李晋江. 基于多尺度特征融合的双分支手部姿态估计算法[J]. 计算机工程与设计, 2024, 45(10): 3059-3065. DOI: 10.16208/j.issn1000-7024.2024.10.023.
[4] RAUTARAY S, AGRAWAL A. Vision based hand gesture recognition for human computer interaction: a survey[J]. Artificial intelligence review, 2015, 43(1): 1-54.
[5] LIANG X, ANGELOPOULOU A, KAPETANIOS E, et al. A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among british sign language users[C]//European Conference on Computer Vision, 2020: 278-293.
[6] ROMERO J, TZIONAS D, BLACK M J. Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics, 2017, 36(6): 1-17.
[7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[8] KIPF T. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint arXiv: 1609.02907, 2016.
[9] RADFORD A, KIM J, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//International conference on machine learning, 2021: 8748-8763.
[10] XU J, DE MELLO S, LIU S, et al. Groupvit: Semantic segmentation emerges from text supervision[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022: 18134-18144.
[11] HE W, JAMONNAK S, GOU L, et al. Clip-s4: Language-guided self-supervised semantic segmentation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023: 11207-11216.
[12] ZHONG Y, YANG J, ZHANG P, et al. Regionclip: Region-based language-image pretraining[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022: 16793-16803.
[13] XU H, GHOSH G, HUANG P, et al. Videoclip: Contrastive pre-training for zero-shot video-text understanding[J]. arXiv preprint arXiv: 2109.14084, 2021.
[14] LUO H, JI L, ZHONG M, et al. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning[J]. Neurocomputing, 2022, 508: 293-304.
[15] WANG M, XING J, LIU Y. Actionclip: A new paradigm for video action recognition[J]. arXiv preprint arXiv: 2109.08472, 2021.
[16] CHEN G, YAO W, SONG X, LI X, RAO Y, ZHANG K. Prompt learning with optimal transport for vision-language models[C]//International Conference on Learning Representations, 2023.
[17] ZHOU K, YANG J, LOY C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130(9): 2337-2348.
[18] ZHOU K, YANG J, LOY C, et al. Conditional prompt learning for vision-language models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022: 16816-16825.
[19] ZIMMERMANN C, CEYLAN D, YANG J, et al. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images[C]//Proceedings of the IEEE/CVF international conference on computer vision, 2019: 813-822.
[20] CHAO Y, YANG W, XIANG Y, et al. DexYCB: A benchmark for capturing hand grasping of objects[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021: 9044-9053.
[21] BOUKHAYMA A, BEM R, TORR P. 3d hand shape and pose from images in the wild[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019: 10843-10852.
[22] ZHANG X, LI Q, MO H, et al. End-to-end hand mesh recovery from a monocular rgb image[C]//Proceedings of the IEEE/CVF international conference on computer vision, 2019: 2354-2364.
[23] LIN K, WANG L, LIU Z. End-to-end human pose and mesh reconstruction with transformers[ C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021: 1954-1963.
[24] JIANG C, XIAO Y, WU C, et al. A2j-transformer: Anchor-to-joint transformer network for 3d interacting hand pose estimation from a single rgb image[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 8846-8855.
[25] GE L, REN Z, LI Y, et al. 3d hand shape and pose estimation from a single rgb image[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019: 10833-10842.
[26] LIN K, WANG L, LIU Z. Mesh graphormer[C]//Proceedings of the IEEE/CVF international conference on computer vision, 2021: 12939-12948.
[27] KIM J, GWON M, PARK H, et al. Sampling is matter: Point-guided 3d human mesh reconstruction[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2023: 12880-12889.
[28] VASU P K A, GABRIEL J, ZHU J, TUZEL O, RANJAN A. FastViT: a fast hybrid vision transformer using structural reparameterization[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 5785-5795.
[29] CHEN X, LIU Y, MA C, et al. Camera-space hand mesh recovery via semantic aggregation and adaptive 2d-1d registration[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021: 13274-13283.
[30] CHEN X, LIU Y, DONG Y, et al. MobRecon: mobile-friendly hand mesh reconstruction from monocular image[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 20544-20554.
[31] CHO J, YOUWANG K, OH T H. FastMETRO: cross-attention of disentangled modalities for 3D human mesh recovery with transformers[C]//European Conference on Computer Vision, 2022: 342-359.
[32] ZHOU Z, ZHOU S, LV Z, et al. A simple baseline for efficient hand mesh reconstruction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 1367-1376.
[33] JIAO Z, WANG X, XIA Z, et al. HandS3C: 3D hand mesh reconstruction with state space spatial channel attention from RGB images[C]//2025 IEEE International Conference on Acoustics, Speech and Signal Processing. Hyderabad, India: IEEE, 2025: 1-5.
[34] AN S, DAI S, ANSARI M, et al. ReJSHand: efficient real-time hand pose estimation and mesh reconstruction using refined joint and skeleton features[J]. arXiv preprint arXiv:2503.05995, 2025.
[35] LEE S, PARK H, KIM D, et al. Image-free domain generalization via clip for 3d hand pose estimation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023: 2934-2944.
[36] GUO S, CAI Q, QI L, et al. CLIP-Hand3D: exploiting 3D hand pose estimation via context-aware prompting[C]//Proceedings of the 31st ACM International Conference on Multimedia, 2023: 4896-4907.
[37] PARK J, KONG K, KANG S. AttentionHand: text-driven controllable hand image generation for 3D hand reconstruction in the wild[C]//European Conference on Computer Vision, 2024: 329-345.
[38] CHA J, KIM J, YOON J S, et al. Text2HOI: text-guided 3D motion generation for hand-object interaction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: Computer Vision Foundation/IEEE, 2024: 1577-1585.
[39] CHRISTEN S, HAMPALI S, SENER F, et al. DiffH2O: diffusion-based synthesis of hand-object interactions from textual descriptions[C]//SIGGRAPH Asia 2024 Conference Papers, 2024: 1-11.
[40] ZHANG W, HUANG M, ZHOU Y, et al. BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 2393-2404.
[41] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv: 2010.11929, 2020.
[42] DEVLIN J, CHANG M, LEE K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019: 4171-4186.
[43] OORD A, LI Y, VINYALS O. Representation learning with contrastive predictive coding[J]. arXiv preprint arXiv: 1807.03748, 2018.
[44] PARK J, OH Y, MOON G, et al. HandOccNet: occlusion-robust 3D hand mesh estimation network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 1496-1505.
[45] XU H, WANG T, TANG X, et al. H2ONet: hand-occlusion-and-orientation-aware network for real-time 3D hand mesh reconstruction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 17048-17058.
[46] LIN Z, DING C, YAO H, et al. Harmonious feature learning for interactive hand-object pose estimation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023: 12989-12998.
[47] WANG S, WANG S, YANG D, et al. HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images[C]//2023 IEEE International Conference on Multimedia and Expo. Brisbane, Australia: IEEE, 2023: 2495-2500.
[48] WANG Y, XU H, HENG P A, et al. UniHOPE: a unified approach for hand-only and hand-object pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA: Computer Vision Foundation/IEEE, 2025: 12231-12241.
|