Sign Language Recognition Based on Keyframe and Attention Residual Network

Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 224-230, 242. doi: 10.19678/j.issn.1000-3428.0066523

Qunpo LIU¹,², Yueqin SHENG¹,²,*, Ruxin GAO¹,², Xuhui BU¹,²

1. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003, Henan, China
2. Henan International Joint Laboratory of Direct Drive and Control of Intelligent Equipment, Jiaozuo 454003, Henan, China

Received: 2022-12-14 Published: 2023-12-15 Online: 2023-03-10
Corresponding author: Yueqin SHENG
About the authors:
Qunpo LIU (born 1978), male, associate professor; main research interests: intelligent robots and machine vision
Ruxin GAO, associate professor, Ph.D.
Xuhui BU, professor, Ph.D.
Funding:
National Natural Science Foundation of China (62273133); Program for Science and Technology Innovation Teams in Universities of Henan Province (20IRTSTHN019); Key Science and Technology Research Project of Henan Province (212102210508)

Abstract

Sign language recognition is important for improving the quality of life of deaf-mute people and for advancing human-computer interaction. Three problems limit recognition accuracy: sign language videos contain many irrelevant frames, hand details are insufficiently extracted during recognition, and the position and timing of signing movements are difficult to localize precisely. To address these problems, this study proposes a sign language recognition method based on keyframes and an interactive attention residual network. In data preprocessing, a keyframe extraction algorithm based on image similarity and blur degree determines the final keyframes from the large set of candidate keyframes obtained with the Farneback optical flow method, which reduces irrelevant and redundant information. In the network, 3D-ResNet serves as the base framework. A small convolution module replaces the first convolution layer of the original 3D-ResNet to strengthen the extraction of fine-grained hand features from sign language videos. The shortcut branch of the residual structure downsamples by pooling followed by convolution, which reduces distortion of the feature map. An interactive quadruplet attention module that fuses channel attention and spatial attention strengthens the extraction of key features in the target region. Experiments on the CSL and DEVISIGN datasets show that the method reaches accuracies of 92.0% and 92.2%, respectively, higher than those of the other sign language recognition methods compared.

Key words: sign language recognition, keyframe, residual network, spatial attention, channel attention
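The keyframe step in the abstract (filter the candidate frames by image similarity and blur degree) can be illustrated with a minimal, dependency-free sketch. It is not the authors' algorithm: the similarity measure (mean absolute pixel difference), the blur measure (variance of a 4-neighbour Laplacian), the grouping rule, and the names `laplacian_variance`, `similarity`, `select_keyframes`, and `sim_threshold` are all assumptions chosen for illustration. The candidate frames are assumed to have been produced already, for example by thresholding Farneback optical-flow magnitude, and are represented as small grayscale grids of 8-bit values.

```python
from statistics import pvariance


def laplacian_variance(img):
    """Sharpness score: variance of the 4-neighbour Laplacian (higher = sharper).

    Assumes a grayscale frame of at least 3x3 pixels given as a list of rows.
    """
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def similarity(a, b):
    """Similarity in [0, 1] derived from the mean absolute difference of 8-bit pixels."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - total / (255.0 * len(a) * len(a[0]))


def select_keyframes(candidates, sim_threshold=0.9):
    """Group consecutive near-duplicate candidates and keep the sharpest of each group.

    Hypothetical rule: a frame joins the current group while it stays similar
    to the group's first frame; otherwise the group is closed and a new one starts.
    Returns the indices of the kept frames.
    """
    def sharpest(group):
        return max(group, key=lambda i: laplacian_variance(candidates[i]))

    keep, group = [], []
    for idx, frame in enumerate(candidates):
        if group and similarity(candidates[group[0]], frame) < sim_threshold:
            keep.append(sharpest(group))
            group = []
        group.append(idx)
    if group:
        keep.append(sharpest(group))
    return keep


# Toy 4x4 frames: a blurry and a sharp view of one pose, then a different pose.
blurry_a = [[5] * 4 for _ in range(4)]
sharp_a = [[0, 0, 0, 0], [0, 40, 0, 0], [0, 0, 40, 0], [0, 0, 0, 0]]
sharp_b = [[200] * 4 for _ in range(4)]
sharp_b[1][1] = 255

print(select_keyframes([blurry_a, sharp_a, sharp_b]))  # [1, 2]
```

The first two frames are near-duplicates, so only the sharper one (index 1) survives; the third frame differs enough to start its own group. With real video, array-based image libraries would replace the nested lists, and the threshold would need tuning per dataset.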

Qunpo LIU, Yueqin SHENG, Ruxin GAO, Xuhui BU. Sign Language Recognition Based on Keyframe and Attention Residual Network[J]. Computer Engineering, 2023, 49(12): 224-230, 242.

http://www.ecice06.com/CN/Y2023/V49/I12/224

Figures/Tables 10

Fig.1 Overall framework of the sign language recognition method based on keyframe and attention residual network

Fig.2 Procedure of keyframe extraction

Fig.3 Small convolution module

Fig.4 Residual connection modes

Fig.5 Quadruplet attention module

Fig.6 Experimental results on the CSL dataset

Fig.7 Experimental results on the DEVISIGN dataset
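The residual connection modes of Fig.4 concern how the shortcut branch downsamples. The abstract states that pooling followed by convolution reduces feature-map distortion; the one-dimensional toy below sketches one plausible reason. It assumes average pooling with window 2 and a pointwise (kernel size 1) convolution reduced to a single scalar `weight`; the actual network is 3D, and its pooling type and kernel sizes are not given on this page. The function names are hypothetical.

```python
def strided_shortcut(x, weight=1.0):
    """Stride-2 pointwise convolution: only even-indexed samples are ever read."""
    return [weight * x[i] for i in range(0, len(x) - 1, 2)]


def pooled_shortcut(x, weight=1.0):
    """Average-pool (window 2, stride 2), then a stride-1 pointwise convolution."""
    pooled = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return [weight * v for v in pooled]


# A feature row whose strongest activation sits at an odd index.
signal = [0.0, 8.0, 0.0, 0.0, 2.0, 4.0]

print(strided_shortcut(signal))  # [0.0, 0.0, 2.0]
print(pooled_shortcut(signal))   # [4.0, 0.0, 3.0]
```

The strided version never reads the odd-indexed samples, so the activation of 8.0 vanishes from the shortcut output, while the pooled version lets every input sample contribute. In three dimensions a stride-2 pointwise convolution reads only one position in eight, so the effect is stronger.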

