
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 224-230, 242. doi: 10.19678/j.issn.1000-3428.0066523

• Graphics and Image Processing •

Sign Language Recognition Based on Keyframes and an Attention Residual Network

Qunpo LIU1,2, Yueqin SHENG1,2,*, Ruxin GAO1,2, Xuhui BU1,2

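  1. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003, Henan, China
    2. Henan International Joint Laboratory of Direct Drive and Control of Intelligent Equipment, Jiaozuo 454003, Henan, China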
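  • Received: 2022-12-14 Published: 2023-12-15 Online: 2023-03-10
  • Corresponding author: Yueqin SHENG
  • About the authors:

    Qunpo LIU (b. 1978), male, associate professor; main research interests: intelligent robots and machine vision

    Ruxin GAO, associate professor, Ph.D.

    Xuhui BU, professor, Ph.D.

  • Funding:
    National Natural Science Foundation of China (62273133); Henan Province University Science and Technology Innovation Team Project (20IRTSTHN019); Henan Province Science and Technology Research Project (212102210508)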

Sign Language Recognition Based on Keyframe and Attention Residual Network

Qunpo LIU1,2, Yueqin SHENG1,2,*, Ruxin GAO1,2, Xuhui BU1,2   

  1. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003, Henan, China
    2. International Joint Laboratory of Direct Drive and Control of Intelligent Equipment, Jiaozuo 454003, Henan, China
  • Received: 2022-12-14 Online: 2023-03-10 Published: 2023-12-15
  • Contact: Yueqin SHENG

Abstract (translated from Chinese):

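Research on sign language recognition is important for improving the quality of life of deaf-mute people and also advances the field of human-computer interaction. Sign language videos contain many irrelevant frames, hand details are insufficiently extracted during recognition, and the position and timing of sign language movements are hard to locate precisely, all of which lower the recognition rate. To address these problems, a sign language recognition method based on keyframes and an interactive attention residual network is proposed. In data preprocessing, a keyframe extraction algorithm based on image similarity and degree of blur is designed to select the final keyframes from the large number of candidate keyframes obtained with the Farneback optical flow method, reducing irrelevant and redundant information. In the network, 3D-ResNet serves as the base framework: a small convolution module is built to strengthen the extraction of fine-grained features from sign language videos, a residual structure that uses pooling-convolution downsampling in the shortcut branch is designed to reduce feature map distortion, and an interactive quadruplet attention module that fuses channel attention and spatial attention is established to strengthen the extraction of key features in target regions. Experimental results show that the method achieves accuracies of 92.0% and 92.2% on the CSL and DEVISIGN datasets, respectively, outperforming other sign language recognition methods.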
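The keyframe filtering stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: the abstract does not specify the similarity measure, the blur measure, or any thresholds, so the mean-absolute-difference similarity, the Laplacian-variance sharpness score, and the names `select_keyframes`, `sim_thresh`, and `blur_thresh` are all assumptions made for illustration. The candidate frames are assumed to be already chosen (the abstract uses the Farneback optical flow method for that stage) and to be 8-bit grayscale arrays.

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian (higher = sharper)."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def similarity(a, b):
    """Similarity in [0, 1] from the mean absolute difference of 8-bit frames."""
    diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
    return 1.0 - float(diff.mean()) / 255.0

def select_keyframes(candidates, sim_thresh=0.95, blur_thresh=50.0):
    """Return indices of candidates that are sharp enough and are not
    near-duplicates of the most recently kept frame (assumed criteria)."""
    kept = []
    for i, frame in enumerate(candidates):
        if laplacian_variance(frame) < blur_thresh:
            continue  # assumed rule: drop frames that look too blurry
        if kept and similarity(candidates[kept[-1]], frame) > sim_thresh:
            continue  # assumed rule: drop near-duplicates of the last keyframe
        kept.append(i)
    return kept
```

Both thresholds would need tuning on real sign language video; the values here are placeholders, not figures from the paper.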

Keywords: sign language recognition, keyframe, residual network, spatial attention, channel attention

Abstract:

Sign language recognition is important for improving the quality of life of deaf-mute people and for advancing human-computer interaction. Sign language videos typically contain numerous irrelevant frames, hand details are insufficiently extracted during recognition, and the position and timing of sign language movements are difficult to locate accurately, which together lower the recognition rate. To address these problems, this study proposes a sign language recognition method based on keyframes and an interactive attention residual network. In the data preprocessing stage, a keyframe extraction algorithm based on image similarity and blur degree determines the final keyframes from the many candidate keyframes obtained using the Farneback optical flow method, which reduces irrelevant and redundant information. In the network, which is built on the 3D-ResNet framework, a small convolution module replaces the first convolution layer of the original 3D-ResNet to enhance the network's ability to extract fine-grained hand features. The shortcut branch of the residual structure uses pooling-convolution downsampling to reduce feature map distortion. An interactive quadruplet attention module that integrates channel and spatial attention is designed to extract more effective feature information. Experiments on the CSL and DEVISIGN datasets show that the method achieves accuracies of 92.0% and 92.2%, respectively, which are higher than those of other sign language recognition methods.
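To see why pooling before the convolution in the shortcut branch can reduce feature map distortion, the following NumPy sketch compares two ways of halving a (channels, time, height, width) feature map. It is an illustration under assumptions, not the paper's implementation: the abstract does not state the pooling type or kernel size, so 2x2x2 average pooling is assumed, batch normalization is omitted, and the function names are hypothetical.

```python
import numpy as np

def avg_pool3d_2x(x):
    """2x2x2 average pooling with stride 2 on a (C, T, H, W) array.
    Assumes T, H, and W are even."""
    c, t, h, w = x.shape
    return x.reshape(c, t // 2, 2, h // 2, 2, w // 2, 2).mean(axis=(2, 4, 6))

def pointwise_conv(x, weight):
    """1x1x1 convolution: mixes channels only. weight has shape (C_out, C_in)."""
    return np.einsum("oc,cthw->othw", weight, x)

def shortcut_strided(x, weight):
    """Baseline shortcut: 1x1x1 convolution with stride 2.
    Only one voxel in every 2x2x2 block reaches the output."""
    return pointwise_conv(x[:, ::2, ::2, ::2], weight)

def shortcut_pool_conv(x, weight):
    """Pool-then-convolve shortcut: every input voxel contributes
    to the output through the average."""
    return pointwise_conv(avg_pool3d_2x(x), weight)

# A single activation at an odd index is invisible to the strided shortcut
# but still influences the pooled shortcut.
x = np.zeros((2, 4, 4, 4))
x[0, 1, 1, 1] = 8.0
w = np.ones((3, 2))
strided_out = shortcut_strided(x, w)
pooled_out = shortcut_pool_conv(x, w)
```

Both variants produce the same output shape, so either can be added to the main branch of a residual block; they differ only in how much of the input each output voxel summarizes.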

Key words: sign language recognition, keyframe, residual network, spatial attention, channel attention