
Computer Engineering


Handwritten Mathematical Expression Recognition with Deformable Dilated Convolution and Triplet Attention

  • Published: 2026-05-15

Abstract: Handwritten mathematical expression recognition is an important computer vision task with applications in intelligent education, industry, and related fields. Existing encoder-decoder models typically rely on standard convolution and conventional attention mechanisms for feature extraction. However, the fixed-grid sampling of standard convolution cannot adapt to the geometric deformations of handwritten symbols, which leads to a high error rate on visually similar characters. In addition, conventional attention mechanisms interact along a single dimension, which limits their ability to capture long-range structural dependencies in complex expressions. To address these issues, this paper proposes DDTAFF, an encoder-decoder model that integrates deformable dilated convolution and triplet attention feature fusion. Deformable dilated convolution incorporates learnable dilation rates into both the offset learning and the customized convolution layer of deformable convolution, enabling more accurate offset prediction and adaptive expansion of the receptive field. Triplet attention feature fusion adopts a similarity-guided dynamic fusion strategy to strengthen cross-dimensional interaction. In the encoder, deformable dilated convolution enlarges the receptive field to capture multi-scale features and broader contextual information, while triplet attention feature fusion integrates features from different levels to improve the extraction of key features. The decoder adopts a Transformer-based structure to strengthen long-range dependency modeling.
On the CROHME 2014, CROHME 2016, CROHME 2019, and HME100K datasets, the proposed model achieves recognition accuracies of 59.34%, 59.77%, 59.63%, and 68.94%, respectively, which are 2.34, 3.71, 4.75, and 1.63 percentage points higher than the baseline model. These results demonstrate the effectiveness of the proposed method and its advantage over the baseline.
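The sampling geometry behind deformable dilated convolution can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the kernel is fixed at 3×3, and the dilation rate and offsets are passed in as plain numbers rather than learned by the network. The sketch only shows how a (possibly fractional) dilation rate and per-point offsets move the sampling points away from the fixed grid, with bilinear interpolation used to read values at non-integer positions.

```python
import math

def sampling_positions(center, dilation, offsets):
    """Return the 9 sampling points of a 3x3 kernel centred at `center`.

    Each point is: center + dilation * grid_position + offset.
    In a trained model the dilation and offsets would be predicted;
    here they are plain inputs (an assumption of this sketch).
    """
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return [(cy + dilation * dy + oy, cx + dilation * dx + ox)
            for (dy, dx), (oy, ox) in zip(grid, offsets)]

def bilinear(image, y, x):
    """Bilinearly interpolate `image` (a list of rows) at fractional (y, x).

    Pixels outside the image contribute zero.
    """
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * image[yi][xi]
    return value

def deformable_dilated_conv_at(image, weights, center, dilation, offsets):
    """Compute one output value: weighted sum over the deformed sampling points."""
    points = sampling_positions(center, dilation, offsets)
    return sum(w * bilinear(image, y, x) for w, (y, x) in zip(weights, points))

# Toy input: a 7x7 linear ramp, image[y][x] = 7 * y + x.
image = [[7 * y + x for x in range(7)] for y in range(7)]
weights = [1 / 9] * 9                 # simple averaging kernel
no_offsets = [(0.0, 0.0)] * 9         # fixed-grid sampling
shifted = [(0.0, 0.5)] * 9            # every point moved half a pixel to the right

print(deformable_dilated_conv_at(image, weights, (3, 3), 1.5, no_offsets))
print(deformable_dilated_conv_at(image, weights, (3, 3), 1.5, shifted))
```

Because the toy image is a linear ramp, an averaging kernel with symmetric sampling points returns the value at the kernel centre regardless of the dilation rate, and a uniform horizontal offset of 0.5 raises the result by 0.5. That makes the effect of the offsets easy to check by hand before swapping in real image data.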