
Computer Engineering ›› 2025, Vol. 51 ›› Issue (12): 96-108. doi: 10.19678/j.issn.1000-3428.0069871

• Artificial Intelligence and Pattern Recognition •

Transformer Tracker Based on Dual-Dimensional Feature Enhancement

YUAN Yinghua, JIN Yingran, GAO Yun*

  1. School of Information Science and Engineering, Yunnan University, Kunming 650504, Yunnan, China
  • Received: 2024-05-20  Revised: 2024-07-04  Online: 2025-12-15  Published: 2024-09-05
  • Corresponding author: GAO Yun
  • Supported by:
    National Natural Science Foundation of China (61802337)

Abstract:

The Siamese tracking network is one of the mainstream object tracking frameworks. It comprises three modules: a backbone network, a fusion network, and a localization network. For the fusion module, the Transformer is a relatively new and effective implementation, whose encoder and decoder use a self-attention mechanism to enhance Convolutional Neural Network (CNN) features. However, self-attention enhances features only along the spatial dimension and does not consider enhancement along the channel dimension. To enable the Transformer's self-attention network to enhance features along both the spatial and channel dimensions, and thereby provide accurate correlation information to the target localization network, a Transformer tracker based on dual-dimensional feature enhancement is proposed, improving the Transformer fusion network. First, the third- and fourth-stage features of the backbone network are taken as input. Then, within the self-attention modules of the Transformer encoder and decoder, the CAE-Net network performs channel-dimension feature enhancement to emphasize important channels, and the SAE-Net network performs weighted fusion and linear transformation of the two-stage features to obtain the self-attention factors Q, K, and V. Finally, a self-attention operation performs spatial-dimension feature enhancement. Experiments on five mainstream public benchmark datasets show that the improved Transformer feature fusion module raises the tracker's performance at a negligible cost in speed.

Key words: object tracking, Siamese tracking network, attention mechanism, Transformer, dual-dimensional feature enhancement
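The pipeline the abstract describes (channel-dimension enhancement followed by spatial self-attention over the feature map) can be sketched roughly as follows. The internals of CAE-Net and SAE-Net are not specified on this page, so the channel gate below is a generic squeeze-and-excitation-style stand-in, and the Q, K, V projections are taken as identity; both are assumptions for illustration only, not the paper's actual networks.

```python
import math

def channel_enhance(x):
    """Channel-dimension enhancement (stand-in for CAE-Net, assumed
    SE-style): global average pooling per channel, a sigmoid gate,
    then rescaling each channel map by its gate.
    x: list of C channel maps, each an HxW list of rows of floats."""
    gates = []
    for ch in x:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gates.append(1.0 / (1.0 + math.exp(-mean)))  # sigmoid gate
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]

def spatial_self_attention(x):
    """Spatial-dimension enhancement via plain self-attention over the
    HxW positions. Each position's token is its vector of channel values;
    Q = K = V = X here (identity projections, for the sketch only)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # tokens: one feature vector of length C per spatial position
    tokens = [[x[c][i][j] for c in range(C)] for i in range(H) for j in range(W)]
    scale = math.sqrt(C)
    out_tokens = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        m = max(scores)                     # stabilized softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out_tokens.append([sum(w * v[c] for w, v in zip(weights, tokens))
                           for c in range(C)])
    # reshape the attended tokens back to C x H x W
    return [[[out_tokens[i * W + j][c] for j in range(W)] for i in range(H)]
            for c in range(C)]

# toy 2-channel 2x2 feature map
feat = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [0.5, 0.5]]]
enhanced = spatial_self_attention(channel_enhance(feat))
print(len(enhanced), len(enhanced[0]), len(enhanced[0][0]))  # 2 2 2
```

The two stages compose in the order the abstract gives: channel gating first emphasizes informative channels, and the subsequent self-attention then mixes information across spatial positions, so the output is enhanced along both dimensions.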