Computer Engineering ›› 2026, Vol. 52 ›› Issue (1): 228-241. doi: 10.19678/j.issn.1000-3428.0069368

• Computer Vision and Graphics/Image Processing •

Lightweight Video Object Tracking Algorithm Based on a Location and Semantic Separation Attention Mechanism

WANG Jun1, LI Kunlun1,*, ZHANG Yifei1, ZHU Qizhen1, LIU Lei2, WANG Shuai2

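  1. College of Electronic and Information Engineering, Hebei University, Baoding 071000, Hebei, China
  2. School of Computer Science and Technology, Huaibei Normal University, Huaibei 235000, Anhui, China
  • Received: 2024-02-06 Revised: 2024-06-26 Published: 2026-01-15 Released: 2024-10-14
  • Corresponding author: LI Kunlun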
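  • About the authors:

    WANG Jun (CCF professional member), male, associate professor, Ph.D.; main research interests: artificial intelligence, computer vision, and video object tracking

    LI Kunlun (corresponding author), professor

    ZHANG Yifei, bachelor's degree

    ZHU Qizhen, master's student

    LIU Lei, lecturer

    WANG Shuai, associate professor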

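  • Supported by:
    Natural Science Foundation of Hebei Province (F2022201013); Natural Science Foundation of Hebei Province (F2022201055); Hebei University Talent Introduction Start-up Fund Project (521100221003); Natural Science Foundation of Anhui Higher Education Institutions (KJ2021A0528); Natural Science Foundation of Anhui Higher Education Institutions (KJ2020A1202)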

Location and Semantic Separation Attention Based Lightweight Visual Object Tracking Algorithm

WANG Jun1, LI Kunlun1,*, ZHANG Yifei1, ZHU Qizhen1, LIU Lei2, WANG Shuai2

  1. College of Electronic and Information Engineering, Hebei University, Baoding 071000, Hebei, China
  2. School of Computer Science and Technology, Huaibei Normal University, Huaibei 235000, Anhui, China
  • Received: 2024-02-06 Revised: 2024-06-26 Published: 2026-01-15 Released: 2024-10-14
  • Contact: LI Kunlun

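Abstract:

As large deep models continue to develop, the backbone networks of Siamese-network-based video object tracking algorithms keep getting deeper and their parameter counts keep rising. Training time and cost grow many times over as a result, which makes it difficult to deploy such models on edge devices. To improve the ability of lightweight small models to extract target location and semantic information, this paper proposes a lightweight video object tracking algorithm based on a location and semantic separation attention mechanism. First, the normalization-based attention mechanism is improved and combined with convolutions along the horizontal and vertical directions to build a location attention, which is embedded into the shallow features of the backbone network to extract target location information. Then, channel-direction normalization-based attention is combined with squeeze-and-excitation network (SENet) attention and fused with the deep features of the backbone network to extract target semantic information. Unlike previous attention mechanisms, the method separates location attention from semantic attention by exploiting the fact that shallow features favor the extraction of spatial information while deep features favor the extraction of semantic features, improving the algorithm's ability to extract target location and semantic information without noticeably increasing the number of network parameters. Experimental results on a general-purpose video object tracking dataset show that the proposed algorithm improves the precision and success rate of tracking algorithms based on lightweight Siamese networks.

Keywords: video object tracking, Siamese network, attention mechanism, semantic attention, location attention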

Abstract:

With the continuous development of large deep models, the backbone networks of Siamese-based visual object trackers are becoming deeper and the number of parameters is increasing. This multiplies the model training time and cost and makes deployment on edge devices challenging. This paper focuses on improving the ability of lightweight models to extract target location and semantic information and proposes a lightweight visual object tracking algorithm based on a location and semantic separation attention mechanism. First, the normalized attention mechanism is improved and combined with horizontal and vertical convolutions to construct the location attention, which is embedded into the shallow features of the backbone network to extract target location information. Subsequently, channel-direction normalized attention is combined with squeeze-and-excitation network (SENet) attention and fused with the deep features of the backbone network to extract target semantic information. In contrast to previous attention mechanisms, this study separates location attention from semantic attention by exploiting two properties of the network: shallow features are conducive to spatial information extraction, and deep features are conducive to semantic feature extraction. This improves the algorithm's ability to extract target location and semantic information without significantly increasing the number of parameters. Experimental results on a general tracking dataset demonstrate that the proposed algorithm can improve the precision and success rate of tracking algorithms based on lightweight Siamese networks.

Key words: visual object tracking, Siamese network, attention mechanism, semantic attention, location attention
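
The semantic branch described in the abstract joins channel-direction normalized attention with SENet attention on deep features. The abstract does not say how the two are fused, so the following is only a minimal sketch under stated assumptions: the two gates are applied one after the other, per-channel statistics over spatial positions stand in for batch normalization statistics, and biases in the SE layers are omitted. All function names, weight shapes, and example values are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def nam_channel_attention(feat, gamma, beta, eps=1e-5):
    """Normalization-based channel attention on a [C][H][W] nested list.

    Each channel is normalized, scaled by its factor gamma and shifted by beta.
    The gate then weights the result by |gamma| / sum(|gamma|), so channels
    with larger scale factors influence the gate more strongly.
    Assumption: statistics are taken per channel over spatial positions.
    """
    total = sum(abs(g) for g in gamma)  # assumed non-zero
    out = []
    for ch, g, b in zip(feat, gamma, beta):
        vals = [v for row in ch for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        std = math.sqrt(var + eps)
        weight = abs(g) / total
        out.append([[v * sigmoid(weight * (g * (v - mean) / std + b))
                     for v in row] for row in ch])
    return out


def se_attention(feat, w1, w2):
    """Squeeze-and-excitation gate: pool, FC, ReLU, FC, sigmoid, rescale."""
    # Squeeze: one global average per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: bottleneck layer w1, then expansion layer w2.
    hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
    gate = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Rescale every channel by its gate value.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gate)]


def semantic_attention(feat, gamma, beta, w1, w2):
    """Assumed fusion: normalization-based gate first, then the SE gate."""
    return se_attention(nam_channel_attention(feat, gamma, beta), w1, w2)


# Hypothetical deep feature map with 2 channels of size 2x2.
deep_feat = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.5], [0.0, 2.0]]]
out = semantic_attention(deep_feat, gamma=[1.0, 3.0], beta=[0.0, 0.0],
                         w1=[[0.5, -0.5]], w2=[[1.0], [-1.0]])
print(len(out), len(out[0]), len(out[0][0]))
```

Both gates only rescale existing activations, so the output keeps the input shape, and the learned quantities are limited to the normalization parameters and the two small SE weight matrices. This is in line with the abstract's claim that attention is added without significantly increasing the parameter count.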