Computer Engineering ›› 2024, Vol. 50 ›› Issue (11): 107-118. doi: 10.19678/j.issn.1000-3428.0068772

• Artificial Intelligence and Pattern Recognition •

Long-term UAV Vision Tracking with Time Domain Siamese Network Fusion Transformer

SHEN Haiyun* (谌海云), YU Peng (余鹏), WANG Haichuan (王海川)

  1. School of Electrical Information, Southwest Petroleum University, Chengdu 610500, Sichuan, China
  • Received: 2023-11-06 Online: 2024-11-15 Published: 2024-11-01
  • Contact: SHEN Haiyun
  • Funding:
    Platform Construction (Phase II) Project of the Nanchong Key Laboratory of Smart Grid and Intelligent Control (SXHZ053)

Abstract:

Unmanned Aerial Vehicles (UAVs) performing tracking tasks often encounter scale changes, low resolution, and target occlusion, which cause the tracking bounding box to drift. To address this problem, this study proposes TTTrack, a long-term UAV visual tracking algorithm that fuses a time-domain Siamese network with a Transformer. First, the Siamese-network-based SiamFC++ (AlexNet) algorithm is used as the baseline. Second, a Transformer adaptively extracts spatio-temporal information from historical frames and updates the template online, so that the spatio-temporal context is stored as a dynamic template. Third, the base template and the dynamic template are each cross-correlated with the search feature map to obtain two response maps. Finally, a Transformer fuses the two response maps to establish a spatio-temporal context mapping between consecutive frames. On the LaSOT long-sequence tracking benchmark, TTTrack achieves a success rate of 63.9% and a precision of 66.6%; on the UAV123 tracking benchmark, it achieves 61.4% and 80.2%, respectively. Compared with the baseline, its success rate and precision in full-occlusion scenes improve by 7.4 and 8.0 percentage points, respectively. TTTrack reaches a precision of 82.1% on the DTB70 tracking benchmark and runs at a tracking speed of 118 frames/s, which satisfies real-time requirements. These results show that TTTrack offers good robustness, real-time performance, and resistance to interference, and can effectively handle long-term UAV tracking tasks.

Key words: time-domain Siamese network, Transformer model, Unmanned Aerial Vehicle (UAV), visual tracking, spatio-temporal information
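
The dual-template step described in the abstract (correlating a fixed template and an online-updated template with the same search feature map, then fusing the two response maps) can be illustrated with a minimal pure-Python sketch. This is not the TTTrack implementation: the feature maps are toy single-channel grids with made-up values, the function names `cross_correlate`, `fuse`, and `argmax_2d` are hypothetical, and the fixed weighted average in `fuse` is only an assumed stand-in for the Transformer-based fusion, which the abstract does not specify in detail.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (valid positions only) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the window at (i, j).
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def fuse(resp_a, resp_b, weight=0.5):
    """Assumed stand-in for the Transformer fusion: a fixed convex blend of two maps."""
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(resp_a, resp_b)
    ]


def argmax_2d(response):
    """Return the (row, col) index of the largest response value."""
    best = max(
        (value, i, j)
        for i, row in enumerate(response)
        for j, value in enumerate(row)
    )
    return best[1], best[2]


# Toy single-channel "feature maps" (assumed values, for illustration only).
search = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
base_template = [[1, 2], [3, 4]]      # fixed template, e.g. from the first frame
dynamic_template = [[1, 1], [3, 5]]   # template updated from recent frames

base_response = cross_correlate(search, base_template)
dynamic_response = cross_correlate(search, dynamic_template)
fused_response = fuse(base_response, dynamic_response)
peak = argmax_2d(fused_response)  # estimated target location in the response map
```

In this toy case both templates respond most strongly at the same window, so the fused peak agrees with each individual map. The motivation for keeping two templates is the case where they disagree, for example when the target's appearance has changed since the first frame; a learned fusion can then weight the maps per location rather than with a single fixed `weight` as assumed here.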