
Computer Engineering, 2023, Vol. 49, Issue (4): 281-288,296. doi: 10.19678/j.issn.1000-3428.0064096

• Development Research and Engineering Application •

Object Tracking Algorithm Combining Convolution and Transformer

WANG Chunlei1,2,3, ZHANG Jianlin1,2, LI Meihui1,2, XU Zhiyong1,2, WEI Yuxing1,2

  1. Key Laboratory of Beam Control, Chinese Academy of Sciences, Chengdu 610209, China;
  2. Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China;
  3. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-03-04 Revised: 2022-04-21 Published: 2023-04-07
  • About the authors: WANG Chunlei (born 1996), male, master's student, whose main research interest is object tracking; ZHANG Jianlin (corresponding author), research fellow, Ph.D., doctoral supervisor; LI Meihui, Ph.D.; XU Zhiyong, research fellow, doctoral supervisor; WEI Yuxing, associate research fellow.
  • Funding:
    Young Scientists Fund of the National Natural Science Foundation of China, "Research on Multi-Spectral Object Tracking Methods Based on Cross-Metric Cross-Modal Learning" (62101529).

Abstract: Existing Transformer-based object tracking algorithms do not fully exploit the Transformer's long-range dependency modeling, so the extracted features lack discriminability and tracking stability is poor. To improve the tracking ability of Siamese-network trackers in complex scenes, an object tracking algorithm, CTTrack, is proposed that combines the advantages of convolution and Transformer. For feature extraction, the rich local information of convolution and the long-range dependency modeling of the Transformer are used to build a general object tracking backbone network, CTFormer, by connecting convolution and window attention in series within a hierarchical structure. For feature fusion, only the Cross-Attention Mechanism (CAM) is used to construct the feature mutual-enhancement and aggregation networks, which simplifies the network structure and increases tracking speed. For search area selection, a tracking strategy that adaptively adjusts the search area is designed based on an estimate of the object's motion speed. Experimental results show that CTTrack achieves an Average Overlap (AO) of 70.3% on the GOT-10k dataset, 3.2 percentage points higher than both TransT and TrDiMP, and an Area Under the Curve (AUC) of 71.1% on the UAV123 dataset, 2.0 and 3.6 percentage points higher than TransT and TrDiMP, respectively. On the TrackingNet, LaSOT, OTB2015, and NFS datasets, CTTrack achieves AUCs of 82.1%, 66.8%, 70.1%, and 66.3%, respectively, and it tracks in real time at 43 frames/s.
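
For readers unfamiliar with the AO figure reported in the abstract, here is a minimal sketch of an average-overlap computation for a single sequence: the mean intersection-over-union (IoU) between predicted and ground-truth boxes over all frames. The function names `iou` and `average_overlap` and the `(x, y, w, h)` box layout are assumptions made for this illustration; they are not taken from the paper or from any benchmark toolkit, whose exact averaging protocol may differ.

```python
from typing import Sequence, Tuple

# Hypothetical box layout for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over the frames of one sequence."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and equally long")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

In this sketch, a frame where the prediction misses the target entirely contributes an overlap of zero, so a single lost frame lowers the sequence average.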

Key words: Siamese network, Transformer object tracking, window attention, cross-attention, motion estimation, search area
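
The abstract mentions adapting the search area to an estimate of the object's motion speed but does not state the rule. The sketch below illustrates one way such a strategy could look, not the authors' method: the speed is taken as the mean per-frame displacement of recent target centers, and the search-area scale grows linearly with speed relative to target size, up to a cap. The names `estimate_speed` and `search_scale` and the constants `base_scale`, `max_scale`, and `gain` are all hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def estimate_speed(centers: Sequence[Point]) -> float:
    """Mean per-frame displacement (pixels/frame) of recent target centers."""
    if len(centers) < 2:
        return 0.0
    steps = [math.dist(p, q) for p, q in zip(centers, centers[1:])]
    return sum(steps) / len(steps)


def search_scale(
    speed: float,
    target_size: float,
    base_scale: float = 4.0,  # hypothetical default search-area factor
    max_scale: float = 6.0,   # hypothetical upper bound
    gain: float = 2.0,        # hypothetical sensitivity to speed
) -> float:
    """Search-area side length as a multiple of the target size.

    Faster targets (relative to their own size) get a larger search area,
    clipped at max_scale so the crop does not grow without bound.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    relative_speed = speed / target_size
    return min(max_scale, base_scale + gain * relative_speed)
```

A stationary target keeps the base scale, while a target moving half its own size per frame would, under these assumed constants, get a scale of 5.0.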

CLC number: