Object Tracking Algorithm Combining Convolution and Transformer

doi:10.19678/j.issn.1000-3428.0064096

Abstract

Abstract: Existing Transformer-based object tracking algorithms do not fully exploit the Transformer's ability to model long-range dependencies, so the features they extract are insufficiently discriminative and tracking stability is poor. To improve the tracking ability of Siamese network trackers in complex scenes, an object tracking algorithm, CTTrack, is proposed that combines the advantages of convolution and the Transformer. For feature extraction, the rich local information of convolution is combined with the long-range dependency modeling of the Transformer: convolution and window attention are connected in series within a hierarchical structure to build a general-purpose object tracking backbone network, CTFormer. For feature fusion, only the Cross-Attention Mechanism (CAM) is used to build the feature mutual enhancement and aggregation network, which simplifies the network structure and increases tracking speed. For search area selection, a tracking strategy that adaptively adjusts the search area is designed based on an estimate of the object's motion speed. Experimental results show that CTTrack achieves an Average Overlap (AO) of 70.3% on the GOT-10k dataset, 3.2 percentage points higher than both TransT and TrDiMP, and an Area Under the Curve (AUC) of 71.1% on the UAV123 dataset, 2.0 and 3.6 percentage points higher than TransT and TrDiMP, respectively. On the TrackingNet, LaSOT, OTB2015, and NFS datasets, CTTrack achieves AUCs of 82.1%, 66.8%, 70.1%, and 66.3%, respectively, and it tracks in real time at 43 frames/s.

Key words: Siamese network, Transformer object tracking, window attention, cross-attention, motion estimation, search area
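The Average Overlap (AO) figure quoted in the abstract is, in essence, the mean intersection over union (IoU) between predicted and ground-truth boxes. The following is a minimal sketch of that idea for a single sequence, assuming boxes in `(x, y, width, height)` form; the names `iou` and `average_overlap` are illustrative and are not taken from the paper or from the GOT-10k toolkit, and the benchmark's exact evaluation protocol (for example, how frames and sequences are aggregated) is not reproduced here.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x, y, width, height) with (x, y) the top-left corner.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred: Sequence[Box], truth: Sequence[Box]) -> float:
    """Mean IoU over paired frames (a simplified, single-sequence AO)."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(iou(p, t) for p, t in zip(pred, truth)) / len(pred)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    ground_truth = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0)]
    print(f"AO = {average_overlap(predictions, ground_truth):.3f}")
```

In the usage example, the first frame overlaps perfectly (IoU of 1) and the second overlaps by half a box width (IoU of 1/3), so the simplified AO is their mean.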

CLC number: TP391

WANG Chunlei, ZHANG Jianlin, LI Meihui, XU Zhiyong, WEI Yuxing. Object Tracking Algorithm Combining Convolution and Transformer[J]. Computer Engineering, 2023, 49(4): 281-288,296.

https://www.ecice06.com/CN/Y2023/V49/I4/281

Figures/Tables: 13

