Method for Fast Video Object Detection Based on Local Attention

doi:10.19678/j.issn.1000-3428.0061362

Abstract

Video object detection aims to accurately classify and localize objects in a video. Existing deep-learning-based video object detection methods propagate features through optical flow. This requires a large number of model parameters, and applying optical flow directly to high-level features makes it difficult to establish accurate spatial correspondence. This study proposes a lightweight video object detection method. A feature propagation model is designed to propagate high-level features from key frames to non-key frames within local regions of different frames, so that the limited computing resources are allocated to key frames and detection is accelerated. A dynamic key frame allocation module is constructed to adjust the key frame selection interval according to the target motion speed, which reduces computation and improves detection accuracy. On this basis, to further reduce the maximum delay, an asynchronous detection mode is proposed in which the feature propagation model and the key frame selection module work cooperatively. Experimental results show that the method achieves a detection speed of 31.8 frame/s and a maximum delay of 31 ms. Compared with the memory-enhanced global-local aggregation method, it detects faster while maintaining detection accuracy, and it achieves real-time online video object detection.

Key words: video object detection, local attention, feature propagation, depthwise separable convolution, dynamic allocation, asynchronous detection
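
The idea of propagating key-frame features within local regions can be illustrated with a minimal sketch. This is not the authors' model: it assumes that the non-key frame supplies cheap query features, that the key frame's high-level features serve as both keys and values, and that each location attends only to a square neighbourhood of a hypothetical size `radius`. The function name `local_attention_propagate` and the nested-list layout are likewise illustrative choices.

```python
import math


def local_attention_propagate(key_feat, query_feat, radius=1):
    """Propagate key-frame features to a non-key frame with local attention.

    key_feat and query_feat are H x W grids of C-dimensional vectors
    (nested lists) with identical shapes. Each non-key-frame location
    attends only to key-frame locations at most `radius` cells away.
    All names and the layout are assumptions made for illustration.
    """
    height, width = len(query_feat), len(query_feat[0])
    channels = len(query_feat[0][0])
    scale = 1.0 / math.sqrt(channels)  # scaled dot-product attention
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            q = query_feat[y][x]
            # Key-frame vectors inside the local window, clipped at borders.
            neighbours = [
                key_feat[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in neighbours]
            peak = max(scores)  # subtract the maximum for a stable softmax
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            # Attention-weighted average of the neighbouring key-frame vectors.
            row.append([
                sum(w * k[c] for w, k in zip(weights, neighbours)) / total
                for c in range(channels)
            ])
        out.append(row)
    return out
```

With `radius=0` each location simply copies the key-frame feature at the same position; larger radii let the propagated feature follow small object motion between the key frame and the non-key frame, at a cost that grows with the window area rather than with the full frame size.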

CLC number: TP391.41

SHI Yuhu, ZHANG Qigui. Method for Fast Video Object Detection Based on Local Attention[J]. Computer Engineering, 2022, 48(5): 314-320.

http://www.ecice06.com/CN/Y2022/V48/I5/314

Figures/Tables: 13
