[1] HE K M, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2016: 770-778.
[2] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137-1149.
[3] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2014: 580-587.
[4] DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2016: 379-387.
[5] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2016: 21-37.
[6] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2016: 779-788.
[7] ZHU X, XIONG Y, DAI J, et al. Deep feature flow for video recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 2349-2358.
[8] ZHU X, DAI J, YUAN L, et al. Towards high performance video object detection[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7210-7218.
[9] DOSOVITSKIY A, FISCHER P, ILG E, et al. FlowNet: learning optical flow with convolutional networks[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2015: 2758-2766.
[10] ZHU X Z. Flow-based video object detection[D]. Hefei: University of Science and Technology of China, 2020. (in Chinese)
[11] DONG X X. Optical-flow-guided multi-keyframes feature propagation and aggregation for video object detection[D]. Beijing: Beijing University of Posts and Telecommunications, 2019. (in Chinese)
[12] LIU Y J, CAO X Z, LI Z M, et al. Video object detection based on correlation feature and convolutional neural network[J]. Journal of South China University of Technology (Natural Science Edition), 2018, 46(12): 26-33. (in Chinese)
[13] HU H, GU J, ZHANG Z, et al. Relation networks for object detection[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 3588-3597.
[14] DENG J, PAN Y, YAO T, et al. Relation distillation networks for video object detection[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2019: 7023-7032.
[15] WANG C J, DING Y, LU P C. Object detection using Faster R-CNN combining improved FPN and relation network[J]. Computer Engineering, 2022, 48(2): 173-179. (in Chinese)
[16] CHEN Y, CAO Y, HU H, et al. Memory enhanced global-local aggregation for video object detection[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 10337-10346.
[17] DAI Z, YANG Z, YANG Y, et al. Transformer-XL: attentive language models beyond a fixed-length context[EB/OL]. [2021-03-01]. https://arxiv.org/pdf/1901.02860.pdf.
[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2017: 6000-6010.
[19] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[EB/OL]. [2021-03-01]. https://arxiv.org/pdf/1409.0473.pdf.
[20] XIAO F, LEE Y J. Video object detection with an aligned spatial-temporal memory[EB/OL]. [2021-03-01]. https://arxiv.org/pdf/1712.06317v2.pdf.
[21] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. [2021-03-01]. https://arxiv.org/pdf/1704.04861.pdf.
[22] CAO Y K, GUI L A. Design of lightweight temporal convolutional network based on depthwise separable convolution[J]. Computer Engineering, 2020, 46(9): 95-100, 109. (in Chinese)
[23] SHELHAMER E, RAKELLY K, HOFFMAN J, et al. Clockwork convnets for video semantic segmentation[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2016: 852-868.
[24] LIU M S, ZHU M L, WHITE M, et al. Looking fast and slow: memory-guided mobile video object detection[EB/OL]. [2021-03-01]. https://arxiv.org/pdf/1903.10172.pdf.
[25] GADDE R, JAMPANI V, GEHLER P V. Semantic video CNNs through representation warping[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2017: 4453-4462.
[26] CHEN K, WANG J, YANG S, et al. Optimizing video object detection via a scale-time lattice[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7814-7823.
[27] SHVETS M, LIU W, BERG A C. Leveraging long-range temporal relationships between proposals for video object detection[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2019: 9756-9764.
[28] JIANG Z, LIU Y, YANG C, et al. Learning where to focus for efficient video object detection[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 18-34.
[29] WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7794-7803.