
Computer Engineering


Video Text Semantic Alignment and Full Video Dependency for Weakly Supervised Action Localization


  • Published: 2025-07-22


Abstract: To address the challenges in existing weakly supervised temporal action localization research, such as the underutilization of the temporal characteristics of actions, their global properties, and action semantic consistency, a method based on video-text semantic alignment and full-video dependency (FVD-ALM) is proposed, which exploits multi-source information to improve the accuracy and robustness of action localization. First, dilated convolutions expand the model's receptive field, and attention mechanisms precisely enhance the temporal features of action instances, ensuring accurate temporal feature extraction and capturing the dynamic changes of actions. Then, an expectation-maximization algorithm based on a Gaussian mixture model extracts and enhances global information from the video, generating accurate temporal class activation maps that aid the localization process. Finally, a video-text semantic alignment module is designed to understand actions comprehensively by combining the textual information in action labels; the model is trained to complete textual descriptions of actions, strengthening its awareness of action-category consistency and enabling it to effectively distinguish different action categories. Experimental results on the THUMOS14 and ActivityNet1.3 datasets confirm the effectiveness of the method: it achieves an average mAP of 39.1% on THUMOS14, a 2.0-percentage-point improvement over the DTRP-Loc method. This demonstrates that integrating multi-source information significantly improves the accuracy of action localization and provides an effective solution for weakly supervised action localization tasks.
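The abstract does not give implementation details, but the first step, widening the temporal receptive field with dilated convolutions and re-weighting snippet features with attention, can be illustrated with a minimal PyTorch-style sketch. The module names, dilation rates, and feature dimensions below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalEnhancer(nn.Module):
    """Illustrative sketch: dilated temporal convolutions to enlarge the
    receptive field, followed by snippet-level attention re-weighting.
    Dilation rates and dimensions are assumptions, not from the paper."""
    def __init__(self, dim=2048, dilations=(1, 2, 4)):
        super().__init__()
        # Stacked 1-D convolutions over the time axis with growing dilation.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        )
        # A lightweight attention branch producing one weight per snippet.
        self.attn = nn.Sequential(
            nn.Conv1d(dim, 256, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(256, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (batch, dim, T) snippet features
        for conv in self.convs:
            x = torch.relu(conv(x)) + x     # residual dilated convolution
        a = self.attn(x)                    # (batch, 1, T) attention weights
        return x * a, a                     # enhanced features and weights


# Usage: enhance I3D-style snippet features of a 750-snippet untrimmed video.
feats = torch.randn(2, 2048, 750)
enhanced, attn = TemporalEnhancer()(feats)
print(enhanced.shape, attn.shape)           # (2, 2048, 750) and (2, 1, 750)
```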
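The second step fits a Gaussian mixture over the snippet features of a whole video with expectation-maximization so that video-level (global) structure can support the temporal class activation maps. The sketch below shows a generic soft-EM pass over snippet features; the component count, the use of scikit-learn's GaussianMixture, and the responsibility-weighted aggregation are assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def global_context(snippet_feats, n_components=4):
    """Illustrative sketch: fit a GMM with EM over one video's snippets and
    build a global descriptor from responsibility-weighted component means.
    Component count and aggregation are assumptions, not from the paper."""
    # snippet_feats: (T, D) features of one untrimmed video
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    gmm.fit(snippet_feats)
    resp = gmm.predict_proba(snippet_feats)           # (T, K) E-step posteriors
    # Responsibility-weighted pooling: one D-dim summary per mixture component.
    weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-8)
    centers = weights.T @ snippet_feats               # (K, D) global summaries
    # Broadcast the global summaries back to every snippet as extra context.
    context = resp @ centers                          # (T, D)
    return context

feats = np.random.randn(750, 2048).astype(np.float32)
ctx = global_context(feats)
print(ctx.shape)                                      # (750, 2048)
```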
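The last step aligns video features with text representations of the action labels, training the model to complete a textual description of the action (for example, a prompt built from the class name). The CLIP-style similarity loss below is only one plausible realization of such an alignment objective; the temperature, the cosine-similarity formulation, and the prompt idea are assumptions, not the paper's stated design.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, text_emb, labels, temperature=0.07):
    """Illustrative sketch of a video-text semantic alignment objective:
    each pooled video embedding is pulled toward the text embedding of its
    action label and pushed away from the other labels. Temperature and the
    cosine-similarity formulation are assumptions, not from the paper."""
    v = F.normalize(video_emb, dim=-1)        # (B, D) pooled video embeddings
    t = F.normalize(text_emb, dim=-1)         # (C, D) one embedding per class
    logits = v @ t.T / temperature            # (B, C) similarity scores
    return F.cross_entropy(logits, labels)    # align each video with its label text

# Usage with random stand-ins for encoder outputs (20 classes, 512-d space).
video_emb = torch.randn(8, 512)
text_emb = torch.randn(20, 512)               # e.g. embeddings of label prompts
labels = torch.randint(0, 20, (8,))
print(alignment_loss(video_emb, text_emb, labels).item())
```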
