Dual-Process Short Video Classification Method Based on Deep Learning

doi:10.19678/j.issn.1000-3428.0061913

Abstract

As smartphones and 5G networks have become increasingly popular, short videos have become a main medium through which people acquire knowledge in short periods of spare time. To address the shortage of short video datasets covering real-life scenarios and the low accuracy of short video classification, this study proposes a dual-process short video classification method that integrates deep learning. In the main process, an A-VGG-3D network model is constructed: a VGG network with an attention mechanism extracts features, and an optimized 3D Convolutional Neural Network (3DCNN) performs the short video classification, which improves the continuity, balance, and robustness of short videos in the temporal dimension. In the auxiliary process, the frame difference method is used to detect shot switches and extract several frames from each short video. Multi-scale face detection is then performed on the extracted frames by combining a sliding window mechanism with a cascade classifier, which further improves the classification accuracy. Experimental results show that, on the UCF101 dataset and a self-built short video dataset of life scenes, the precision and recall of the method for non-plot and non-interview short videos reach up to 98.9% and 98.6%, respectively. Compared with a short video classification method based on the C3D network, the classification accuracy of the proposed method on the UCF101 dataset is 9.7 percentage points higher, which indicates that the proposed method is more generally applicable.
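The frame difference step of the auxiliary process can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-absolute-difference score, and the threshold value of 30.0 are all assumptions chosen for illustration, and frames are represented as plain nested lists of grayscale intensities rather than decoded video.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute intensity difference between two equally sized frames.

    Each frame is a list of rows, each row a list of grayscale values (0-255).
    """
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )
    pixel_count = sum(len(row) for row in frame_a)
    return total / pixel_count


def detect_shot_changes(frames, threshold=30.0):
    """Return each index i where frame i differs strongly from frame i - 1.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    cuts = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts


def sample_key_frames(frames, threshold=30.0):
    """Pick the index of the first frame of every detected shot."""
    if len(frames) == 0:
        return []
    return [0] + detect_shot_changes(frames, threshold)


# Synthetic clip: three dark frames followed by three bright frames.
dark = [[10] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
clip = [dark, dark, dark, bright, bright, bright]
key_frames = sample_key_frames(clip)
```

In this sketch the selected frame indices would then be passed to the face detection stage. A real pipeline would read frames from a video decoder and tune the threshold on held-out data.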

Key words: 3D Convolutional Neural Network (3DCNN), deep learning, VGG network, attention mechanism, short video classification

CLC Number: TP391

ZHANG Aihan, LIU Xiang, SHI Yunyu, LIU Siqi. Dual-Process Short Video Classification Method Based on Deep Learning[J]. Computer Engineering, 2022, 48(7): 277-283.

张瑷涵, 刘翔, 石蕴玉, 刘思齐. 基于深度学习的双流程短视频分类方法[J]. 计算机工程, 2022, 48(7): 277-283.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0061913

http://www.ecice06.com/EN/Y2022/V48/I7/277

Figures/Tables 10
