Research Hotspots and Reviews
ZHAO Hong, GUO Lan, CHEN Zhiwen, ZHENG Houze
Aiming at the challenges of single-modality representation and the low accuracy of existing video content text-representation models, a video content text-representation model that integrates frame-level image and audio information is proposed. The network structure of the model includes a single-mode embedding layer based on a self-attention mechanism, which learns the feature parameters of each modality. Two schemes, joint representation and cooperative representation, are adopted to fuse the high-dimensional feature vectors output by the single-mode embedding layer, so that the model can focus on different objects in the video and their interactions, thereby generating richer and more accurate video text representations. The model is pretrained on large-scale datasets; representation information carried by the video, such as video frames and audio, is extracted and sent to the encoder to realize the text representation of the video content. Experimental results on the MSR-VTT and LSMDC datasets show that the BLEU-4, METEOR, ROUGE-L, and CIDEr scores of the proposed model are 0.386, 0.250, 0.609, and 0.463, respectively. Compared with the model released by IIT Delhi in the MSR-VTT challenge, the proposed model improves these indexes by 0.082, 0.037, 0.115, and 0.257, respectively. The proposed model can effectively improve the accuracy of video content text representation.
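To make the two fusion schemes concrete, the following is a minimal NumPy sketch, not the paper's implementation: each modality (frame features, audio features) passes through a scaled dot-product self-attention step standing in for the single-mode embedding layer, then joint representation concatenates the pooled modality vectors into one space, while cooperative representation keeps them separate and relates them through a similarity measure. All function names, dimensions, and the cosine-similarity choice are illustrative assumptions, not details from the paper.

```python
import numpy as np

def self_attention(x):
    # Scaled dot-product self-attention over one modality's sequence.
    # x: (seq_len, d) feature matrix (e.g., frame or audio embeddings).
    # Stand-in for the paper's single-mode embedding layer (assumption).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

def joint_representation(video_feat, audio_feat):
    # Joint representation: project both modalities into a single vector
    # by concatenating their attention-pooled embeddings.
    v = self_attention(video_feat).mean(axis=0)
    a = self_attention(audio_feat).mean(axis=0)
    return np.concatenate([v, a])

def cooperative_representation(video_feat, audio_feat):
    # Cooperative (coordinated) representation: keep per-modality vectors
    # separate and relate them via a similarity score (cosine here,
    # an illustrative choice).
    v = self_attention(video_feat).mean(axis=0)
    a = self_attention(audio_feat).mean(axis=0)
    sim = v @ a / (np.linalg.norm(v) * np.linalg.norm(a))
    return v, a, sim

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))   # 8 frames, 16-dim features
audio = rng.standard_normal((8, 16))   # 8 audio windows, 16-dim features
fused = joint_representation(video, audio)   # shape (32,)
```

In the full model, either fused output would be fed to the encoder-decoder that generates the text description; here the sketch only shows how the two fusion schemes differ structurally.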