Multifaceted Feature Coding Image Caption Generation Algorithm Based on Transformer

doi:10.19678/j.issn.1000-3428.0064450

Abstract

Abstract: Object features extracted by object detection algorithms play an increasingly critical role in the generation of image captions.However, only using the features of object detection as the input of an image caption task can lead to the loss of other information except the key object information and generation of a caption that lacks an accurate expression of its relationship with the image object.To solve these disadvantages, an object Transformer encoder for encoding object features in an image and a shift window Transformer for encoding relational features in an image are proposed to make joint efforts to encode different aspects of information in an image.The object features of the object Transformer encoder are fused with the relational features of the shift window Transformer by splicing method, to achieve the purpose of fusion of the internal relational and local object features.Finally, a Transformer decoder is utilized to decode the fused coding features and generate the corresponding image caption.Extensive experiments on the Common Objects in COntext(MS-COCO) dataset and comparison with the current classical model algorithm show that the performance of the proposed model is significantly better than that of the baseline model.The experimental results indicate that the scores of BiLingual Evaluation Understudy 4-gram(BLEU-4), Metric for Evaluation of Translation with Explicit ORdering(METEOR), Recall-Oriented Understudy for Gisting Evaluation-Longest common subsequence(ROUGE-L), and Consensus-based Image Description Evaluation(CIDEr) metrics can reach 38.6%, 28.7%, 58.2% and 127.4% respectively, better than those of the traditional image caption algorithm.Moreover, it can generate more detailed and accurate captions.

Key words: image caption, shift window, multi-headed attention mechanism, multimodal task, Transformer encoder

摘要： 由目标检测算法提取的目标特征在图像描述生成任务中发挥重要作用，但仅使用对图像进行目标检测的特征作为图像描述任务的输入会导致除关键目标信息以外的其余信息获取缺失，且生成的文本描述对图像内目标之间的关系缺乏准确表达。针对上述不足，提出用于编码图像内目标特征的目标Transformer编码器，以及用于编码图像内关系特征的转换窗口Transformer编码器，从不同角度对图像内不同方面的信息进行联合编码。通过拼接方法将目标Transformer编码的目标特征与转换窗口Transformer编码的关系特征相融合，达到图像内部关系特征和局部目标特征融合的目的，最终使用Transformer解码器将融合后的编码特征解码生成对应的图像描述。在MS-COCO数据集上进行实验，结果表明，所构建模型性能明显优于基线模型，BLEU-4、METEOR、ROUGE-L、CIDEr指标分别达到38.6%、28.7%、58.2%和127.4%，优于传统图像描述网络模型，能够生成更详细准确的图像描述。

关键词: 图像描述, 转换窗口, 多头注意力机制, 多模态任务, Transformer编码器

CLC Number:

TP391.41

HENG Hongjun, FAN Yuchen, WANG Jialiang. Multifaceted Feature Coding Image Caption Generation Algorithm Based on Transformer[J]. Computer Engineering, 2023, 49(2): 199-205.

衡红军, 范昱辰, 王家亮. 基于Transformer的多方面特征编码图像描述生成算法[J]. 计算机工程, 2023, 49(2): 199-205.

/ / Recommend / Download Citations

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0064450

http://www.ecice06.com/EN/Y2023/V49/I2/199

Figures/Tables 8

References

[1] KULKARNI G, PREMRAJ V, ORDONEZ V, et al.Babytalk:understanding and generating simple image descriptions[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12):2891-2903.
[2] LI S, KULKARNI G, BERG T, et al.Composing simple image descriptions using web-scale n-grams[C]//Proceedings of the 15th Conference on Computational Natural Language Learning.Washington D.C., USA:IEEE Press, 2011:220-228.
[3] HODOSH M, YOUNG P, HOCKENMAIER J.Framing image description as a ranking task:data, models and evaluation metrics[J].Journal of Artificial Intelligence Research, 2013, 47:853-899.
[4] MAO J H, XU W, YANG Y, et al.Explain images with multimodal recurrent neural networks[EB/OL].[2022-03-12].https://arxiv.org/abs/1410.1090.
[5] VINYALS O, TOSHEV A, BENGIO S, et al.Show and tell:a neural image caption generator[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:3156-3164.
[6] KARPATHY A, LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of IEEE Transactions on Pattern Analysis and Machine Intelligence.Washington D.C., USA:IEEE Press, 2016:664-676.
[7] YANG Z L, YUAN Y, WU Y X, et al.Encode, review, and decode:reviewer module for caption generation[EB/OL].[2022-03-12].https://arxiv.org/abs/1605.07912.
[8] XU K, BA J, KIROS R, et al.Show, attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning.New York, USA:ACM Press, 2015:2048-2057.
[9] LU J S, XIONG C M, PARIKH D, et al.Knowing when to look:adaptive attention via a visual sentinel for image captioning[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:3242-3250.
[10] ANDERSON P, HE X D, BUEHLER C, et al.Bottom-up and top-down attention for image captioning and visual question answering[EB/OL].[2022-03-12].https://arxiv.org/abs/1707.07998.
[11] VASWANI A, SHAZEER N, PARMAR N, et al.Attention is all You need[EB/OL].[2022-03-12].https://arxiv.org/abs/1706.03762.
[12] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al.An image is worth 16×16 words:transformers for image recognition at scale[EB/OL].[2022-03-12].https://arxiv.org/abs/2010.11929.
[13] LIU Z, LIN Y T, CAO Y, et al.Swin Transformer:hierarchical vision transformer using shifted windows[EB/OL].[2022-03-12].https://arxiv.org/abs/2103.14030.
[14] REN S Q, HE K M, GIRSHICK R, et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149.
[15] LIN T Y, MAIRE M, BELONGIE S, et al.Microsoft COCO:common objects in context[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2014:740-755.
[16] PAPINENI K, ROUKOS S, WARD T, et al.BLEU:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.New York, USA:ACM Press, 2002:311-318.
[17] VEDANTAM R, LAWRENCE ZITNICK C, PARIKH D.CIDEr:consensus-based image description evaluation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:4566-4575.
[18] DENKOWSKI M, LAVIE A.Meteor universal:language specific translation evaluation for any target language[C]//Proceedings of the Ninth Workshop on Statistical Machine Translation.Stroudsburg, USA:Association for Computational Linguistics, 2014:376-380.
[19] LIN C Y.ROUGE:a package for automatic evaluation of summaries[C]//Proceedings of Workshop on Text Summarization Branches Out.Barcelona, Spain:[s.n.], 2004:74-81.
[20] DENG J, DONG W, SOCHER R, et al.ImageNet:a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2009:248-255.
[21] HE K M, ZHANG X Y, REN S Q, et al.Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:770-778.
[22] RENNIE S J, MARCHERET E, MROUEH Y, et al.Self-critical sequence training for image captioning[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:1179-1195.
[23] YAO T, PAN Y W, LI Y H, et al.Boosting image captioning with attributes[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:4904-4912.
[24] 韦人予, 蒙祖强.基于注意力特征自适应校正的图像描述模型[J].计算机应用, 2020, 40(S1):45-50. WEI R Y, MENG Z Q.Image caption model based on attention feature adaptive recalibration[J].Journal of Computer Applications, 2020, 40(S1):45-50.(in Chinese)
[25] ZHONG X, NIE G Z, HUANG W X, et al.Attention-guided image captioning with adaptive global and local feature fusion[J].Journal of Visual Communication and Image Representation, 2021, 78:103138.
[26] YUN J, XU Z W, GAO G L.Gated object-attribute matching network for detailed image caption[J].Mathematical Problems in Engineering, 2020, 2020:1-11.

Please choose a citation manager

Content to export