Author Login Editor-in-Chief Peer Review Editor Work Office Work

Computer Engineering ›› 2023, Vol. 49 ›› Issue (2): 199-205. doi: 10.19678/j.issn.1000-3428.0064450

• Graphics and Image Processing • Previous Articles     Next Articles

Multifaceted Feature Coding Image Caption Generation Algorithm Based on Transformer

HENG Hongjun, FAN Yuchen, WANG Jialiang   

  1. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received:2022-04-12 Revised:2022-05-20 Published:2022-06-21

基于Transformer的多方面特征编码图像描述生成算法

衡红军, 范昱辰, 王家亮   

  1. 中国民航大学 计算机科学与技术学院, 天津 300300
  • 作者简介:衡红军(1968-),男,副教授、博士,主研方向为图像描述;范昱辰(通信作者),硕士研究生;王家亮,讲师、博士。
  • 基金资助:
    国家自然科学基金(U1333109)。

Abstract: Object features extracted by object detection algorithms play an increasingly critical role in the generation of image captions.However, only using the features of object detection as the input of an image caption task can lead to the loss of other information except the key object information and generation of a caption that lacks an accurate expression of its relationship with the image object.To solve these disadvantages, an object Transformer encoder for encoding object features in an image and a shift window Transformer for encoding relational features in an image are proposed to make joint efforts to encode different aspects of information in an image.The object features of the object Transformer encoder are fused with the relational features of the shift window Transformer by splicing method, to achieve the purpose of fusion of the internal relational and local object features.Finally, a Transformer decoder is utilized to decode the fused coding features and generate the corresponding image caption.Extensive experiments on the Common Objects in COntext(MS-COCO) dataset and comparison with the current classical model algorithm show that the performance of the proposed model is significantly better than that of the baseline model.The experimental results indicate that the scores of BiLingual Evaluation Understudy 4-gram(BLEU-4), Metric for Evaluation of Translation with Explicit ORdering(METEOR), Recall-Oriented Understudy for Gisting Evaluation-Longest common subsequence(ROUGE-L), and Consensus-based Image Description Evaluation(CIDEr) metrics can reach 38.6%, 28.7%, 58.2% and 127.4% respectively, better than those of the traditional image caption algorithm.Moreover, it can generate more detailed and accurate captions.

Key words: image caption, shift window, multi-headed attention mechanism, multimodal task, Transformer encoder

摘要: 由目标检测算法提取的目标特征在图像描述生成任务中发挥重要作用,但仅使用对图像进行目标检测的特征作为图像描述任务的输入会导致除关键目标信息以外的其余信息获取缺失,且生成的文本描述对图像内目标之间的关系缺乏准确表达。针对上述不足,提出用于编码图像内目标特征的目标Transformer编码器,以及用于编码图像内关系特征的转换窗口Transformer编码器,从不同角度对图像内不同方面的信息进行联合编码。通过拼接方法将目标Transformer编码的目标特征与转换窗口Transformer编码的关系特征相融合,达到图像内部关系特征和局部目标特征融合的目的,最终使用Transformer解码器将融合后的编码特征解码生成对应的图像描述。在MS-COCO数据集上进行实验,结果表明,所构建模型性能明显优于基线模型,BLEU-4、METEOR、ROUGE-L、CIDEr指标分别达到38.6%、28.7%、58.2%和127.4%,优于传统图像描述网络模型,能够生成更详细准确的图像描述。

关键词: 图像描述, 转换窗口, 多头注意力机制, 多模态任务, Transformer编码器

CLC Number: