
Computer Engineering ›› 2024, Vol. 50 ›› Issue (8): 229-238. doi: 10.19678/j.issn.1000-3428.0068402

• Graphics and Image Processing •

Image Captioning Method Based on Transformer Visual Features Fusion

Xuebing BAI1,2, Jin CHE2,3,*, Jinman WU2,3, Yumin CHEN2,3

1. School of Advanced Interdisciplinary, Ningxia University, Zhongwei 755000, Ningxia, China
    2. Ningxia Key Laboratory of Intelligent Sensing for Desert Information, Ningxia University, Yinchuan 750021, Ningxia, China
    3. School of Electronic and Electrical Engineering, Ningxia University, Yinchuan 750021, Ningxia, China
  • Received: 2023-09-17  Online: 2024-08-15  Published: 2023-12-28
  • Contact: Jin CHE
  • Supported by: National Natural Science Foundation of China (62366042); Natural Science Foundation of Ningxia (2023AAC03127)

Abstract:

Existing image captioning methods use only regional visual features to generate captions, ignoring the importance of grid visual features; moreover, they are two-stage approaches, which limits captioning quality. To address these issues, this study proposes an end-to-end image captioning method based on Transformer visual feature fusion. First, in the feature extraction stage, a visual feature extractor extracts regional and grid visual features. Second, in the feature fusion stage, the regional and grid visual features are concatenated by a visual feature fusion module. Finally, all visual features are fed into a language generator to produce the image caption. Every component of the method is built on the Transformer model, so the method operates in a single stage. Experimental results on the MS-COCO dataset show that the proposed method fully utilizes the respective advantages of regional and grid visual features, with the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics reaching 83.1%, 41.5%, 30.2%, 60.1%, 140.3%, and 23.9%, respectively; it outperforms current mainstream image captioning methods and generates more accurate and richer captions.

Key words: image captioning, regional visual features, grid visual features, Transformer model, end-to-end training
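
Illustrative sketch: the abstract describes concatenating region-level and grid-level visual features and feeding them to a Transformer-based language generator. The code below is a minimal sketch of that idea, not the authors' implementation; the PyTorch modules, feature dimensions (2048-d region features, 768-d grid features), layer counts, and vocabulary size are assumptions chosen only for this example.

# Minimal sketch (assumed PyTorch modules and dimensions), not the paper's code.
import torch
import torch.nn as nn

class FusionCaptioner(nn.Module):
    def __init__(self, region_dim=2048, grid_dim=768, d_model=512, vocab_size=10000):
        super().__init__()
        # Project both visual feature types to a common width before fusion.
        self.region_proj = nn.Linear(region_dim, d_model)
        self.grid_proj = nn.Linear(grid_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, grid_feats, caption_tokens):
        # Fusion by concatenation along the token (sequence) dimension.
        vis = torch.cat([self.region_proj(region_feats),
                         self.grid_proj(grid_feats)], dim=1)
        memory = self.encoder(vis)
        # Language generator: Transformer decoder over caption tokens
        # (causal target mask omitted here for brevity).
        out = self.decoder(self.embed(caption_tokens), memory)
        return self.lm_head(out)  # per-token vocabulary logits

# Toy usage: batch of 2 images, 36 region tokens and 49 grid tokens each.
model = FusionCaptioner()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 49, 768),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])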