
Computer Engineering (计算机工程), 2024, Vol. 50, Issue (6): 236-244. doi: 10.19678/j.issn.1000-3428.0067302

• Graphics and Image Processing •


Multi-Band Image Caption Generation Method Based on Feature Fusion

HE Shan1, LIN Suzhen1, WANG Yanbo1, LI Dawei2   

  1. College of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China;
    2. College of Control Engineering, North University of China, Taiyuan 030051, Shanxi, China
  • Received: 2023-03-29  Revised: 2023-06-30  Published: 2023-09-05
  • Corresponding author: LIN Suzhen, E-mail: lsz@nuc.edu.cn
  • Fund: Postgraduate Innovation Project of Shanxi Province (2022Y630)


Abstract: This study proposes a multi-band detection image caption generation method based on feature fusion to address a common weakness of existing image caption generation methods: poor performance on nighttime scenes, scenes with occluded targets, and blurred images. The method introduces infrared detection imaging into the image captioning task and proceeds in four stages. First, multi-layer Convolutional Neural Networks (CNN) independently extract features from the visible-light and infrared images. Second, to exploit the complementarity of the different detection bands, a spatial attention module built around a multi-head attention mechanism fuses the features of the target bands. Third, a channel attention mechanism aggregates information across the spatial domain to guide the generation of different types of words. Finally, an attention enhancement module is constructed on top of the traditional additive attention mechanism; it computes correlation weight coefficients between the attention result map and the query vector to suppress the interference of irrelevant variables, yielding the final image caption. Multiple experiments on a visible-infrared image caption dataset show that the method effectively fuses the semantic features of the two bands: the Bilingual Evaluation Understudy 4 (BLEU4) and Consensus-based Image Description Evaluation (CIDEr) scores reach 58.3% and 136.1%, respectively, a significant improvement in caption accuracy. The method is therefore applicable to complex-scene tasks such as security surveillance and military reconnaissance.
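The pipeline described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names (BandFusion, EnhancedAdditiveAttention), the 512-dimensional features, the 8 attention heads, and the squeeze-and-excitation-style channel gate are all illustrative assumptions inferred only from the abstract.

# Minimal sketch of the described dual-band captioning components.
# All names and hyperparameters below are hypothetical, not the paper's code.
import torch
import torch.nn as nn


class BandFusion(nn.Module):
    """Fuse visible and infrared CNN feature maps.

    Spatial step: multi-head cross-attention lets each visible-band grid
    cell attend to infrared grid cells. Channel step: a squeeze-and-
    excitation style gate aggregates spatial information per channel.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Channel attention: global average pool -> bottleneck MLP -> sigmoid
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 16), nn.ReLU(inplace=True),
            nn.Linear(dim // 16, dim), nn.Sigmoid(),
        )

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, N, C) flattened feature maps (N = H*W grid cells)
        fused, _ = self.cross_attn(query=vis, key=ir, value=ir)
        fused = fused + vis                             # residual connection
        gate = self.channel_gate(fused.mean(dim=1))     # (B, C) channel weights
        return fused * gate.unsqueeze(1)                # re-weight channels


class EnhancedAdditiveAttention(nn.Module):
    """Additive (Bahdanau) attention with an extra enhancement gate.

    After the usual softmax-weighted context is computed, a learned
    relevance coefficient between the context and the decoder query
    down-weights context that is unrelated to the current word.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)
        self.relevance = nn.Linear(2 * dim, 1)

    def forward(self, query: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # query: (B, C) decoder hidden state; feats: (B, N, C) fused features
        e = self.score(torch.tanh(self.w_q(query).unsqueeze(1) + self.w_k(feats)))
        ctx = (torch.softmax(e, dim=1) * feats).sum(dim=1)          # (B, C)
        beta = torch.sigmoid(self.relevance(torch.cat([ctx, query], dim=-1)))
        return beta * ctx        # suppress context irrelevant to the query


if __name__ == "__main__":
    vis = torch.randn(2, 49, 512)   # e.g. a 7x7 visible-band feature grid
    ir = torch.randn(2, 49, 512)    # matching infrared feature grid
    fused = BandFusion()(vis, ir)
    word_ctx = EnhancedAdditiveAttention()(torch.randn(2, 512), fused)
    print(fused.shape, word_ctx.shape)   # (2, 49, 512) (2, 512)

In this reading, the multi-head cross-attention realizes the spatial fusion of the two bands, the channel gate selects which feature channels drive each generated word, and the sigmoid relevance coefficient plays the role of the attention enhancement module that filters out context unrelated to the current decoder query.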

Key words: image caption, image fusion, multi-band image, self-attention mechanism, combined attention
