
Computer Engineering ›› 2024, Vol. 50 ›› Issue (11): 308-317. doi: 10.19678/j.issn.1000-3428.0069303

• Graphics and Image Processing •

Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion

LIU Mingming1,2,*, LU Jinfu2, LIU Hao2, ZHANG Haiyan1

  1. School of Intelligent Manufacturing, Jiangsu Vocational Institute of Architectural Technology, Xuzhou 221116, Jiangsu, China
  2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
  • Received: 2024-01-26  Online: 2024-11-15  Published: 2024-06-20
  • Corresponding author: LIU Mingming
  • Supported by: National Natural Science Foundation of China (61801198); Natural Science Foundation of Jiangsu Province (BK20180174)

Abstract:

Due to their powerful sequence modeling capabilities, Transformer-based image captioning models have demonstrated remarkable performance. However, most of these models rely on region visual features for encoding and decoding, which prevents them from fully exploiting the fine-grained information of the whole image and leads to visual feature confusion. Accordingly, we introduce panoptic segmentation into the Transformer-based image captioning pipeline, replacing region visual features with mask visual features, and propose a novel image captioning model based on multi-visual-feature fusion. Our model not only disentangles the visual representations effectively but also exploits both mask and grid visual features to improve captioning performance. Quantitative and qualitative experiments on the MSCOCO dataset demonstrate that our method significantly outperforms existing Transformer-based image captioning models while enhancing the interpretability of the caption generation process, achieving CIDEr and BLEU-4 scores of 138.5 and 41, respectively.
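
To make the core idea of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how mask visual features could be pooled from a panoptic segmentation map and fused with grid features before a Transformer encoder. The names (mask_pool, MultiVisualFeatureEncoder), the backbone feature dimension, and the encoder configuration are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only: mask-pooled panoptic features + grid features
# fed jointly to a Transformer encoder, as described in the abstract.
import torch
import torch.nn as nn

def mask_pool(feat_map: torch.Tensor, panoptic_ids: torch.Tensor) -> torch.Tensor:
    """Average-pool the backbone feature map inside each panoptic segment.

    feat_map:     (C, H, W) backbone feature map
    panoptic_ids: (H, W) integer segment IDs from a panoptic segmentation model
    returns:      (num_segments, C) one pooled "mask visual feature" per segment
    """
    pooled = []
    for sid in panoptic_ids.unique():
        mask = (panoptic_ids == sid).float()              # (H, W) binary segment mask
        area = mask.sum().clamp(min=1.0)
        pooled.append((feat_map * mask).sum(dim=(1, 2)) / area)
    return torch.stack(pooled, dim=0)

class MultiVisualFeatureEncoder(nn.Module):
    """Encodes mask tokens and grid tokens together with a shared Transformer."""
    def __init__(self, feat_dim: int = 2048, d_model: int = 512, n_layers: int = 3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feat_map: torch.Tensor, panoptic_ids: torch.Tensor) -> torch.Tensor:
        mask_tokens = self.proj(mask_pool(feat_map, panoptic_ids))    # (S, d_model)
        grid_tokens = self.proj(feat_map.flatten(1).transpose(0, 1))  # (H*W, d_model)
        tokens = torch.cat([mask_tokens, grid_tokens], dim=0).unsqueeze(0)
        return self.encoder(tokens)  # contextualized visual tokens for a caption decoder

In this sketch, each panoptic segment contributes one decoupled token (avoiding the overlap-induced confusion of region boxes), while the flattened grid tokens retain dense, fine-grained context; a caption decoder would attend over both sets of tokens.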

Key words: image understanding, image description generation, panoptic segmentation, feature fusion, visual encoding