1. GHANDI T, POURREZA H, MAHYAR H. Deep learning approaches on image captioning: a review. ACM Computing Surveys, 2024, 56(3): 1-39. doi: 10.1145/3617592
2. LI Y P, ZHANG X R, CHENG X N, et al. Learning consensus-aware semantic knowledge for remote sensing image captioning. Pattern Recognition, 2024, 145: 109893. doi: 10.1016/j.patcog.2023.109893
3. SHI Y L, YANG W Z, DU H X, et al. Overview of image captions based on deep learning. Acta Electronica Sinica, 2021, 49(10): 2048-2060 (in Chinese). doi: 10.12263/DZXB.20200669
4. STEFANINI M, CORNIA M, BARALDI L, et al. From show to tell: a survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 539-559. doi: 10.1109/TPAMI.2022.3148210
5. WANG J, XU W, WANG Q, et al. On distinctive image captioning via comparing and reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(2): 2088-2103. doi: 10.1109/TPAMI.2022.3159811
6. YANG X, ZHANG H, CAI J. Deconfounded image captioning: a causal retrospect. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12996-13010. doi: 10.1109/TPAMI.2021.3121705
7.
8. GUO L T, LIU J, ZHU X X, et al. Normalized and geometry-aware self-attention network for image captioning[EB/OL]. [2023-12-05]. https://arxiv.org/abs/2003.08897.
9.
10.
11.
12. JI J Y, LUO Y P, SUN X S, et al. Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1655-1663. doi: 10.1609/aaai.v35i2.16258
13. LUO Y P, JI J Y, SUN X S, et al. Dual-level collaborative transformer for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2286-2293. doi: 10.1609/aaai.v35i3.16328
14. LI Z X, WEI H Y, HUANG F C, et al. Combine visual features and scene semantics for image captioning. Chinese Journal of Computers, 2020, 43(9): 1624-1640 (in Chinese).
15. ZHOU D M, ZHANG C L, LI Z X, et al. Image captioning model based on multi-level visual fusion. Acta Electronica Sinica, 2021, 49(7): 1286-1290 (in Chinese).
16. LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention. Journal of Software, 2022, 33(9): 3210-3222 (in Chinese). doi: 10.13328/j.cnki.jos.006623
17. SONG J K, ZENG P P, GU J Y, et al. End-to-end image captioning via visual region aggregation and dual-level collaboration. Journal of Software, 2023, 34(5): 2152-2169 (in Chinese). doi: 10.13328/j.cnki.jos.006773
18.
19. CHENG B W, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2022: 1290-1299.
20.
21. WU M R, ZHANG X Y, SUN X S, et al. DIFNet: boosting visual information flow for image captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2022: 18020-18029.
22. LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2021: 9992-10002.
23. KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems, 2014, 3: 1889-1897. doi: 10.5555/2969033.2969038
24. VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2015: 4566-4575.
25. PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. [S. l.]: ACL, 2002: 311-318.
26.
27.
28. ZHANG X Y, SUN X S, LUO Y P, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2021: 15465-15474.
29. WANG Y Y, XU J G, SUN Y F. End-to-end transformer based model for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 2585-2594. doi: 10.1609/aaai.v36i3.20160
30. LI Y N, MA Y W, ZHOU Y Y, et al. Semantic-guided selective representation for image captioning. IEEE Access, 2023, 11: 14500-14510. doi: 10.1109/ACCESS.2023.3243952