Computer Engineering ›› 2025, Vol. 51 ›› Issue (1): 31-41. doi: 10.19678/j.issn.1000-3428.0070064

• Image Processing Based on Perceptual Information •

Co-Saliency Object Detection Enhanced by Scene Structure Knowledge

HU Shenglong, CHEN Bin, ZHANG Kaihua, SONG Huihui*

  1. School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
  • Received: 2024-07-01  Online: 2025-01-15  Published: 2025-01-18
  • Corresponding author: SONG Huihui
  • Supported by:
    National Natural Science Foundation of China (62276141); 2024 Jiangsu Provincial Postgraduate Research and Innovation Program (KYCX24_1508)

Abstract:

Existing Co-Saliency Object Detection (CoSOD) methods learn discriminative representations by mining intra-group consistency and inter-group separability. However, this paradigm is constrained by the lack of semantic label guidance, which limits the discriminative capacity of the learned representations and makes it difficult to handle interference from complex non-co-salient objects. To address this issue, this paper proposes SSKNet, a novel CoSOD model enhanced by scene structure knowledge. SSKNet leverages the large model mPLUG to construct scene structural-semantic relationships among objects and uses the Segment Anything Model (SAM) to transfer these relationships into the final co-saliency results. Specifically, to acquire semantic knowledge, SSKNet first introduces a large scene-understanding (image captioning) model to interpret the images in a group, producing a set of text descriptions that represent structural semantics by describing the salient content of each image in textual form. Next, to obtain co-saliency information, a Common Prompt Extract (CoPE) module is designed, which applies a co-attention mechanism to the group of text descriptions to extract the co-salient text. Finally, to convert the co-salient text into co-saliency masks, the co-salient text is fed to SAM as a textual prompt that guides the segmentation of the co-salient objects, yielding the final co-saliency detection masks. On three public datasets, CoSal2015, CoCA, and CoSOD3k, SSKNet attains Fβ scores of 0.910, 0.750, and 0.887, respectively, demonstrating an advanced level of performance.
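
To make the pipeline concrete, the following minimal PyTorch sketch mirrors the three stages described above: caption each image in the group, pool a group-level co-salient prompt via a toy co-attention over the caption embeddings (a stand-in for the CoPE module, whose exact design the abstract does not specify), and hand that prompt to a text-promptable segmenter (SAM's role here). The callables caption_image, encode_text, and segment_with_text are hypothetical placeholder interfaces, not the authors' implementation; note that SAM's public release accepts point/box/mask prompts, so a text-to-prompt wrapper (e.g., a grounding model that turns the text into boxes) is assumed.

    # Illustrative sketch only: assumed interfaces, not the authors' code.
    import torch
    import torch.nn.functional as F

    def cope_co_attention(text_embs: torch.Tensor) -> torch.Tensor:
        """Toy stand-in for the CoPE module.

        Weighs each caption embedding by its mean similarity to the other
        captions in the group, so the description shared across the group
        (the co-salient content) dominates the pooled prompt embedding.

        text_embs: (N, D) tensor, one embedding per caption in the image group.
        """
        normed = F.normalize(text_embs, dim=-1)
        sims = normed @ normed.T                      # (N, N) pairwise cosine similarity
        sims.fill_diagonal_(0.0)                      # ignore self-similarity
        weights = F.softmax(sims.mean(dim=1), dim=0)  # captions agreeing with the group score higher
        return (weights.unsqueeze(1) * text_embs).sum(dim=0)  # (D,) group-level co-salient prompt

    def ssknet_pipeline(images, caption_image, encode_text, segment_with_text):
        """End-to-end flow: caption -> co-salient prompt -> prompted segmentation."""
        captions = [caption_image(img) for img in images]       # stage 1: scene descriptions (mPLUG's role)
        embs = torch.stack([encode_text(c) for c in captions])  # (N, D) caption embeddings
        prompt = cope_co_attention(embs)                        # stage 2: co-salient prompt (CoPE's role)
        return [segment_with_text(img, prompt) for img in images]  # stage 3: per-image masks (SAM's role)

Since the abstract does not detail CoPE's internals, the similarity-weighted pooling above is only one plausible reading of "co-attention over a group of text descriptions"; the group-level prompt it produces is what the final stage converts into per-image co-saliency masks.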

Key words: scene structure knowledge, large model, Segment Anything Model (SAM), Co-Saliency Object Detection (CoSOD), deep learning