
Computer Engineering (计算机工程)


Optimization-Enhanced Interactive Scene-Level Image Retrieval System with Text and Sketch

  • Published: 2026-04-02

Abstract: Interactive image retrieval replaces the traditional static paradigm of a single query followed by a single result list with a multi-turn, iterative human-computer dialogue, allowing users to guide and refine their retrieval intent based on preliminary results. Text and sketch are two intuitive, complementary query modalities that are well suited to scene-level image retrieval because together they can express complex visual requirements. However, the interaction mechanisms of existing methods mostly rest on a simple "latest is best" assumption: they cannot select and retain the best historical state, which leaves the retrieval process susceptible to noise and unstable. Their evaluation metrics also tend to consider only whether the target is retrieved in some round, overlooking noisy user feedback, continuously evolving intent, and unstable retrieval results in real interaction. Moreover, sketches are highly abstract and vary with how each user draws, and existing static retrieval models cannot refine an incomplete or ambiguous initial input through interaction, which limits their practicality and robustness. To address these issues, this paper proposes IScene, an interactive scene-level image retrieval framework based on text and sketch. The framework comprises three core modules (dialogue rewriting, similarity optimization selection, and visual extension), which together form a retrieval pipeline that progressively refines semantics, keeps discriminative power stable, and enhances visual expression. To support research on interactive retrieval, the paper also constructs the first multi-turn dialogue dataset for this task. Experimental results show that IScene significantly outperforms existing baseline methods in retrieval accuracy and stability on multiple datasets, offering an effective approach to more natural and robust interactive scene retrieval.
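The contrast between a "latest is best" policy and one that selects and retains the best historical state can be illustrated with a minimal sketch. It is not the IScene implementation, which the abstract does not specify. The selection criterion (the gap between the top-1 and top-2 similarity, used as a stand-in for discriminative power), the function names `rank`, `margin`, and `select_best_state`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, gallery):
    """Return (image_id, similarity) pairs sorted by descending similarity."""
    scored = [(img_id, cosine(query, emb)) for img_id, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def margin(ranking):
    """Hypothetical confidence proxy: top-1 minus top-2 similarity."""
    if len(ranking) < 2:
        return 0.0
    return ranking[0][1] - ranking[1][1]


def select_best_state(round_queries, gallery):
    """Keep the query state with the largest margin seen so far.

    A "latest is best" policy would adopt every new round's query.
    Here a new round replaces the retained state only if its ranking
    is at least as discriminative (assumed criterion, not from the paper).
    Returns the retained query, its ranking, and the top-1 id per round.
    """
    best_query, best_ranking = None, None
    top1_history = []
    for query in round_queries:
        candidate = rank(query, gallery)
        if best_ranking is None or margin(candidate) >= margin(best_ranking):
            best_query, best_ranking = query, candidate
        top1_history.append(best_ranking[0][0])
    return best_query, best_ranking, top1_history
```

In this sketch, a noisy intermediate round that blurs the query toward a different image is ignored, because its margin is smaller than that of the retained state, whereas a "latest is best" policy would follow it. A learned or calibrated score could replace `margin` without changing the surrounding loop.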