Optimization-Enhanced Interactive Scene-Level Image Retrieval System with Text and Sketch

doi:10.19678/j.issn.1000-3428.0260066

Abstract

Abstract: Interactive image retrieval departs from the static single-query, single-response paradigm by recasting retrieval as a multi-turn human-computer dialogue in which users guide and refine their search intent based on preliminary results. Text and sketch are two intuitive, complementary query modalities that are well suited to scene-level image retrieval because together they can express complex visual requirements. However, the interaction mechanisms of existing methods mostly rest on a simple latest-is-best assumption: they cannot select and maintain the best historical state, which leaves the retrieval process vulnerable to noise and unstable. Their evaluation metrics also tend to check only whether the target is retrieved in some round, overlooking the noisy user feedback, continually evolving intent, and unstable retrieval results found in real interactions. In addition, sketches are highly abstract and vary with how each user draws, and existing static retrieval models cannot effectively refine incomplete or ambiguous initial inputs through interaction, which limits their practicality and robustness. To address these issues, this paper proposes IScene, an interactive scene-level image retrieval framework based on text and sketch. The framework comprises three core modules (dialogue rewriting, similarity optimization selection, and visual extension) that together form a retrieval pipeline which progressively refines semantics, keeps discriminative power stable, and enhances visual expression. To support research on interactive retrieval, the paper also constructs the first multi-turn dialogue dataset for this task. Experimental results show that IScene significantly outperforms existing baseline methods in retrieval accuracy and stability on multiple datasets, offering an effective route toward more natural and robust interactive scene retrieval.
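The abstract contrasts metrics that only ask whether the target was retrieved in some round with the question of whether the result stays stable afterwards. The following is a minimal illustrative sketch of that contrast, not the metric used in the paper: the function names, the retention definition, and the per-round rank lists are all hypothetical.

```python
from typing import Sequence


def hit_in_any_round(ranks: Sequence[int], k: int) -> bool:
    """True if the target's rank is within the top k in at least one round."""
    return any(r <= k for r in ranks)


def retention_after_first_hit(ranks: Sequence[int], k: int) -> float:
    """Fraction of rounds, from the first hit onward, in which the target
    stays within the top k. Returns 0.0 if the target never reaches the top k.

    This is an assumed, illustrative stability measure.
    """
    for i, r in enumerate(ranks):
        if r <= k:
            tail = ranks[i:]
            return sum(1 for t in tail if t <= k) / len(tail)
    return 0.0


# Hypothetical 1-based ranks of the target image over five dialogue rounds.
unstable = [40, 8, 25, 9, 30]  # enters the top 10, then repeatedly drops out
stable = [40, 8, 6, 3, 1]      # enters the top 10 and stays there

for name, ranks in (("unstable", unstable), ("stable", stable)):
    print(name, hit_in_any_round(ranks, 10), retention_after_first_hit(ranks, 10))
```

Both sessions count as a success under the any-round criterion, yet only the second keeps the target in the top 10 once it has been found, which is the kind of difference a stability-aware evaluation is meant to expose.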

Long Haiqing, Li Mao. Optimization-Enhanced Interactive Scene-Level Image Retrieval System with Text and Sketch[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260066.
