Few-Shot Object Detection Based on Feature Fusion and Semantic Enhancement

doi:10.19678/j.issn.1000-3428.0260302

Abstract

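Abstract (translated from the Chinese): Few-Shot Object Detection (FSOD) aims to detect objects of novel classes using only a few annotated samples. Existing meta-learning-based FSOD methods improve performance through the cooperation of query and support branches, but they still face three major bottlenecks. First, fixed multi-scale feature fusion strategies ignore the differing importance of features at different resolutions and therefore struggle with multi-scale objects. Second, class-level prototypes generated by simple average pooling fail to capture complex intra-class structure and are susceptible to noise. Third, the limited semantics of the support set tend to introduce semantic bias when query features interact with prototypes, which leads to false or missed detections. To address these challenges, this paper proposes a few-shot object detection model based on Feature Fusion and Semantic Enhancement (FFSE). Built on the Meta R-CNN architecture, FFSE designs three complementary core components that improve few-shot detection performance along three dimensions: feature fusion, prototype representation, and feature modulation. First, the Dynamic Weight-based Feature Fusion (DWFF) module adaptively assigns weights to features of different scales, effectively integrating local texture details with global semantic information and significantly strengthening the model's perception of multi-scale objects. Second, to improve the quality of class-level prototypes, the Prototype Graph Network (PGN) mechanism uses the message-passing mechanism of graph neural networks to perform higher-order semantic enhancement of the prototypes. The refined prototypes produced by PGN are more discriminative and robust, and represent the feature distribution of the target classes more accurately. Finally, the Feature Modulation Driven by Support set (FMDS) module draws on the idea of feature-wise linear modulation: it first decomposes the fused query features over multiple receptive fields, and then uses the refined prototypes to generate dynamic scaling and shifting factors that modulate the query features channel-wise through an affine transformation. The scaling factors amplify target-related features, while the shifting factors pull the query feature distribution toward the semantic space of the support set, which effectively corrects the semantic bias caused by insufficient class information and enhances object saliency. The proposed FFSE is evaluated quantitatively on the PASCAL VOC and MS COCO benchmark datasets for FSOD. On PASCAL VOC, FFSE outperforms the baseline on all three novel-class splits; under the 5-shot and 10-shot settings of the three splits, its nAP50 improves on the baseline by at least 2.2%. On the more complex MS COCO dataset, its nAP improves on the baseline by at least 5%. The mean and standard deviation over multiple runs on both datasets show that, compared with other methods, FFSE improves accuracy while keeping performance fluctuation low, showing excellent robustness. In addition, a qualitative analysis on PASCAL VOC, with comparison against related methods, further shows that FFSE locates and recognizes object instances more accurately under severe occlusion, varied tiny objects, and highly similar background interference in complex scenes, substantially reducing cross-class false and missed detections. Overall, the experimental results demonstrate the effectiveness of the proposed FFSE model. Future work will explore better attention mechanisms to suppress background noise effectively at the finer-grained pixel level and further improve few-shot object detection performance.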

Abstract: Few-Shot Object Detection (FSOD) aims to detect novel objects using only a few annotated samples. Although existing meta-learning-based FSOD methods have improved performance through the collaboration of query and support branches, they still face three primary bottlenecks. First, fixed multi-scale feature fusion strategies overlook the relative importance of features at different resolutions, making it difficult to handle multi-scale objects. Second, class-level prototypes generated via simple average pooling fail to capture complex intra-class structures and are susceptible to noise. Third, the semantic scarcity of the support set leads to semantic bias during query-prototype interactions, resulting in false positives or missed detections. To address these challenges, this paper proposes a Feature Fusion and Semantic Enhancement (FFSE) model for few-shot object detection. Built upon the Meta R-CNN framework, FFSE improves detection performance through three complementary core modules along three dimensions: feature fusion, prototype representation, and feature modulation. First, the Dynamic Weight-based Feature Fusion (DWFF) module adaptively assigns weights to features of different scales, effectively integrating local textures with global semantics to strengthen the model's perception of multi-scale objects. Second, to improve class-level prototype quality, the Prototype Graph Network (PGN) mechanism is introduced. By leveraging the message-passing mechanism of graph neural networks, PGN achieves higher-order semantic enhancement, producing refined prototypes with stronger discriminative power and robustness. Finally, inspired by feature-wise linear modulation, the Feature Modulation Driven by Support set (FMDS) module decomposes the fused query features across multiple receptive fields. It then uses the refined prototypes to generate dynamic scaling and shifting factors for channel-wise affine transformations. The scaling factors amplify target-related features, while the shifting factors guide the query feature distribution toward the support semantic space, effectively correcting semantic bias and enhancing object saliency. Quantitative evaluations are conducted on the PASCAL VOC and MS COCO benchmarks. On PASCAL VOC, FFSE outperforms the baseline method across all three novel-class splits; specifically, in the 5-shot and 10-shot settings, nAP50 increases by at least 2.2%. On the more challenging MS COCO dataset, FFSE achieves at least a 5% improvement in nAP over the baseline. The mean and standard deviation over multiple experimental runs show that, compared with other methods, FFSE improves accuracy while maintaining low performance fluctuation, indicating strong robustness. Qualitative comparison with related methods on the PASCAL VOC dataset further indicates that FFSE handles heavy occlusion, diverse tiny objects, and high-similarity background interference effectively, significantly reducing cross-category false positives and missed detections. In conclusion, the experimental results validate the effectiveness of the proposed FFSE model. In future work, we will explore better attention mechanisms that suppress background noise at the finer-grained pixel level to further improve FSOD performance.
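The three ideas named in the abstract (scale weights that adapt to the input, prototype refinement by message passing, and prototype-driven channel-wise affine modulation) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the abstract does not specify layer designs, so every function name here is hypothetical, feature maps are reduced to short channel vectors, and the learned layers that would produce fusion scores, `gamma`, and `beta` are replaced by fixed placeholder formulas.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, scores):
    """Weighted sum of per-scale channel vectors, weights = softmax(scores).

    In a real model the scores would be predicted from the features;
    here they are passed in directly (assumption).
    """
    weights = softmax(scores)
    channels = len(features[0])
    return [sum(w * f[c] for w, f in zip(weights, features))
            for c in range(channels)]


def refine_prototypes(prototypes, mix=0.5):
    """One round of similarity-weighted message passing between prototypes."""
    refined = []
    for p in prototypes:
        # Edge weights: softmax over dot-product similarity to every prototype.
        attn = softmax([sum(a * b for a, b in zip(p, q)) for q in prototypes])
        # Aggregated message: attention-weighted average of all prototypes.
        message = [sum(w * q[c] for w, q in zip(attn, prototypes))
                   for c in range(len(p))]
        # Blend the original prototype with the aggregated message.
        refined.append([(1 - mix) * pc + mix * mc
                        for pc, mc in zip(p, message)])
    return refined


def scale_and_shift(prototype):
    """Hypothetical stand-in for the learned layers that emit gamma and beta."""
    gamma = [1.0 + math.tanh(v) for v in prototype]
    beta = [0.1 * v for v in prototype]
    return gamma, beta


def film_modulate(query, gamma, beta):
    """Channel-wise affine transform: gamma * x + beta."""
    return [g * x + b for x, g, b in zip(query, gamma, beta)]


if __name__ == "__main__":
    scales = [[1.0, 2.0], [3.0, 4.0]]            # two scales, two channels
    fused = fuse_scales(scales, scores=[0.0, 0.0])
    protos = refine_prototypes([[1.0, 0.0], [0.0, 1.0]])
    gamma, beta = scale_and_shift(protos[0])
    print(film_modulate(fused, gamma, beta))
```

With equal scores, `fuse_scales` reduces to the fixed averaging that the abstract argues against; unequal scores let one resolution dominate. In `film_modulate`, `gamma` rescales each channel and `beta` shifts it, which corresponds to the scaling and shifting factors described above.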

王凯源, 史彩娟, 高炜翔, 张艺琼, 张奕楠. 基于特征融合与语义增强的小样本目标检测[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0260302.

WANG Kaiyuan, SHI Caijuan, GAO Weixiang, ZHANG Yiqiong, ZHANG Yinan. Few-shot object detection based on feature fusion and semantic enhancement[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260302.

