计算机工程 (Computer Engineering)

基于特征融合与语义增强的小样本目标检测

  • 发布日期:2026-06-17

Few-shot object detection based on feature fusion and semantic enhancement

  • Published: 2026-06-17

摘要: 小样本目标检测(Few-Shot Object Detection, FSOD)旨在利用少量标注样本检测新类目标。现有基于元学习的FSOD方法虽通过查询与支持分支协同提升了性能,但仍面临三大瓶颈:一是固定的多尺度特征融合策略忽略了不同分辨率特征间的重要性差异,难以应对多尺度目标;二是基于简单平均池化的类级原型生成方式难以捕捉类内复杂结构,且易受噪声干扰;三是支持集语义匮乏导致查询特征与原型交互时易产生语义偏差,进而引发误检或漏检。针对上述挑战,本文提出了一种基于特征融合与语义增强(Feature Fusion and Semantic Enhancement, FFSE)的小样本目标检测模型。FFSE模型以Meta R-CNN为基础架构,通过设计三个协同互补的核心功能组件,从特征融合、原型表征及特征调制三个维度对小样本目标检测性能进行提升。首先,动态权重特征融合(Dynamic Weight based Feature Fusion, DWFF)模块通过自适应地为不同尺度特征分配权重,有效整合了局部纹理细节与全局语义信息,显著增强了模型对多尺度目标的感知能力。其次,原型图神经网络(Prototype Graph Network, PGN)机制为提升类级原型的质量,利用图神经网络的消息传递机制,实现了对原型的高阶语义增强。经PGN机制处理后的精炼原型具有更强的判别力和鲁棒性,能够更准确地代表目标类别的特征分布。最后,支持集驱动的特征调制(Feature Modulation Driven by Support set, FMDS)模块借鉴特征线性调制的思想,首先在内部对融合后的查询特征进行了多感受野分解,随后,利用精炼原型驱动生成动态缩放因子和偏移因子,通过仿射变换对查询特征进行通道级调制。缩放因子负责放大目标相关特征,而偏移因子则引导查询特征分布向支持集语义空间靠拢,从而有效校正了因类别信息不足引起的语义偏差,增强了目标的显著性。首先,所提方法FFSE在FSOD领域的PASCAL VOC和MS COCO基准数据集上进行了定量评估。在PASCAL VOC数据集上,FFSE在新类三种不同划分下的表现均优于基线方法,在新类三种不同划分的5-shot和10-shot设置下,FFSE模型的nAP50较基线方法提升了至少2.2%;在更复杂的MS COCO数据集上,FFSE模型的nAP较基线方法提升了至少5%;在两个数据集上运行多次实验的均值与标准差,与其他方法相比,所提FFSE模型能够在提升精度的同时,保持了较低的性能波动,表现出优异的鲁棒性。另外,对所提方法FFSE在PASCAL VOC数据集上进行了定性分析,并与其他相关方法进行了比较,实验结果进一步表明FFSE模型在面对复杂场景中的严重遮挡、多变微小目标以及高相似度背景干扰时,能够更准确地锁定并识别目标实例,大幅降低了跨类别的误检与漏检。综上,实验结果表明了所提FFSE模型的有效性。未来,研究工作将致力于探索更好的注意力机制,从更细粒度的像素层级有效抑制背景噪声的干扰,进一步提升小样本目标检测性能。

Abstract: Few-Shot Object Detection (FSOD) aims to detect objects of novel classes using only a few annotated samples. Although existing meta-learning-based FSOD methods have improved performance through the collaboration of query and support branches, they still face three main bottlenecks. First, fixed multi-scale feature fusion strategies ignore the differing importance of features at different resolutions, making it difficult to handle multi-scale objects. Second, class-level prototypes generated by simple average pooling fail to capture complex intra-class structure and are susceptible to noise. Third, the limited semantics of the support set lead to semantic bias when query features interact with prototypes, resulting in false positives or missed detections. To address these challenges, this paper proposes a Feature Fusion and Semantic Enhancement (FFSE) model for few-shot object detection. Built on the Meta R-CNN architecture, FFSE improves detection performance along three dimensions (feature fusion, prototype representation, and feature modulation) through three complementary core modules. First, the Dynamic Weight-based Feature Fusion (DWFF) module adaptively assigns weights to features of different scales, integrating local texture details with global semantic information and strengthening the model's perception of multi-scale objects. Second, the Prototype Graph Network (PGN) mechanism improves the quality of class-level prototypes: using the message-passing mechanism of graph neural networks, it performs higher-order semantic enhancement of the prototypes, producing refined prototypes that are more discriminative and robust and that represent the feature distribution of the target class more accurately. Finally, the Feature Modulation Driven by Support set (FMDS) module, inspired by feature-wise linear modulation, first decomposes the fused query features across multiple receptive fields and then uses the refined prototypes to generate dynamic scaling and shifting factors, which modulate the query features channel-wise through an affine transformation. The scaling factors amplify target-related features, while the shifting factors pull the query feature distribution toward the semantic space of the support set, correcting the semantic bias caused by insufficient class information and enhancing object saliency. FFSE is evaluated quantitatively on the PASCAL VOC and MS COCO benchmarks. On PASCAL VOC, FFSE outperforms the baseline on all three novel-class splits; in the 5-shot and 10-shot settings, nAP50 improves by at least 2.2% over the baseline. On the more complex MS COCO dataset, nAP improves by at least 5% over the baseline. The mean and standard deviation over multiple runs on both datasets show that, compared with other methods, FFSE improves accuracy while keeping performance fluctuation low, indicating good robustness. A qualitative comparison with related methods on PASCAL VOC further indicates that FFSE locates and recognizes object instances more accurately under heavy occlusion, small objects of varying appearance, and highly similar background interference, substantially reducing cross-category false positives and missed detections. Overall, the experimental results validate the effectiveness of the proposed FFSE model. Future work will explore better attention mechanisms that suppress background noise at the finer-grained pixel level to further improve few-shot object detection performance.
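As an illustration only, two ideas named in the abstract can be sketched in a few lines of plain Python: adaptive weighting of multi-scale features and channel-wise affine modulation of query features by factors derived from a prototype. This is not the authors' implementation. The function names, the softmax normalization of per-scale scores, and the linear projections that produce the scaling and shifting factors are all assumptions made for this sketch; the abstract does not specify how DWFF computes its weights or how FMDS derives its factors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, scale_logits):
    """Fuse per-scale feature vectors with softmax-normalized weights.

    features: one equal-length vector per scale (assumed already resized).
    scale_logits: one score per scale; a hypothetical stand-in for
    whatever learned weights the DWFF module actually uses.
    """
    weights = softmax(scale_logits)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def linear(vector, weight, bias):
    """Plain affine map: out[c] = sum_k weight[c][k] * vector[k] + bias[c]."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def film_modulate(query, gamma, beta):
    """Channel-wise affine modulation: out[c][i] = gamma[c] * query[c][i] + beta[c]."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(query, gamma, beta)]


def modulate_with_prototype(query, prototype, w_gamma, b_gamma, w_beta, b_beta):
    """Derive gamma (scale) and beta (shift) from a prototype, then modulate.

    The two linear projections are an assumption of this sketch, not a
    detail taken from the paper.
    """
    gamma = linear(prototype, w_gamma, b_gamma)
    beta = linear(prototype, w_beta, b_beta)
    return film_modulate(query, gamma, beta)
```

With equal logits, `fuse_scales` reduces to a plain average of the scales, which corresponds to the fixed fusion the abstract criticizes; unequal logits let one resolution dominate. In `film_modulate`, a `gamma` entry above 1 amplifies a channel and the matching `beta` entry shifts it, mirroring the roles the abstract assigns to the scaling and shifting factors.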