
计算机工程 (Computer Engineering)



RGB-T Fusion Network for Semantic Segmentation in Low-Light Scenarios

  • Published: 2025-09-29


Abstract: RGB-T (RGB-Thermal) semantic segmentation is a solution for reliable semantic scene understanding under poor illumination or in complete darkness. By capturing the infrared radiation emitted by objects, thermal imaging preserves stable edge information in low-light conditions and thereby compensates for the texture details that RGB images lose in such environments. However, existing RGB-T semantic segmentation methods fail to fully exploit the complementary information between modalities during multi-level feature interaction, which leads to inaccurate predictions. To address this issue, this work constructs CMFANet (Cross-Modal Fusion Attention Network). First, a cross-modal fusion module is designed to establish complementary relationships between RGB and thermal features. Second, considering the importance of multi-dimensional and multi-scale information, a multi-dimensional attention module is introduced in the encoder to strengthen deep feature extraction, and a multi-scale feature aggregation module is added in the decoder to help the model capture texture details and contour information. Finally, the decoder combines wavelet transforms with convolution so that their strengths complement each other, improving segmentation accuracy. On the MFNet dataset, CMFANet achieves 73.8% mean accuracy (mAcc) and 59.0% mean intersection-over-union (mIoU); on the PST900 dataset, it attains 90.71% mAcc and 85.15% mIoU. Compared with existing state-of-the-art methods, the model performs particularly well on key targets, such as cars, persons, and bikes in MFNet and survivors and backpacks in PST900. Visualization results confirm that it effectively fuses RGB and thermal modality information, recovers texture details and object contours in low-light scenes, and exhibits better segmentation performance and good generalization ability.
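
The cross-modal fusion module described above is meant to establish complementary relationships between RGB and thermal features. The paper's exact design is not reproduced here, so the following is only a minimal PyTorch-style sketch under an assumed structure: a hypothetical CrossModalFusion block in which each modality is re-weighted by channel attention computed from the other modality before the two streams are merged.

```python
# Minimal illustrative sketch of RGB-T cross-modal fusion. This is NOT the
# paper's CMFANet module; the structure (per-modality channel attention,
# cross-weighting, summation, 1x1 mixing) is an assumption made for illustration.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):  # hypothetical name
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()

        def gate() -> nn.Sequential:
            # Squeeze-and-excite style channel attention: global pool + small MLP.
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        self.rgb_gate = gate()       # attention computed from the RGB stream
        self.thermal_gate = gate()   # attention computed from the thermal stream
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, thermal_feat: torch.Tensor) -> torch.Tensor:
        # Cross-weighting: each modality is emphasized by cues from the other, so
        # thermal evidence can highlight RGB channels degraded by low light.
        rgb_enhanced = rgb_feat * self.thermal_gate(thermal_feat)
        thermal_enhanced = thermal_feat * self.rgb_gate(rgb_feat)
        return self.mix(rgb_enhanced + thermal_enhanced)


if __name__ == "__main__":
    fuse = CrossModalFusion(channels=64)
    rgb = torch.randn(1, 64, 60, 80)      # encoder feature map from the RGB branch
    thermal = torch.randn(1, 64, 60, 80)  # encoder feature map from the thermal branch
    print(fuse(rgb, thermal).shape)       # torch.Size([1, 64, 60, 80])
```

In a two-stream encoder, such a block would typically be applied at several stages, which matches the abstract's emphasis on multi-level interaction between the modalities.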
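
The abstract also states that the decoder combines wavelet transforms with convolution so that their strengths complement each other. As a rough illustration of the property such a design exploits, the sketch below (a hypothetical haar_dwt2 helper, not the paper's decoder) performs a single-level 2D Haar decomposition that splits a feature map into a low-frequency approximation and three high-frequency detail bands carrying edge and contour information.

```python
# Single-level 2D Haar wavelet decomposition, shown only to illustrate how a
# wavelet transform separates coarse structure from edge details; the actual
# wavelet/convolution decoder of the paper is not reproduced here.
import torch


def haar_dwt2(x: torch.Tensor):
    """x: (N, C, H, W) with even H and W; returns (LL, LH, HL, HH), each (N, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation (coarse structure)
    lh = (a - b + c - d) / 2  # detail across columns (vertical edges)
    hl = (a + b - c - d) / 2  # detail across rows (horizontal edges)
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh


if __name__ == "__main__":
    feat = torch.randn(1, 64, 120, 160)
    ll, lh, hl, hh = haar_dwt2(feat)
    print(ll.shape)  # torch.Size([1, 64, 60, 80])
```

Feeding the high-frequency bands to convolutional layers alongside the low-frequency band is one common way to let a decoder recover sharp contours; this is consistent with, though not necessarily identical to, the design described in the abstract.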
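
The reported results use mean accuracy (mAcc) and mean intersection-over-union (mIoU). These are standard semantic-segmentation metrics; the sketch below shows one common way to compute them from a class confusion matrix (the function name and the toy matrix are illustrative, not taken from the paper).

```python
# Standard mAcc / mIoU computation from a pixel-level confusion matrix.
import numpy as np


def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels whose ground-truth class is i and predicted class is j."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1).astype(float)    # ground-truth pixels per class
    pred = conf.sum(axis=0).astype(float)  # predicted pixels per class
    acc = tp / np.maximum(gt, 1)           # per-class accuracy (recall)
    iou = tp / np.maximum(gt + pred - tp, 1)
    return acc.mean(), iou.mean()          # mAcc, mIoU


if __name__ == "__main__":
    # Toy 3-class confusion matrix used only to exercise the formulas.
    conf = np.array([[50, 3, 2],
                     [4, 40, 6],
                     [1, 5, 44]])
    macc, miou = segmentation_metrics(conf)
    print(f"mAcc={macc:.3f}, mIoU={miou:.3f}")
```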