
Computer Engineering, 2025, Vol. 51, Issue (10): 18-26. doi: 10.19678/j.issn.1000-3428.0070569

• Research Hotspots and Reviews •

HDMapFusion: High-Definition Map Generation with Multi-Modality Fusion for Autonomous Driving(Invited)

LIU Yanghong1,2,3,4, FU Yangyouran1,2,3,4, DONG Xingping1,2,3,4,*

  1. School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
    2. National Engineering Research Center for Multimedia Software, Wuhan 430072, Hubei, China
    3. Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan 430072, Hubei, China
    4. School of Artificial Intelligence, Wuhan University, Wuhan 430072, Hubei, China
  • Received: 2024-11-01  Revised: 2025-02-18  Online: 2025-10-15  Published: 2025-04-24
  • Contact: DONG Xingping

  • Supported by:
    National Natural Science Foundation of China (62471342); Fundamental Research Funds for the Central Universities (2042024kf0036); Science and Technology Development Fund of Macao (001/2024/SKL); Open Research Project of the State Key Laboratory of Internet of Things for Smart City (University of Macau) (SKL-IoTSC(UM)-2024-2026/ORP/GA04/2023)

Abstract:

The generation of High-Definition (HD) environmental semantic maps is indispensable for environmental perception and decision-making in autonomous driving systems. To address the modality discrepancy between cameras and LiDAR in perception tasks, this paper proposes a multimodal fusion framework, HDMapFusion, which improves semantic map generation accuracy through feature-level fusion. Unlike traditional methods that directly fuse raw sensor data, the proposed approach transforms both camera image features and LiDAR point cloud features into a unified Bird's-Eye-View (BEV) representation, enabling physically interpretable fusion of multimodal information within a consistent geometric coordinate system. Specifically, the method first extracts visual features from camera images and 3D structural features from LiDAR point clouds using deep learning networks. A differentiable perspective transformation module then converts the front-view image features into BEV space, while the LiDAR point cloud features are projected into the same BEV space through voxelization. Building on this, an attention-based feature fusion module adaptively integrates the two modalities through weighted aggregation. Finally, a semantic decoder generates high-precision semantic maps containing lane lines, pedestrian crossings, road boundaries, and other key elements. Systematic experiments on the nuScenes benchmark demonstrate that HDMapFusion significantly outperforms existing baseline methods in HD map generation accuracy. These results validate the effectiveness and superiority of the proposed method and offer a new solution to multimodal fusion in autonomous driving perception.
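The attention-based fusion step described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the module name AttentionBEVFusion, the tensor shapes, and the choice of a per-cell softmax gate over the two modalities are illustrative assumptions. The sketch only assumes what the abstract states, namely that camera and LiDAR features have already been transformed into the same BEV grid (by a differentiable view transform and by voxelization, respectively).

    # Minimal sketch of attention-weighted BEV fusion (hypothetical names/shapes).
    import torch
    import torch.nn as nn

    class AttentionBEVFusion(nn.Module):
        """Adaptively fuses camera-BEV and LiDAR-BEV feature maps.

        Both inputs are assumed to already live in the same BEV grid,
        e.g. produced by a lift-splat-style view transform for images
        and by voxelization for point clouds.
        """

        def __init__(self, cam_channels: int, lidar_channels: int, out_channels: int):
            super().__init__()
            # Project both modalities to a common channel width.
            self.cam_proj = nn.Conv2d(cam_channels, out_channels, kernel_size=1)
            self.lidar_proj = nn.Conv2d(lidar_channels, out_channels, kernel_size=1)
            # Predict per-cell fusion weights from the concatenated features.
            self.attn = nn.Sequential(
                nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_channels, 2, kernel_size=1),
                nn.Softmax(dim=1),  # weights for the two modalities sum to 1
            )

        def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
            # cam_bev:   (B, C_cam,   H, W) camera features in BEV
            # lidar_bev: (B, C_lidar, H, W) LiDAR features in the same BEV grid
            cam = self.cam_proj(cam_bev)
            lidar = self.lidar_proj(lidar_bev)
            w = self.attn(torch.cat([cam, lidar], dim=1))  # (B, 2, H, W)
            # Weighted aggregation: per-cell convex combination of the modalities.
            return w[:, 0:1] * cam + w[:, 1:2] * lidar

    if __name__ == "__main__":
        fusion = AttentionBEVFusion(cam_channels=64, lidar_channels=128, out_channels=128)
        cam_bev = torch.randn(2, 64, 200, 200)
        lidar_bev = torch.randn(2, 128, 200, 200)
        print(fusion(cam_bev, lidar_bev).shape)  # torch.Size([2, 128, 200, 200])

The fused map would then feed the semantic decoder that rasterizes lane lines, pedestrian crossings, and road boundaries. The 200x200 toy grid is likewise an assumption: at a 0.5 m cell size it would cover a 100 m x 100 m area around the ego vehicle, a common BEV extent in nuScenes-based map generation work.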

Key words: high-definition map generation, multi-modality fusion, Bird's-Eye-View (BEV) representation, autonomous driving, depth estimation
