
Computer Engineering, 2025, Vol. 51, Issue (10): 18-26. doi: 10.19678/j.issn.1000-3428.0070569

• Research Hotspots and Reviews •

HDMapFusion: High-Definition Map Generation with Multi-Modality Fusion for Autonomous Driving(Invited)

LIU Yanghong1,2,3,4, FU Yangyouran1,2,3,4, DONG Xingping1,2,3,4,*

  1. School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
    2. National Engineering Research Center for Multimedia Software, Wuhan 430072, Hubei, China
    3. Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan 430072, Hubei, China
    4. School of Artificial Intelligence, Wuhan University, Wuhan 430072, Hubei, China
  • Received: 2024-11-01  Revised: 2025-02-18  Online: 2025-10-15  Published: 2025-04-24
  • Contact: DONG Xingping

  • Supported by:
    National Natural Science Foundation of China (62471342); Fundamental Research Funds for the Central Universities (2042024kf0036); Science and Technology Development Fund of Macao (001/2024/SKL); Open Research Project of the State Key Laboratory of Internet of Things for Smart City (University of Macau) (SKL-IoTSC(UM)-2024-2026/ORP/GA04/2023)

Abstract:

The generation of High-Definition (HD) environmental semantic maps is indispensable for environmental perception and decision-making in autonomous driving systems. To address the modality discrepancy between cameras and LiDAR in perception tasks, this paper proposes a multimodal fusion framework, HDMapFusion, which improves semantic map generation accuracy through feature-level fusion. Unlike traditional methods that directly fuse raw sensor data, the proposed approach transforms both camera image features and LiDAR point cloud features into a unified Bird's-Eye-View (BEV) representation, enabling physically interpretable fusion of multimodal information within a consistent geometric coordinate system. Specifically, the method first extracts visual features from camera images and 3D structural features from LiDAR point clouds using deep learning networks. A differentiable perspective transformation module then converts the front-view image features into BEV space, while the LiDAR point cloud features are projected into the same BEV space through voxelization. Building on this, an attention-based feature fusion module adaptively integrates the two modalities through weighted aggregation. Finally, a semantic decoder generates high-precision semantic maps containing lane lines, pedestrian crossings, road boundaries, and other key elements. Systematic experiments on the nuScenes benchmark demonstrate that HDMapFusion significantly outperforms existing baseline methods in HD map generation accuracy. These results validate the effectiveness and superiority of the proposed method and offer a new solution to multimodal fusion in autonomous driving perception.
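The attention-based fusion step described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the module name AttentionBEVFusion, the tensor shapes, and the choice of a per-cell softmax gate over the two modalities are illustrative assumptions. The sketch only assumes what the abstract states, namely that camera and LiDAR features have already been transformed into the same BEV grid (by a differentiable view transform and by voxelization, respectively).

    # Minimal sketch of attention-weighted BEV fusion (hypothetical names/shapes).
    import torch
    import torch.nn as nn

    class AttentionBEVFusion(nn.Module):
        """Adaptively fuses camera-BEV and LiDAR-BEV feature maps.

        Both inputs are assumed to already live in the same BEV grid,
        e.g. produced by a lift-splat-style view transform for images
        and by voxelization for point clouds.
        """

        def __init__(self, cam_channels: int, lidar_channels: int, out_channels: int):
            super().__init__()
            # Project both modalities to a common channel width.
            self.cam_proj = nn.Conv2d(cam_channels, out_channels, kernel_size=1)
            self.lidar_proj = nn.Conv2d(lidar_channels, out_channels, kernel_size=1)
            # Predict per-cell fusion weights from the concatenated features.
            self.attn = nn.Sequential(
                nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_channels, 2, kernel_size=1),
                nn.Softmax(dim=1),  # weights for the two modalities sum to 1
            )

        def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
            # cam_bev:   (B, C_cam,   H, W) camera features in BEV
            # lidar_bev: (B, C_lidar, H, W) LiDAR features in the same BEV grid
            cam = self.cam_proj(cam_bev)
            lidar = self.lidar_proj(lidar_bev)
            w = self.attn(torch.cat([cam, lidar], dim=1))  # (B, 2, H, W)
            # Weighted aggregation: per-cell convex combination of the modalities.
            return w[:, 0:1] * cam + w[:, 1:2] * lidar

    if __name__ == "__main__":
        fusion = AttentionBEVFusion(cam_channels=64, lidar_channels=128, out_channels=128)
        cam_bev = torch.randn(2, 64, 200, 200)
        lidar_bev = torch.randn(2, 128, 200, 200)
        print(fusion(cam_bev, lidar_bev).shape)  # torch.Size([2, 128, 200, 200])

The fused map would then feed the semantic decoder that rasterizes lane lines, pedestrian crossings, and road boundaries. The 200x200 toy grid is likewise an assumption: at a 0.5 m cell size it would cover a 100 m x 100 m area around the ego vehicle, a common BEV extent in nuScenes-based map generation work.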

Key words: high-definition map generation, multi-modality fusion, Bird's-Eye-View (BEV) representation, autonomous driving, depth estimation
