
Computer Engineering

   

Extraction of Typical Scene Elements from Street View Images Based on Large Multimodal Models

  

Published: 2025-04-09


Abstract: Scene elements are fundamental to understanding urban geographic information, and their accurate extraction is essential for smart city development and geographic information systems. To address the complexity of street view images, the limitations of existing deep learning models in interpreting complex scenes, and the challenge of associating visual data with context, a method based on large multimodal models is proposed for extracting typical scene elements from street view images. First, the approach extends LLaVA with a multilayer perceptron and a high-resolution visual encoder to create GeoLLaVA. Second, a Street View Visual Instruction-Following Dataset providing multidimensional instructions is constructed for the scene element extraction task; the model is fine-tuned on these visual instructions to enhance its contextual understanding of complex street scenes, and Low-Rank Adaptation (LoRA) is applied to reduce the computational cost of fine-tuning. Finally, GeoLLaVA generates multidimensional scene descriptions from street view images, and key element keywords are extracted from the descriptions to obtain the typical scene elements. In comparative experiments against semantic segmentation, object detection, and other multimodal models, GeoLLaVA shows significant advantages, achieving F1 scores of 0.938, 0.842, and 0.829 for the extraction of traffic signals, intersections, and parking lots, respectively. A comparison of the model before and after fine-tuning demonstrates the effectiveness of the fine-tuning process, and ablation studies further validate the performance gains from the modified GeoLLaVA architecture as well as the effectiveness of LoRA in reducing computational resource consumption. In regional application experiments, batch inference is performed on street view images with geographic coordinates and the extracted elements are visualized by location; a comparison with OpenStreetMap (OSM) data not only confirms the model's accuracy but also reveals the limitations of OSM in providing comprehensive element information.
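The abstract credits Low-Rank Adaptation (LoRA) with keeping the visual-instruction fine-tuning affordable. As a minimal sketch only, and not the authors' code, the snippet below shows how LoRA adapters are typically attached to a LLaVA-style model using the Hugging Face PEFT library; the checkpoint name and every hyperparameter value are illustrative assumptions, not settings reported in the paper.

# Minimal sketch, not the authors' implementation: attach LoRA adapters to a
# LLaVA-style model with Hugging Face PEFT. Checkpoint and hyperparameters are
# assumed for illustration only.
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a public LLaVA checkpoint (an assumed stand-in for GeoLLaVA's base model).
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                        # scaling factor for the adapter output (assumed)
    lora_dropout=0.05,                    # dropout inside the adapters (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections in the language model
    task_type="CAUSAL_LM",
)

# Wrap the model: base weights stay frozen; only the small adapter matrices train.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Because only the adapter matrices receive gradients, the memory and compute footprint of fine-tuning drops sharply compared with updating all of the base model's weights, which is the efficiency gain the abstract attributes to LoRA.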
