
Computer Engineering


Outdoor Vision-and-Language Navigation Based on LLM and Exploration Module


  • Published: 2025-05-15


Abstract: The Vision-and-Language Navigation (VLN) task aims to guide an agent to a target location in a 3D or real-world environment by following language instructions. However, traditional end-to-end deep learning VLN algorithms have a limitation: once an erroneous action occurs during navigation planning, the agent tends to enter an incorrect path, after which it can no longer follow the instructions or wastes effort exploring unnecessary areas. To address this issue, an agent named Nav-Explore is proposed, built on a large language model and an exploration module. The agent leverages the reasoning capability of the large language model to predict the next action from the language instructions and the current visual observation, and uses the exploration module to balance exploration and exploitation. The exploration module employs an epsilon-greedy strategy to switch between normal navigation and exploration modes: when a randomly drawn probability falls below epsilon, the agent explores possible future paths to assess the feasibility of candidate actions in advance, thereby avoiding wrong decisions; when the probability exceeds epsilon, the agent directly executes the action output by the large language model. This modular design enables Nav-Explore to effectively raise the navigation success rate and to improve the agent's generalization to unseen environments. Experimental results show that Nav-Explore achieves superior performance on two outdoor VLN benchmark datasets, Touchdown and Map2seq, significantly increasing the navigation success rate. Furthermore, Nav-Explore exhibits strong generalization, completing navigation tasks effectively across different environments.
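
The epsilon-greedy switch described in the abstract can be read as the following control flow. This is a minimal sketch, not the paper's implementation: the function names, the epsilon value of 0.2, and the stub LLM planner and explorer are assumptions introduced only for illustration.

```python
import random

def nav_explore_step(instruction, observation, llm_policy, explorer, epsilon=0.2):
    """One decision step of the epsilon-greedy switch sketched in the abstract.

    llm_policy(instruction, observation) -> action proposed by the LLM planner.
    explorer(observation, action) -> action kept or replaced after looking
    ahead along possible future paths.
    Both callables and the epsilon value are hypothetical placeholders for the
    paper's LLM planner and exploration module.
    """
    # The LLM proposes the next action from the instruction and current view.
    action = llm_policy(instruction, observation)

    if random.random() < epsilon:
        # Exploration mode: check the proposed action against possible future
        # paths before committing to it.
        return explorer(observation, action)

    # Exploitation mode: execute the LLM's proposal directly.
    return action

# Toy usage with stub components (purely illustrative).
if __name__ == "__main__":
    llm_policy = lambda instr, obs: "forward"
    explorer = lambda obs, act: act  # a real explorer could veto or replace `act`
    print(nav_explore_step("turn left at the cafe", {"view": None},
                           llm_policy, explorer))
```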
