
Computer Engineering


Outdoor Vision-and-Language Navigation Based on LLM and Exploration Module


  • Published: 2025-05-15


Abstract: The Vision-and-Language Navigation (VLN) task aims to guide an agent to a target location in a 3D or real-world environment by following language instructions. However, traditional end-to-end deep learning VLN algorithms have a limitation: once an erroneous action occurs during navigation planning, the agent tends to enter an incorrect path, after which it can no longer follow the instructions or wastes effort exploring unnecessary areas. To address this issue, an agent named Nav-Explore is proposed, built on a large language model and an exploration module. The agent leverages the reasoning capability of the large language model to predict the next action from the language instructions and the current visual observation, and uses the exploration module to balance exploration and exploitation. The exploration module employs an epsilon-greedy strategy to switch between normal navigation and exploration modes: when a randomly drawn probability falls below epsilon, the agent explores possible future paths to assess the feasibility of candidate actions in advance, thereby avoiding wrong decisions; when the probability exceeds epsilon, the agent directly executes the action output by the large language model. This modular design enables Nav-Explore to effectively raise the navigation success rate and to improve the agent's generalization to unseen environments. Experimental results show that Nav-Explore achieves superior performance on two outdoor VLN benchmark datasets, Touchdown and Map2seq, significantly increasing the navigation success rate. Furthermore, Nav-Explore exhibits strong generalization, completing navigation tasks effectively across different environments.
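
The epsilon-greedy switch described in the abstract can be read as the following control flow. This is a minimal sketch, not the paper's implementation: the function names, the epsilon value of 0.2, and the stub LLM planner and explorer are assumptions introduced only for illustration.

```python
import random

def nav_explore_step(instruction, observation, llm_policy, explorer, epsilon=0.2):
    """One decision step of the epsilon-greedy switch sketched in the abstract.

    llm_policy(instruction, observation) -> action proposed by the LLM planner.
    explorer(observation, action) -> action kept or replaced after looking
    ahead along possible future paths.
    Both callables and the epsilon value are hypothetical placeholders for the
    paper's LLM planner and exploration module.
    """
    # The LLM proposes the next action from the instruction and current view.
    action = llm_policy(instruction, observation)

    if random.random() < epsilon:
        # Exploration mode: check the proposed action against possible future
        # paths before committing to it.
        return explorer(observation, action)

    # Exploitation mode: execute the LLM's proposal directly.
    return action

# Toy usage with stub components (purely illustrative).
if __name__ == "__main__":
    llm_policy = lambda instr, obs: "forward"
    explorer = lambda obs, act: act  # a real explorer could veto or replace `act`
    print(nav_explore_step("turn left at the cafe", {"view": None},
                           llm_policy, explorer))
```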
