
Computer Engineering, 2022, Vol. 48, Issue (12): 255-260, 269. doi: 10.19678/j.issn.1000-3428.0063156

• Development Research and Engineering Application •

Application of Model-Based Reinforcement Learning in Path Planning of Unmanned Aerial Vehicle

YANG Siming1,2, SHAN Zheng1, CAO Jiang2, GUO Jiayu1, GAO Yuan2, GUO Yang2, WANG Ping2, WANG Jing2, WANG Xiaonan2

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China;
  2. Academy of Military Sciences, Beijing 100091, China
  • Received: 2021-11-08  Revised: 2021-12-23  Published: 2022-12-07
  • About the authors: YANG Siming (b. 1994), male, M.S. candidate; his research interests include deep learning and reinforcement learning. SHAN Zheng, professor; CAO Jiang, research fellow; GUO Jiayu, M.S. candidate; GAO Yuan, associate professor; GUO Yang and WANG Ping, assistant research fellows; WANG Jing, research fellow; WANG Xiaonan, assistant research fellow.
  • Funding: National Natural Science Foundation of China (61971092, 61701503).


Abstract: This paper addresses the low sample efficiency and poor robustness of current reinforcement learning algorithms used for path planning of Unmanned Aerial Vehicle (UAV) aerial platforms, and proposes a model-based reinforcement learning algorithm with intrinsic rewards. The algorithm adopts a parallel architecture that completely decouples data collection from policy updates, improving learning efficiency, while the intrinsic reward raises the agent's exploration efficiency and prevents convergence to sub-optimal policies. During policy learning, the agent learns a dynamic model of the simulated environment, so that information such as states and rewards can be predicted more accurately within a limited number of steps. On this basis, combining finite-step planning with neural network prediction improves the accuracy of the value function estimate, which reduces the amount of experience data required to train the agent. The experimental results show that, compared with a model-free reinforcement learning algorithm of the same architecture, the proposed algorithm requires nearly 600 fewer episodes of experience data to reach the same training level, with substantial gains in sample efficiency and robustness. Compared with traditional non-reinforcement-learning heuristic algorithms, the score improves by nearly 8 000 points. Compared with mainstream model-based reinforcement learning algorithms such as MVE, the average score improves by nearly 2 000 points, with clear advantages in sample efficiency and stability.
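
The core mechanism the abstract describes, rolling a learned dynamics model forward a limited number of steps and adding an intrinsic exploration bonus to the predicted reward, can be illustrated with a minimal sketch. This is not the authors' implementation: dynamics_model, reward_model, value_fn, policy, and intrinsic_bonus below are hypothetical stand-ins for the learned neural networks described in the paper.

    import numpy as np

    # Hypothetical placeholder models; in the paper these are learned networks.
    def dynamics_model(s, a):   # predicts the next state
        return s + 0.1 * a

    def reward_model(s, a):     # predicts the extrinsic reward
        return -float(np.sum(s ** 2))

    def value_fn(s):            # critic estimate V(s)
        return -float(np.sum(s ** 2))

    def policy(s):              # current policy action
        return -s

    def intrinsic_bonus(s, beta=0.01):
        # Stand-in curiosity term: in practice this would be a learned
        # novelty/prediction-error signal; here it is only a placeholder.
        return beta / (1.0 + float(np.sum(s ** 2)))

    def expanded_value_target(s, horizon=5, gamma=0.99):
        """Finite-step, MVE-style value expansion: roll the learned dynamics
        model forward `horizon` steps, summing predicted (extrinsic plus
        intrinsic) rewards, then bootstrap with the critic at the final
        imagined state."""
        target, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            target += discount * (reward_model(s, a) + intrinsic_bonus(s))
            s = dynamics_model(s, a)
            discount *= gamma
        return target + discount * value_fn(s)

    print(expanded_value_target(np.array([1.0, -0.5])))

In a target of this shape, each critic update is trained against several steps of model-predicted reward rather than a single real transition, which is what allows training to proceed on less real experience.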

Key words: Unmanned Aerial Vehicle (UAV), aerial platform, path planning, reinforcement learning, deep learning
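
The parallel architecture mentioned in the abstract, which completely decouples data collection from policy updates, can likewise be sketched in miniature. This is an illustration under assumed details, not the paper's system: the shared queue, the actor/learner split, and the placeholder update step are all hypothetical.

    import queue, threading, random

    # Actors collect trajectories and push them to a shared buffer while a
    # learner thread consumes batches and updates the policy independently,
    # so data collection never waits on learning.
    experience = queue.Queue(maxsize=1000)   # shared experience buffer
    policy_version = [0]                     # stand-in for shared policy weights

    def actor(actor_id, episodes=5):
        for _ in range(episodes):
            traj = [(random.random(), policy_version[0]) for _ in range(10)]
            experience.put(traj)             # hand off a finished trajectory

    def learner(updates=10):
        for step in range(updates):
            batch = experience.get()         # consume whatever data is ready
            policy_version[0] += 1           # placeholder for a gradient update
            print(f"update {step}: {len(batch)} steps, policy v{policy_version[0]}")

    threads = [threading.Thread(target=actor, args=(i,)) for i in range(2)]
    threads.append(threading.Thread(target=learner))
    for t in threads: t.start()
    for t in threads: t.join()

Because the actors and the learner only share the buffer, each side runs at its own pace, which is the property the abstract credits for the improved learning efficiency.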
