
Computer Engineering, 2022, Vol. 48, Issue (12): 255-260, 269. doi: 10.19678/j.issn.1000-3428.0063156

• Development Research and Engineering Application •

Application of Model-Based Reinforcement Learning in Path Planning of Unmanned Aerial Vehicle

YANG Siming1,2, SHAN Zheng1, CAO Jiang2, GUO Jiayu1, GAO Yuan2, GUO Yang2, WANG Ping2, WANG Jing2, WANG Xiaonan2

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China;
  2. Academy of Military Sciences, Beijing 100091, China
  • Received: 2021-11-08  Revised: 2021-12-23  Published: 2022-12-07
  • About the authors: YANG Siming (b. 1994), male, M.S. candidate; his research interests include deep learning and reinforcement learning. SHAN Zheng, professor; CAO Jiang, research fellow; GUO Jiayu, M.S. candidate; GAO Yuan, associate professor; GUO Yang and WANG Ping, assistant research fellows; WANG Jing, research fellow; WANG Xiaonan, assistant research fellow.
  • Funding: National Natural Science Foundation of China (61971092, 61701503).


Abstract: This paper addresses the low sample efficiency and poor robustness of current reinforcement learning algorithms used for path planning of Unmanned Aerial Vehicle (UAV) aerial platforms, and proposes a model-based reinforcement learning algorithm with intrinsic rewards. The algorithm adopts a parallel architecture that completely decouples data collection from policy updates, improving learning efficiency, while the intrinsic reward raises the agent's exploration efficiency and prevents convergence to sub-optimal policies. During policy learning, the agent learns a dynamic model of the simulated environment, so that information such as states and rewards can be predicted more accurately within a limited number of steps. On this basis, combining finite-step planning with neural network prediction improves the accuracy of the value function estimate, which reduces the amount of experience data required to train the agent. The experimental results show that, compared with a model-free reinforcement learning algorithm of the same architecture, the proposed algorithm requires nearly 600 fewer episodes of experience data to reach the same training level, with substantial gains in sample efficiency and robustness. Compared with traditional non-reinforcement-learning heuristic algorithms, the score improves by nearly 8 000 points. Compared with mainstream model-based reinforcement learning algorithms such as MVE, the average score improves by nearly 2 000 points, with clear advantages in sample efficiency and stability.
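
The core mechanism the abstract describes, rolling a learned dynamics model forward a limited number of steps and adding an intrinsic exploration bonus to the predicted reward, can be illustrated with a minimal sketch. This is not the authors' implementation: dynamics_model, reward_model, value_fn, policy, and intrinsic_bonus below are hypothetical stand-ins for the learned neural networks described in the paper.

    import numpy as np

    # Hypothetical placeholder models; in the paper these are learned networks.
    def dynamics_model(s, a):   # predicts the next state
        return s + 0.1 * a

    def reward_model(s, a):     # predicts the extrinsic reward
        return -float(np.sum(s ** 2))

    def value_fn(s):            # critic estimate V(s)
        return -float(np.sum(s ** 2))

    def policy(s):              # current policy action
        return -s

    def intrinsic_bonus(s, beta=0.01):
        # Stand-in curiosity term: in practice this would be a learned
        # novelty/prediction-error signal; here it is only a placeholder.
        return beta / (1.0 + float(np.sum(s ** 2)))

    def expanded_value_target(s, horizon=5, gamma=0.99):
        """Finite-step, MVE-style value expansion: roll the learned dynamics
        model forward `horizon` steps, summing predicted (extrinsic plus
        intrinsic) rewards, then bootstrap with the critic at the final
        imagined state."""
        target, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            target += discount * (reward_model(s, a) + intrinsic_bonus(s))
            s = dynamics_model(s, a)
            discount *= gamma
        return target + discount * value_fn(s)

    print(expanded_value_target(np.array([1.0, -0.5])))

In a target of this shape, each critic update is trained against several steps of model-predicted reward rather than a single real transition, which is what allows training to proceed on less real experience.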

Key words: Unmanned Aerial Vehicle (UAV), aerial platform, path planning, reinforcement learning, deep learning
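
The parallel architecture mentioned in the abstract, which completely decouples data collection from policy updates, can likewise be sketched in miniature. This is an illustration under assumed details, not the paper's system: the shared queue, the actor/learner split, and the placeholder update step are all hypothetical.

    import queue, threading, random

    # Actors collect trajectories and push them to a shared buffer while a
    # learner thread consumes batches and updates the policy independently,
    # so data collection never waits on learning.
    experience = queue.Queue(maxsize=1000)   # shared experience buffer
    policy_version = [0]                     # stand-in for shared policy weights

    def actor(actor_id, episodes=5):
        for _ in range(episodes):
            traj = [(random.random(), policy_version[0]) for _ in range(10)]
            experience.put(traj)             # hand off a finished trajectory

    def learner(updates=10):
        for step in range(updates):
            batch = experience.get()         # consume whatever data is ready
            policy_version[0] += 1           # placeholder for a gradient update
            print(f"update {step}: {len(batch)} steps, policy v{policy_version[0]}")

    threads = [threading.Thread(target=actor, args=(i,)) for i in range(2)]
    threads.append(threading.Thread(target=learner))
    for t in threads: t.start()
    for t in threads: t.join()

Because the actors and the learner only share the buffer, each side runs at its own pace, which is the property the abstract credits for the improved learning efficiency.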
