
Computer Engineering ›› 2024, Vol. 50 ›› Issue (1): 313-319. doi: 10.19678/j.issn.1000-3428.0066193

• Development Research and Engineering Application •


Reinforcement Learning Navigation Method Based on Advantage Hindsight Experience Replay

Shaotong WANG, Liqun KUANG*, Huiyan HAN, Fengguang XIONG, Hongxin XUE

  1. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
  • Received: 2022-11-06  Online: 2024-01-15  Published: 2024-01-11
  • Contact: Liqun KUANG
  • Supported by: National Natural Science Foundation of China (62106238); Scientific Research Project for Returned Overseas Scholars of Shanxi Province (2020-113); Special Guidance Project for the Transformation of Scientific Achievements of Shanxi Province (202104021301055)


Abstract:

Reinforcement learning demonstrates significant potential in the field of mobile robots. By combining reinforcement learning algorithms with robot navigation, autonomous navigation of mobile robots can be achieved without relying on prior knowledge. However, robot reinforcement learning suffers from low sample utilization and poor generalization ability. To address these problems, this paper proposes an advantage hindsight experience replay algorithm, built on the D3QN algorithm, for replaying experience samples. First, the advantage value of each point in a trajectory sample is computed, and the point with the maximum advantage value is selected as the new goal. The trajectory sample is then relabeled, and both the original and relabeled trajectories are placed in the experience pool to increase the diversity of experience samples, allowing the agent to learn from failed episodes and navigate to the goal more efficiently. To assess the effectiveness of the proposed approach, different experimental environments are built on the Gazebo platform, and a TurtleBot3 robot is used to conduct navigation training and transfer tests in simulation. The results show that the navigation success rate of the proposed algorithm in the training environments is higher than that of current mainstream algorithms and reaches 86.33% in the transfer test environment. The algorithm effectively improves the utilization of navigation samples, reduces the difficulty of learning navigation policies, and enhances the autonomous navigation and transfer generalization abilities of mobile robots in different environments.

Key words: reinforcement learning, mobile robots, hindsight experience replay, neural network, sample utilization
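
As a rough illustration of the replay scheme the abstract describes, the minimal sketch below relabels a failed trajectory at its maximum-advantage point and recomputes rewards against the new goal. All interfaces here (`Transition`, `advantage_fn`, `reward_fn`) are hypothetical stand-ins; the paper's actual data structures, network architecture, and reward definition are not reproduced.

```python
# Illustrative sketch only: interfaces are assumed, not taken from the paper.
import numpy as np
from typing import Callable, List, NamedTuple


class Transition(NamedTuple):
    state: np.ndarray
    action: int
    reward: float
    next_state: np.ndarray
    goal: np.ndarray


def relabel_with_max_advantage(
    trajectory: List[Transition],
    advantage_fn: Callable[[np.ndarray, int], float],
    reward_fn: Callable[[np.ndarray, np.ndarray], float],
) -> List[Transition]:
    """Pick the trajectory point with the largest advantage estimate as the
    new goal, then relabel the transitions up to that point against it."""
    # 1. Score each stored (state, action) pair with its advantage value,
    #    e.g. A(s, a) = Q(s, a) - V(s) from a dueling (D3QN-style) head.
    scores = [advantage_fn(t.state, t.action) for t in trajectory]

    # 2. The successor state of the max-advantage point becomes the new goal.
    best = int(np.argmax(scores))
    new_goal = trajectory[best].next_state

    # 3. Relabel the prefix of the trajectory: recompute each reward with
    #    respect to the new goal so a failed episode still yields a useful
    #    learning signal.
    return [
        Transition(t.state, t.action, reward_fn(t.next_state, new_goal),
                   t.next_state, new_goal)
        for t in trajectory[: best + 1]
    ]
```

In the scheme described above, both the original and the relabeled trajectories would then be pushed into the replay buffer; this is what increases sample diversity and lets the agent learn from failed experience samples.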