
Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 150-159. doi: 10.19678/j.issn.1000-3428.0070256

• Computational Intelligence and Pattern Recognition •

Mobile Robot Path Planning Based on Improved TD3 Algorithm

LI Mingming, PAN Zihao*

  1. College of Communication and Information Engineering, Xi'an University of Science and Technology, Xi'an 710600, Shaanxi, China
  • Received: 2024-08-14  Revised: 2024-10-19  Online: 2026-05-15  Published: 2024-12-18
  • Corresponding author: PAN Zihao
  • About the authors:

    LI Mingming, female, associate professor, M.S.; her main research interest is robotics.

    PAN Zihao, M.S. candidate.

  • Funding:
    National Natural Science Foundation of China (62401459)

Abstract:

Traditional mobile robot path-planning algorithms generally require a map to plan paths effectively. By contrast, path planning based on Deep Reinforcement Learning (DRL) has attracted considerable attention because it can navigate without a map. However, conventional DRL path-planning algorithms often suffer from low sample utilization, slow training, and insufficient generalization. To address these issues, the Twin Delayed Deep Deterministic (TD3) policy gradient algorithm is improved to enhance its performance in mobile robot path planning. First, to overcome the TD3 algorithm's limited capacity for sustained exploration, its exploration strategy is improved by using temporally correlated pink noise, which strengthens the algorithm's ability to keep exploring the action space. Second, the n-step method is combined with the Loss-Adjusted Approximate Actor Prioritized (LA3P) experience replay method: the n-step method expands the immediate reward stored in the replay buffer into an n-step cumulative discounted reward, capturing long-term reward signals more accurately, while the LA3P method uses these n-step experiences efficiently, improving sample utilization and overall performance. Finally, three different environments are built in Gazebo, and the improved algorithm is compared with several existing algorithms. The experimental results show that the improved algorithm performs better in terms of training time, average success rate, and average distance, demonstrating its effectiveness.

Key words: path planning, Deep Reinforcement Learning (DRL), Twin Delayed Deep Deterministic (TD3) policy gradient, pink noise, Loss-Adjusted Approximate Actor Prioritized (LA3P) experience replay
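
The abstract does not specify how the temporally correlated pink noise is generated; one standard approach is to shape white noise in the frequency domain so that its power falls off as 1/f. The following is a minimal NumPy sketch of that approach, not the authors' implementation; the function name and the rollout usage at the end are illustrative assumptions.

```python
import numpy as np

def sample_pink_noise(n_steps, n_dims, rng=None):
    """Sample temporally correlated pink (1/f) noise via an inverse FFT.

    Returns an (n_dims, n_steps) array; each row is one action dimension's
    noise sequence whose power spectral density is proportional to 1/f,
    i.e. smoother and more persistent than uncorrelated Gaussian noise.
    """
    if rng is None:
        rng = np.random.default_rng()
    freqs = np.fft.rfftfreq(n_steps)              # 0 .. 0.5, length n_steps//2 + 1
    amp = np.zeros_like(freqs)
    amp[1:] = 1.0 / np.sqrt(freqs[1:])            # 1/f power => 1/sqrt(f) amplitude
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_dims, freqs.size))
    spectrum = amp * np.exp(1j * phases)          # shaped magnitude, random phase
    noise = np.fft.irfft(spectrum, n=n_steps, axis=-1)
    return noise / noise.std(axis=-1, keepdims=True)  # unit variance per dimension

# Hypothetical usage in a TD3 rollout: pre-sample one episode's noise, then
# perturb the deterministic policy action at each step t.
# noise = sample_pink_noise(max_episode_steps, action_dim)
# action = np.clip(actor(state) + sigma * noise[:, t], -max_action, max_action)
```

Because consecutive samples are correlated, the perturbed actions drift coherently rather than jittering around the policy output, which is what sustains exploration of the action space over long horizons.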
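The n-step reward folding described in the abstract can likewise be sketched as a small wrapper placed in front of the replay buffer. This is a minimal illustration under assumed names (NStepFolder, push); the LA3P sampling itself, which prioritizes high-TD-error transitions for the critic and, approximately, low-error ones for the actor, is omitted here.

```python
from collections import deque

class NStepFolder:
    """Fold consecutive transitions into n-step transitions before storage.

    The emitted reward is the cumulative discounted reward
    r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1} (k = n, or fewer at
    episode end), and the emitted next state is the state k steps ahead.
    """

    def __init__(self, n=3, gamma=0.99):
        self.n, self.gamma = n, gamma
        self.window = deque(maxlen=n)

    def push(self, s, a, r, s_next, done):
        """Add one raw transition; return a folded transition or None."""
        self.window.append((s, a, r, s_next, done))
        if len(self.window) < self.n and not done:
            return None                      # still filling the window
        ret, discount = 0.0, 1.0
        for (_, _, r_i, s_n, d_i) in self.window:
            ret += discount * r_i            # accumulate discounted rewards
            discount *= self.gamma
            if d_i:                          # stop folding at termination
                break
        s0, a0 = self.window[0][:2]
        folded = (s0, a0, ret, s_n, d_i, discount)
        if done:
            # For brevity this sketch does not flush the shorter windows
            # remaining at episode end; real implementations usually do.
            self.window.clear()
        return folded
```

The returned discount equals gamma^k for the k rewards actually folded, so the TD3 critic target on a non-terminal folded transition becomes y = ret + discount * min(Q1', Q2')(s_n, a~), bootstrapping from the state k steps ahead instead of the immediate successor.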