
Computer Engineering ›› 2020, Vol. 46 ›› Issue (5): 78-85, 93. doi: 10.19678/j.issn.1000-3428.0054557

• Artificial Intelligence and Pattern Recognition •

Dual-Network DQN Algorithm Based on Second-Order Temporal Difference Error

CHEN Jianping(a,b), ZHOU Xin(a,b), FU Qiming(a,b), GAO Zhen(a), FU Baochuan(a,b), WU Hongjie(a)

  1. a. School of Electronic and Information Engineering; b. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received: 2019-04-10  Revised: 2019-05-14  Published: 2019-05-20
  • About the authors: CHEN Jianping (1963-), male, professor, research interests in building energy efficiency and intelligent information processing; ZHOU Xin, M.S. candidate; FU Qiming (corresponding author), lecturer; GAO Zhen, associate professor; FU Baochuan, professor; WU Hongjie, associate professor.
  • Supported by: National Natural Science Foundation of China (61772357, 61672371); Key Research and Development Program of Jiangsu Province (BE2017663); Postgraduate Research and Practice Innovation Program of Jiangsu Province (SJCX18-0881).


Abstract: To address the poor convergence stability caused by overestimation in the Deep Q-Network (DQN) algorithm, this paper extends the traditional Temporal Difference (TD) error to the concept of an N-order TD error and designs a dual-network DQN algorithm based on the second-order TD error. A value function update formula based on the second-order TD error is constructed, and a dual-network model is built on top of DQN, yielding two isomorphic value function networks that represent the value functions of two successive rounds and update the network parameters cooperatively, thereby improving the stability of value function estimation in DQN. Experimental results on the OpenAI Gym platform show that the proposed algorithm achieves better convergence stability than the classical DQN algorithm on the Mountain Car and Cart Pole problems.
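The abstract does not spell out the second-order update rule. The sketch below is a minimal tabular illustration, not the paper's formulation: it assumes the classical first-order TD error delta = r + gamma * max_a Q(s', a) - Q(s, a), reads the second-order TD error as the difference between the first-order errors of the two successive-round value functions, and uses arrays q_curr and q_prev as stand-ins for the two isomorphic networks. The function names, the tabular setting, and the way the two errors are blended in the update are all assumptions made for illustration.

    import numpy as np

    def td_error(q, s, a, r, s_next, gamma):
        # Classical first-order TD error: delta = r + gamma * max_a' q(s', a') - q(s, a).
        return r + gamma * np.max(q[s_next]) - q[s, a]

    def second_order_update(q_curr, q_prev, s, a, r, s_next, gamma=0.99, alpha=0.1):
        # Hypothetical cooperative update driven by a second-order TD error:
        # delta2 is taken here to be the difference between this round's and
        # last round's first-order TD errors for the same transition.
        delta_curr = td_error(q_curr, s, a, r, s_next, gamma)
        delta_prev = td_error(q_prev, s, a, r, s_next, gamma)
        delta2 = delta_curr - delta_prev
        q_prev[s, a] = q_curr[s, a]                    # previous-round value catches up
        q_curr[s, a] += alpha * (delta_curr + delta2)  # current value moves by a blend
        return delta2

    # Toy usage: 5 states, 2 actions, one observed transition (s=0, a=1, r=1.0, s'=2).
    q_curr = np.zeros((5, 2))
    q_prev = np.zeros((5, 2))
    print(second_order_update(q_curr, q_prev, s=0, a=1, r=1.0, s_next=2))

In the paper itself the two value functions are deep networks whose parameters are updated cooperatively by gradient descent; the tabular sketch only mirrors the error structure described in the abstract.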

Key words: Deep Reinforcement Learning (DRL), Markov Decision Process (MDP), Deep Q-Network (DQN), second-order Temporal Difference (TD) error, gradient descent

CLC Number: