Nonparametric Approximation Strategy Iteration Parallel Reinforcement Learning Algorithm

doi:10.19678/j.issn.1000-3428.0048935

Computer Engineering, 2018, Vol. 44, Issue 11: 313-320.

• Development Research and Engineering Application •

JI Ting, ZHANG Hua

Key Lab of Robot and Welding Automation of Jiangxi Province, Nanchang University, Nanchang 330031, China

Received: 2017-10-12 Online: 2018-11-15 Published: 2018-11-15
About the authors: JI Ting (born 1982), male, Ph.D. candidate, whose main research interests are intelligent robots and intelligent control; ZHANG Hua, professor.
Funding:
National High Technology Research and Development Program of China (SS2013AA041003)

Abstract

Abstract: To address the slow convergence of online approximate policy iteration reinforcement learning algorithms, a nonparametric approximate policy iteration parallel reinforcement learning algorithm is proposed. The number of parallel units is determined by the sample collection process used to construct the learning units. Each reinforcement learning unit is designed on a linear approximation structure with Radial Basis Function (RBF) features, is constructed autonomously by an estimation method that aims at full coverage of the sample space, and learns autonomously by approximate policy iteration. The overall policy of the algorithm is obtained by fusing the units through weighted averaging. Simulation results on a single inverted pendulum show that, compared with the online LSPI and BLSPI algorithms, the proposed algorithm achieves high efficiency while maintaining a high speedup ratio, requires fewer control parameters, and converges faster.

Key words: parallel reinforcement learning, nonparametric, policy iteration, K-means clustering, inverted pendulum
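The abstract states only that each unit uses a linear approximation structure over RBF features and that the units are fused by weighted averaging; it gives no formulas. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It assumes Gaussian RBFs with a shared width, one weight vector per discrete action, and fusion at the level of Q-values. The names `rbf_features`, `Unit`, and `fused_policy` are hypothetical.

```python
import math


def rbf_features(state, centers, width):
    """Gaussian RBF activation of `state` for each center (assumed kernel form)."""
    return [
        math.exp(-sum((s - c) ** 2 for s, c in zip(state, center)) / (2.0 * width ** 2))
        for center in centers
    ]


class Unit:
    """Hypothetical learning unit: Q(s, a) = theta[a] . phi(s), linear in RBF features."""

    def __init__(self, centers, width, theta):
        self.centers = centers  # list of RBF centers, one per basis function
        self.width = width      # shared Gaussian width
        self.theta = theta      # theta[a][i]: weight of basis i for action a

    def q_values(self, state):
        phi = rbf_features(state, self.centers, self.width)
        return [sum(t * p for t, p in zip(row, phi)) for row in self.theta]


def fused_policy(units, weights, state):
    """Greedy action under the normalized weighted average of the units' Q-values."""
    total = sum(weights)
    n_actions = len(units[0].theta)
    q = [0.0] * n_actions
    for w, unit in zip(weights, units):
        for a, qa in enumerate(unit.q_values(state)):
            q[a] += (w / total) * qa
    best = max(range(n_actions), key=lambda a: q[a])
    return best, q
```

In this sketch, the weights control how strongly each unit influences the fused greedy action. How the paper actually chooses those weights, and whether it averages Q-values or some other quantity, is not stated in the abstract.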

CLC number: TP181

JI Ting, ZHANG Hua. Nonparametric Approximation Strategy Iteration Parallel Reinforcement Learning Algorithm[J]. Computer Engineering, 2018, 44(11): 313-320.

https://www.ecice06.com/CN/Y2018/V44/I11/313

