[1] SILVER D, HUANG A, MADDISON C J, et al.Mastering the game of Go with deep neural networks and tree search[J].Nature, 2016, 529(7587):484-489.
[2] SILVER D, HUBERT T, SCHRITTWIESER J, et al.A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play[J].Science, 2018, 362(6419):1140-1144.
[3] YE D H, CHEN G B, ZHAO P L, et al.Supervised learning achieves human-level performance in MOBA games:a case study of Honor of Kings[EB/OL].[2021-02-25].https://arxiv.org/abs/2011.12582.
[4] YE D H, LIU Z, SUN M F, et al.Mastering complex control in MOBA games with deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1912.09729.
[5] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al.Grandmaster level in StarCraft II using multi-agent reinforcement learning[J].Nature, 2019, 575(7782):350-354.
[6] SILVER D, NEWNHAM L, BARKER D, et al.Concurrent reinforcement learning from customer interactions[C]//Proceedings of 2013 International Conference on Machine Learning.New York, USA:ACM Press, 2013:924-932.
[7] LEVINE S, PASTOR P, KRIZHEVSKY A, et al.Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection[EB/OL].[2021-02-25].https://arxiv.org/abs/1603.02199.
[8] CUI L Q, GUO X Z, GUO J, et al.Overload control strategy for sporadic real-time system[J].Computer Engineering, 2019, 45(6):108-114.(in Chinese)
[9] GOODFELLOW I, BENGIO Y, COURVILLE A.Deep learning[M].Cambridge, USA:MIT Press, 2016.
[10] SUTTON R S, BARTO A G.Reinforcement learning:an introduction[M].2nd ed.Cambridge, USA:MIT Press, 2018.
[11] RUMMERY G A, NIRANJAN M.On-line Q-learning using connectionist systems[R].Cambridge, UK:University of Cambridge, 1994.
[12] WATKINS C J C H, DAYAN P.Technical note:Q-learning[J].Machine Learning, 1992, 8(3/4):279-292.
[13] KONIDARIS G, OSENTOSKI S, THOMAS P.Value function approximation in reinforcement learning using the Fourier basis[C]//Proceedings of 2011 AAAI Conference on Artificial Intelligence.Palo Alto, USA:AAAI Press, 2011:380-385.
[14] CONNELL M E, UTGOFF P E.Learning to control a dynamic physical system[J].Computational Intelligence, 1987, 3(1):330-337.
[15] ATKESON C G, MOORE A W, SCHAAL S.Locally weighted learning for control[M].Berlin, Germany:Springer, 1997.
[16] MNIH V, KAVUKCUOGLU K, SILVER D, et al.Human-level control through deep reinforcement learning[J].Nature, 2015, 518(7540):529-533.
[17] VAN HASSELT H, GUEZ A, SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of 2016 AAAI Conference on Artificial Intelligence.Palo Alto, USA:AAAI Press, 2016:2094-2100.
[18] SCHAUL T, QUAN J, ANTONOGLOU I, et al.Prioritized experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.05952.
[19] WANG Z, SCHAUL T, HESSEL M, et al.Dueling network architectures for deep reinforcement learning[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:1995-2003.
[20] HESTER T, VECERIK M, PIETQUIN O, et al.Deep Q-learning from demonstrations[EB/OL].[2021-02-25].https://arxiv.org/abs/1704.03732.
[21] BELLEMARE M G, DABNEY W, MUNOS R.A distributional perspective on reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.06887.
[22] DABNEY W, ROWLAND M, BELLEMARE M G, et al.Distributional reinforcement learning with quantile regression[EB/OL].[2021-02-25].https://arxiv.org/abs/1710.10044.
[23] DABNEY W, OSTROVSKI G, SILVER D, et al.Implicit quantile networks for distributional reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1806.06923.
[24] FORTUNATO M, AZAR M G, PIOT B, et al.Noisy networks for exploration[EB/OL].[2021-02-25].https://arxiv.org/abs/1706.10295.
[25] HESSEL M, MODAYIL J, VAN HASSELT H, et al.Rainbow:combining improvements in deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1710.02298.
[26] WILLIAMS R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning[J].Machine Learning, 1992, 8(3/4):229-256.
[27] DEGRIS T, WHITE M, SUTTON R S.Off-policy actor-critic[EB/OL].[2021-02-25].https://arxiv.org/abs/1205.4839.
[28] SCHULMAN J, MORITZ P, LEVINE S, et al.High-dimensional continuous control using generalized advantage estimation[EB/OL].[2021-02-25].https://arxiv.org/abs/1506.02438.
[29] SCHULMAN J, LEVINE S, ABBEEL P, et al.Trust region policy optimization[C]//Proceedings of 2015 International Conference on Machine Learning.New York, USA:ACM Press, 2015:1889-1897.
[30] SCHULMAN J, WOLSKI F, DHARIWAL P, et al.Proximal policy optimization algorithms[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.06347.
[31] WU Y, MANSIMOV E, LIAO S, et al.Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation[EB/OL].[2021-02-25].https://arxiv.org/abs/1708.05144.
[32] SILVER D, LEVER G, HEESS N, et al.Deterministic policy gradient algorithms[C]//Proceedings of 2014 International Conference on Machine Learning.New York, USA:ACM Press, 2014:387-395.
[33] LILLICRAP T P, HUNT J J, PRITZEL A, et al.Continuous control with deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1509.02971.
[34] FUJIMOTO S, VAN HOOF H, MEGER D.Addressing function approximation error in actor-critic methods[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.09477.
[35] MUNOS R, STEPLETON T, HARUTYUNYAN A, et al.Safe and efficient off-policy reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1606.02647.
[36] WANG Z, BAPST V, HEESS N, et al.Sample efficient actor-critic with experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.01224.
[37] HAARNOJA T, ZHOU A, ABBEEL P, et al.Soft actor-critic:off-policy maximum entropy deep reinforcement learning with a stochastic actor[EB/OL].[2021-02-25].https://arxiv.org/abs/1801.01290.
[38] MNIH V, BADIA A P, MIRZA M, et al.Asynchronous methods for deep reinforcement learning[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:1928-1937.
[39] WU Y, MANSIMOV E, LIAO S, et al.OpenAI Baselines:ACKTR & A2C[EB/OL].[2021-02-25].https://openai.com/blog/baselines-acktr-a2c/.
[40] ESPEHOLT L, SOYER H, MUNOS R, et al.IMPALA:scalable distributed deep-RL with importance weighted actor-learner architectures[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.01561.
[41] HORGAN D, QUAN J, BUDDEN D, et al.Distributed prioritized experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1803.00933.
[42] RASMUSSEN C E, DEISENROTH M P.Probabilistic inference for fast learning in control[C]//Proceedings of 2008 European Workshop on Reinforcement Learning.Berlin, Germany:Springer, 2008:229-242.
[43] DEISENROTH M P, RASMUSSEN C E.PILCO:a model-based and data-efficient approach to policy search[C]//Proceedings of the 28th International Conference on Machine Learning.New York, USA:ACM Press, 2011:465-472.
[44] MCALLISTER R.Bayesian learning for data-efficient control[D].Cambridge, UK:University of Cambridge, 2017.
[45] ANTHONY T, TIAN Z, BARBER D.Thinking fast and slow with deep learning and tree search[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:5360-5370.
[46] CLAVERA I, ROTHFUSS J, SCHULMAN J, et al.Model-based reinforcement learning via meta-policy optimization[EB/OL].[2021-02-25].https://arxiv.org/abs/1809.05214.
[47] NAGABANDI A, KAHN G, FEARING R S, et al.Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning[C]//Proceedings of 2018 IEEE International Conference on Robotics and Automation.Washington D.C., USA:IEEE Press, 2018:7559-7566.
[48] FEINBERG V, WAN A, STOICA I, et al.Model-based value estimation for efficient model-free reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1803.00101.
[49] BUCKMAN J, HAFNER D, TUCKER G, et al.Sample-efficient reinforcement learning with stochastic ensemble value expansion[C]//Proceedings of 2018 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2018:8224-8234.
[50] KURUTACH T, CLAVERA I, DUAN Y, et al.Model-ensemble trust-region policy optimization[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.10592.
[51] HOUTHOOFT R, CHEN X, DUAN Y, et al.VIME:variational information maximizing exploration[C]//Proceedings of 2016 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:65-74.
[52] BURDA Y, EDWARDS H, PATHAK D, et al.Large-scale study of curiosity-driven learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1808.04355.
[53] PATHAK D, AGRAWAL P, EFROS A A, et al.Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops.Washington D.C., USA:IEEE Press, 2017:488-489.
[54] BURDA Y, EDWARDS H, STORKEY A, et al.Exploration by random network distillation[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.12894.
[55] BELLEMARE M, SRINIVASAN S, OSTROVSKI G, et al.Unifying count-based exploration and intrinsic motivation[C]//Proceedings of 2016 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:1471-1479.
[56] OSTROVSKI G, BELLEMARE M G, OORD A, et al.Count-based exploration with neural density models[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.01310.
[57] TANG H, HOUTHOOFT R, FOOTE D, et al.#Exploration:a study of count-based exploration for deep reinforcement learning[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:2753-2762.
[58] KRISHNAMURTHY R, LAKSHMINARAYANAN A S, KUMAR P, et al.Hierarchical reinforcement learning using spatio-temporal abstractions and deep neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1605.05359.
[59] RAFATI J, NOELLE D C.Learning representations in model-free hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.10096.
[60] SUKHBAATAR S, LIN Z M, KOSTRIKOV I, et al.Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.05407.
[61] VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al.FeUdal networks for hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.01161.
[62] NACHUM O, GU S X, LEE H, et al.Data-efficient hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1805.08296.
[63] BACON P L, HARB J, PRECUP D.The option-critic architecture[EB/OL].[2021-02-25].https://arxiv.org/abs/1609.05140.
[64] LEVINE S, POPOVIC Z, KOLTUN V.Feature construction for inverse reinforcement learning[C]//Proceedings of 2010 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2010:1-10.
[65] JIN M, DAMIANOU A, ABBEEL P, et al.Inverse reinforcement learning via deep Gaussian process[EB/OL].[2021-02-25].https://arxiv.org/abs/1512.08065.
[66] FINN C, LEVINE S, ABBEEL P.Guided cost learning:deep inverse optimal control via policy optimization[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:49-58.
[67] HO J, ERMON S.Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:4572-4580.
[68] PENG X B, KANAZAWA A, TOYER S, et al.Variational discriminator bottleneck:improving imitation learning, inverse RL, and GANs by constraining information flow[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.00821.
[69] RUSU A A, RABINOWITZ N C, DESJARDINS G, et al.Progressive neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1606.04671.
[70] FERNANDO C, BANARSE D, BLUNDELL C, et al.PathNet:evolution channels gradient descent in super neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1701.08734.
[71] RUSU A A, COLMENAREJO S G, GULCEHRE C, et al.Policy distillation[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.06295.
[72] PARISOTTO E, BA J L, SALAKHUTDINOV R.Actor-Mimic:deep multitask and transfer reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.06342.
[73] SCHAUL T, HORGAN D, GREGOR K, et al.Universal value function approximators[C]//Proceedings of 2015 International Conference on Machine Learning.New York, USA:ACM Press, 2015:1312-1320.
[74] JADERBERG M, MNIH V, CZARNECKI W M, et al.Reinforcement learning with unsupervised auxiliary tasks[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.05397.
[75] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al.Hindsight experience replay[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:5048-5058.
[76] DUAN Y, SCHULMAN J, CHEN X, et al.RL2:fast reinforcement learning via slow reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.02779.
[77] MISHRA N, ROHANINEJAD M, CHEN X, et al.A simple neural attentive meta-learner[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.03141.
[78] FAKOOR R, CHAUDHARI P, SOATTO S, et al.Meta-Q-learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1910.00125.
[79] FINN C, ABBEEL P, LEVINE S.Model-agnostic meta-learning for fast adaptation of deep networks[C]//Proceedings of the 34th International Conference on Machine Learning.New York, USA:ACM Press, 2017:1126-1135.
[80] RAKELLY K, ZHOU A, QUILLEN D, et al.Efficient off-policy meta-reinforcement learning via probabilistic context variables[EB/OL].[2021-02-25].https://arxiv.org/abs/1903.08254.
[81] GU S X, LILLICRAP T, GHAHRAMANI Z, et al.Q-Prop:sample-efficient policy gradient with an off-policy critic[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.02247.
[82] NACHUM O, NOROUZI M, XU K, et al.Bridging the gap between value and policy based reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1702.08892.
[83] NACHUM O, NOROUZI M, XU K, et al.Trust-PCL:an off-policy trust region method for continuous control[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.01891.
[84] TEH Y W, BAPST V, CZARNECKI W M, et al.Distral:robust multitask reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:4499-4509.
[85] VAN HASSELT H, GUEZ A, HESSEL M, et al.Learning values across many orders of magnitude[EB/OL].[2021-02-25].https://arxiv.org/abs/1602.07714.