[1] SILVER D, HUANG A, MADDISON C J, et al.Mastering the game of Go with deep neural networks and tree search[J].Nature, 2016, 529(7587):484-489.
[2] SILVER D, HUBERT T, SCHRITTWIESER J, et al.A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play[J].Science, 2018, 362(6419):1140-1144.
[3] YE D H, CHEN G B, ZHAO P L, et al.Supervised learning achieves human-level performance in MOBA games:a case study of Honor of Kings[EB/OL].[2021-02-25].https://arxiv.org/abs/2011.12582.
[4] YE D H, LIU Z, SUN M F, et al.Mastering complex control in MOBA games with deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1912.09729.
[5] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al.Grandmaster level in StarCraft II using multi-agent reinforcement learning[J].Nature, 2019, 575(7782):350-354.
[6] SILVER D, NEWNHAM L, BARKER D, et al.Concurrent reinforcement learning from customer interactions[C]//Proceedings of 2013 International Conference on Machine Learning.New York, USA:ACM Press, 2013:924-932.
[7] LEVINE S, PASTOR P, KRIZHEVSKY A, et al.Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection[EB/OL].[2021-02-25].https://arxiv.org/abs/1603.02199.
[8] CUI L Q, GUO X Z, GUO J, et al.Overload control strategy for sporadic real-time system[J].Computer Engineering, 2019, 45(6):108-114.(in Chinese)
[9] GOODFELLOW I, BENGIO Y, COURVILLE A.Deep learning[M].Cambridge, USA:MIT Press, 2016.
[10] SUTTON R S, BARTO A G.Reinforcement learning:an introduction[M].2nd ed.Cambridge, USA:MIT Press, 2018.
[11] RUMMERY G A, NIRANJAN M.On-line Q-learning using connectionist systems[R].Cambridge, UK:University of Cambridge, 1994.
[12] WATKINS C J C H, DAYAN P.Technical note:Q-learning[J].Machine Learning, 1992, 8(3/4):279-292.
[13] KONIDARIS G, OSENTOSKI S, THOMAS P.Value function approximation in reinforcement learning using the Fourier basis[C]//Proceedings of 2011 AAAI Conference on Artificial Intelligence.Palo Alto, USA:AAAI Press, 2011:380-385.
[14] CONNELL M E, UTGOFF P E.Learning to control a dynamic physical system[J].Computational Intelligence, 1987, 3(1):330-337.
[15] ATKESON C G, MOORE A W, SCHAAL S.Locally weighted learning for control[M].Berlin, Germany:Springer, 1997.
[16] MNIH V, KAVUKCUOGLU K, SILVER D, et al.Human-level control through deep reinforcement learning[J].Nature, 2015, 518(7540):529-533.
[17] VAN HASSELT H, GUEZ A, SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of 2016 AAAI Conference on Artificial Intelligence.Palo Alto, USA:AAAI Press, 2016:2094-2100.
[18] SCHAUL T, QUAN J, ANTONOGLOU I, et al.Prioritized experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.05952.
[19] WANG Z, SCHAUL T, HESSEL M, et al.Dueling network architectures for deep reinforcement learning[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:1995-2003.
[20] HESTER T, VECERIK M, PIETQUIN O, et al.Deep Q-learning from demonstrations[EB/OL].[2021-02-25].https://arxiv.org/abs/1704.03732.
[21] BELLEMARE M G, DABNEY W, MUNOS R.A distributional perspective on reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.06887.
[22] DABNEY W, ROWLAND M, BELLEMARE M G, et al.Distributional reinforcement learning with quantile regression[EB/OL].[2021-02-25].https://arxiv.org/abs/1710.10044.
[23] DABNEY W, OSTROVSKI G, SILVER D, et al.Implicit quantile networks for distributional reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1806.06923.
[24] FORTUNATO M, AZAR M G, PIOT B, et al.Noisy networks for exploration[EB/OL].[2021-02-25].https://arxiv.org/abs/1706.10295.
[25] HESSEL M, MODAYIL J, VAN HASSELT H, et al.Rainbow:combining improvements in deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1710.02298.
[26] WILLIAMS R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning[J].Machine Learning, 1992, 8(3/4):229-256.
[27] DEGRIS T, WHITE M, SUTTON R S.Off-policy actor-critic[EB/OL].[2021-02-25].https://arxiv.org/abs/1205.4839.
[28] SCHULMAN J, MORITZ P, LEVINE S, et al.High-dimensional continuous control using generalized advantage estimation[EB/OL].[2021-02-25].https://arxiv.org/abs/1506.02438.
[29] SCHULMAN J, LEVINE S, ABBEEL P, et al.Trust region policy optimization[C]//Proceedings of 2015 International Conference on Machine Learning.New York, USA:ACM Press, 2015:1889-1897.
[30] SCHULMAN J, WOLSKI F, DHARIWAL P, et al.Proximal policy optimization algorithms[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.06347.
[31] WU Y, MANSIMOV E, LIAO S, et al.Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation[EB/OL].[2021-02-25].https://arxiv.org/abs/1708.05144.
[32] SILVER D, LEVER G, HEESS N, et al.Deterministic policy gradient algorithms[C]//Proceedings of 2014 International Conference on Machine Learning.New York, USA:ACM Press, 2014:387-395.
[33] LILLICRAP T P, HUNT J J, PRITZEL A, et al.Continuous control with deep reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1509.02971.
[34] FUJIMOTO S, VAN HOOF H, MEGER D.Addressing function approximation error in actor-critic methods[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.09477.
[35] MUNOS R, STEPLETON T, HARUTYUNYAN A, et al.Safe and efficient off-policy reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1606.02647.
[36] WANG Z, BAPST V, HEESS N, et al.Sample efficient actor-critic with experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.01224.
[37] HAARNOJA T, ZHOU A, ABBEEL P, et al.Soft actor-critic:off-policy maximum entropy deep reinforcement learning with a stochastic actor[EB/OL].[2021-02-25].https://arxiv.org/abs/1801.01290.
[38] MNIH V, BADIA A P, MIRZA M, et al.Asynchronous methods for deep reinforcement learning[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:1928-1937.
[39] WU Y, MANSIMOV E, LIAO S, et al.OpenAI Baselines:ACKTR & A2C[EB/OL].[2021-02-25].https://openai.com/blog/baselines-acktr-a2c/.
[40] ESPEHOLT L, SOYER H, MUNOS R, et al.IMPALA:scalable distributed deep-RL with importance weighted actor-learner architectures[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.01561.
[41] HORGAN D, QUAN J, BUDDEN D, et al.Distributed prioritized experience replay[EB/OL].[2021-02-25].https://arxiv.org/abs/1803.00933.
[42] RASMUSSEN C E, DEISENROTH M P.Probabilistic inference for fast learning in control[C]//Proceedings of 2008 European Workshop on Reinforcement Learning.Berlin, Germany:Springer, 2008:229-242.
[43] DEISENROTH M P, RASMUSSEN C E.PILCO:a model-based and data-efficient approach to policy search[C]//Proceedings of the 28th International Conference on Machine Learning.New York, USA:ACM Press, 2011:465-472.
[44] MCALLISTER R.Bayesian learning for data-efficient control[D].Cambridge, UK:University of Cambridge, 2017.
[45] ANTHONY T, TIAN Z, BARBER D.Thinking fast and slow with deep learning and tree search[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:5360-5370.
[46] CLAVERA I, ROTHFUSS J, SCHULMAN J, et al.Model-based reinforcement learning via meta-policy optimization[EB/OL].[2021-02-25].https://arxiv.org/abs/1809.05214.
[47] NAGABANDI A, KAHN G, FEARING R S, et al.Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning[C]//Proceedings of 2018 IEEE International Conference on Robotics and Automation.Washington D.C., USA:IEEE Press, 2018:7559-7566.
[48] FEINBERG V, WAN A, STOICA I, et al.Model-based value estimation for efficient model-free reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1803.00101.
[49] BUCKMAN J, HAFNER D, TUCKER G, et al.Sample-efficient reinforcement learning with stochastic ensemble value expansion[C]//Proceedings of 2018 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2018:8224-8234.
[50] KURUTACH T, CLAVERA I, DUAN Y, et al.Model-ensemble trust-region policy optimization[EB/OL].[2021-02-25].https://arxiv.org/abs/1802.10592.
[51] HOUTHOOFT R, CHEN X, DUAN Y, et al.VIME:variational information maximizing exploration[C]//Proceedings of 2016 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:65-74.
[52] BURDA Y, EDWARDS H, PATHAK D, et al.Large-scale study of curiosity-driven learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1808.04355.
[53] PATHAK D, AGRAWAL P, EFROS A A, et al.Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops.Washington D.C., USA:IEEE Press, 2017:488-489.
[54] BURDA Y, EDWARDS H, STORKEY A, et al.Exploration by random network distillation[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.12894.
[55] BELLEMARE M, SRINIVASAN S, OSTROVSKI G, et al.Unifying count-based exploration and intrinsic motivation[C]//Proceedings of 2016 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:1471-1479.
[56] OSTROVSKI G, BELLEMARE M G, OORD A, et al.Count-based exploration with neural density models[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.01310.
[57] TANG H, HOUTHOOFT R, FOOTE D, et al.#Exploration:a study of count-based exploration for deep reinforcement learning[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:2753-2762.
[58] KRISHNAMURTHY R, LAKSHMINARAYANAN A S, KUMAR P, et al.Hierarchical reinforcement learning using spatio-temporal abstractions and deep neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1605.05359.
[59] RAFATI J, NOELLE D C.Learning representations in model-free hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.10096.
[60] SUKHBAATAR S, LIN Z M, KOSTRIKOV I, et al.Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.05407.
[61] VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al.FeUdal networks for hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1703.01161.
[62] NACHUM O, GU S X, LEE H, et al.Data-efficient hierarchical reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1805.08296.
[63] BACON P L, HARB J, PRECUP D.The option-critic architecture[EB/OL].[2021-02-25].https://arxiv.org/abs/1609.05140.
[64] LEVINE S, POPOVIC Z, KOLTUN V.Feature construction for inverse reinforcement learning[C]//Proceedings of 2010 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2010:1-10.
[65] JIN M, DAMIANOU A, ABBEEL P, et al.Inverse reinforcement learning via deep Gaussian process[EB/OL].[2021-02-25].https://arxiv.org/abs/1512.08065.
[66] FINN C, LEVINE S, ABBEEL P.Guided cost learning:deep inverse optimal control via policy optimization[C]//Proceedings of 2016 International Conference on Machine Learning.New York, USA:ACM Press, 2016:49-58.
[67] HO J, ERMON S.Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2016:4572-4580.
[68] PENG X B, KANAZAWA A, TOYER S, et al.Variational discriminator bottleneck:improving imitation learning, inverse RL, and GANs by constraining information flow[EB/OL].[2021-02-25].https://arxiv.org/abs/1810.00821.
[69] RUSU A A, RABINOWITZ N C, DESJARDINS G, et al.Progressive neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1606.04671.
[70] FERNANDO C, BANARSE D, BLUNDELL C, et al.PathNet:evolution channels gradient descent in super neural networks[EB/OL].[2021-02-25].https://arxiv.org/abs/1701.08734.
[71] RUSU A A, COLMENAREJO S G, GULCEHRE C, et al.Policy distillation[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.06295.
[72] PARISOTTO E, BA J L, SALAKHUTDINOV R.Actor-Mimic:deep multitask and transfer reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1511.06342.
[73] SCHAUL T, HORGAN D, GREGOR K, et al.Universal value function approximators[C]//Proceedings of 2015 International Conference on Machine Learning.New York, USA:ACM Press, 2015:1312-1320.
[74] JADERBERG M, MNIH V, CZARNECKI W M, et al.Reinforcement learning with unsupervised auxiliary tasks[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.05397.
[75] ANDRYCHOWICZ M, WOLSKI F, RAY A, et al.Hindsight experience replay[C]//Proceedings of 2017 International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:5048-5058.
[76] DUAN Y, SCHULMAN J, CHEN X, et al.RL2:fast reinforcement learning via slow reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.02779.
[77] MISHRA N, ROHANINEJAD M, CHEN X, et al.A simple neural attentive meta-learner[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.03141.
[78] FAKOOR R, CHAUDHARI P, SOATTO S, et al.Meta-Q-learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1910.00125.
[79] FINN C, ABBEEL P, LEVINE S.Model-agnostic meta-learning for fast adaptation of deep networks[C]//Proceedings of the 34th International Conference on Machine Learning.New York, USA:ACM Press, 2017:1126-1135.
[80] RAKELLY K, ZHOU A, QUILLEN D, et al.Efficient off-policy meta-reinforcement learning via probabilistic context variables[EB/OL].[2021-02-25].https://arxiv.org/abs/1903.08254.
[81] GU S X, LILLICRAP T, GHAHRAMANI Z, et al.Q-Prop:sample-efficient policy gradient with an off-policy critic[EB/OL].[2021-02-25].https://arxiv.org/abs/1611.02247.
[82] NACHUM O, NOROUZI M, XU K, et al.Bridging the gap between value and policy based reinforcement learning[EB/OL].[2021-02-25].https://arxiv.org/abs/1702.08892.
[83] NACHUM O, NOROUZI M, XU K, et al.Trust-PCL:an off-policy trust region method for continuous control[EB/OL].[2021-02-25].https://arxiv.org/abs/1707.01891.
[84] TEH Y W, BAPST V, CZARNECKI W M, et al.Distral:robust multitask reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.Cambridge, USA:MIT Press, 2017:4499-4509.
[85] VAN HASSELT H, GUEZ A, HESSEL M, et al.Learning values across many orders of magnitude[EB/OL].[2021-02-25].https://arxiv.org/abs/1602.07714.