[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. 2nd ed. Cambridge, USA: MIT Press, 2018.
[2] KEARNS M, SINGH S. Near-optimal reinforcement learning in polynomial time[J]. Machine Learning, 2002, 49(2/3): 209-232.
[3] JAKSCH T, ORTNER R, AUER P. Near-optimal regret bounds for reinforcement learning[J]. Journal of Machine Learning Research, 2010, 11: 1563-1600.
[4] MONTAGUE P R. Reinforcement learning: an introduction, by Sutton, R. S. and Barto, A. G.[J]. Trends in Cognitive Sciences, 1999, 3(9): 360.
[5] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3/4): 229-256.
[6] GEIST M, PIETQUIN O. Managing uncertainty within value function approximation in reinforcement learning[C]//Proceedings of the Active Learning and Experimental Design Workshop. Sardinia, Italy: [s.n.], 2010: 92.
[7] BELLEMARE M G, SRINIVASAN S, OSTROVSKI G, et al. Unifying count-based exploration and intrinsic motivation[J]. Advances in Neural Information Processing Systems, 2016, 29: 1471-1479.
[8] OSTROVSKI G, BELLEMARE M G, VAN DEN OORD A, et al. Count-based exploration with neural density models[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 2721-2730.
[9] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 16-17.
[10] PLAPPERT M, HOUTHOOFT R, DHARIWAL P, et al. Parameter space noise for exploration[EB/OL]. (2017-06-06)[2020-09-10]. https://arxiv.org/pdf/1706.01905.pdf.
[11] FORTUNATO M, AZAR M G, PIOT B, et al. Noisy networks for exploration[EB/OL]. (2017-06-30)[2020-09-10]. https://arxiv.org/pdf/1706.10295v1.pdf.
[12] ZHANG X, MA Y, SINGLA A. Task-agnostic exploration in reinforcement learning[EB/OL]. (2020-06-16)[2020-09-10]. https://arxiv.org/pdf/2006.09497v1.pdf.
[13] COMANICI G, PRECUP D. Optimal policy switching algorithms for reinforcement learning[C]//Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems. Richland, USA: IFAAMAS, 2010: 709-714.
[14] VAN OTTERLO M, WIERING M. Reinforcement learning and Markov decision processes[M]//WIERING M, VAN OTTERLO M. Reinforcement learning: state-of-the-art. Berlin, Germany: Springer, 2012: 3-42.
[15] FRANÇOIS-LAVET V, HENDERSON P, ISLAM R, et al. An introduction to deep reinforcement learning[J]. Foundations and Trends in Machine Learning, 2018, 11(3/4): 219-354.
[16] SHAO K, TANG Z, ZHU Y, et al. A survey of deep reinforcement learning in video games[EB/OL]. (2019-12-23)[2020-09-10]. https://arxiv.org/pdf/1912.10944.pdf.
[17] HAARNOJA T, PONG V, ZHOU A, et al. Composable deep reinforcement learning for robotic manipulation[C]//Proceedings of the IEEE International Conference on Robotics and Automation. Washington D.C., USA: IEEE Press, 2018: 6244-6251.
[18] WOLF T, DEBUT L, SANH V, et al. HuggingFace's Transformers: state-of-the-art natural language processing[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. [S.l.]: Association for Computational Linguistics, 2020: 38-45.
[19] ESTEVA A, ROBICQUET A, RAMSUNDAR B, et al. A guide to deep learning in healthcare[J]. Nature Medicine, 2019, 25(1): 24-29.
[20] YANG D, ZHAO L, LIN Z, et al. Fully parameterized quantile function for distributional reinforcement learning[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, USA: Curran Associates, 2019: 6193-6202.
[21] DABNEY W, OSTROVSKI G, SILVER D, et al. Implicit quantile networks for distributional reinforcement learning[C]//Proceedings of the 35th International Conference on Machine Learning. [S.l.]: PMLR, 2018: 1096-1105.
[22] DABNEY W, ROWLAND M, BELLEMARE M G, et al. Distributional reinforcement learning with quantile regression[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. [S.l.]: AAAI Press, 2018: 2892-2901.
[23] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: PMLR, 2016: 1928-1937.
[24] ZHANG H, CHEN H, XIAO C, et al. Robust deep reinforcement learning against adversarial perturbations on state observations[C]//Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook, USA: Curran Associates, 2020: 1-14.
[25] TOROMANOFF M, WIRBEL E, MOUTARDE F. Is deep reinforcement learning really superhuman on Atari?[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver, Canada: [s.n.], 2019: 1-5.