
Computer Engineering ›› 2022, Vol. 48 ›› Issue (2): 106-112. doi: 10.19678/j.issn.1000-3428.0060193

• Artificial Intelligence and Pattern Recognition •

Reinforcement Exploration Strategy Based on Best Sub-Strategy Memory

ZHOU Ruipeng, QIN Jin   

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received: 2020-12-04  Revised: 2021-01-28  Published: 2021-02-01

  • About the authors: ZHOU Ruipeng (born 1995), male, master's student; his research interests include machine learning and reinforcement learning. QIN Jin (corresponding author), associate professor, Ph.D.
  • Funding: National Natural Science Foundation of China (61562009); Science and Technology Foundation of Guizhou Province (Qiankehe Support [2020]3Y004).

Abstract: Existing reinforcement learning exploration strategies suffer from excessive exploration, which slows the convergence of agents. To address this issue, this study designs a reward-sorted storage table (M table) and improves the ε-greedy algorithm, yielding a reinforcement exploration strategy based on best sub-strategy memory. Samples with reward values greater than zero are stored in the M table in the form of sub-strategies and sorted in descending order of reward. During training, sub-strategies in the table are replaced by similar samples with higher reward values, so that the table maintains an action set that effectively produces the current best reward and exploration becomes more targeted rather than purely random. In addition, building on the ε-greedy algorithm, actions are allocated with a certain probability so that the agent obtains the M-Epsilon-Greedy (MEG) exploration strategy by using the M table. Under this strategy, the agent matches the current state against the sub-strategies in the M table with a certain probability; if a match is found, the action of the matched sub-strategy is fed back to the agent, which then executes it. Experimental results show that the proposed strategy effectively alleviates excessive exploration. Compared with DQN-series algorithms and the non-DQN A2C algorithm, it obtains higher average reward values on Atari 2600 game control problems.
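As a rough illustration of the mechanism described in the abstract, the following Python sketch shows how an M table and MEG-style action selection might be organized. The class name MEGExploration, the cosine-similarity matching, the table capacity, and the probability eta of consulting the table are illustrative assumptions, not details taken from the paper.

```python
import random
import numpy as np

class MEGExploration:
    """Illustrative sketch of M-table-based MEG exploration (not the authors' code).

    Sub-strategies (state, action, reward) with reward > 0 are kept in a table
    sorted by reward in descending order; at action-selection time the current
    state is matched against stored states with probability eta, otherwise a
    standard epsilon-greedy choice over the Q-values is made.
    """

    def __init__(self, capacity=100, eta=0.3, similarity_threshold=0.95):
        self.capacity = capacity                    # maximum number of stored sub-strategies
        self.eta = eta                              # assumed probability of consulting the M table
        self.similarity_threshold = similarity_threshold
        self.m_table = []                           # list of (state, action, reward)

    def _similarity(self, s1, s2):
        # Cosine similarity between flattened state vectors (an assumed metric).
        v1, v2 = np.ravel(s1), np.ravel(s2)
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        return float(v1 @ v2 / denom) if denom > 0 else 0.0

    def update(self, state, action, reward):
        """Store a sub-strategy with positive reward, replacing a similar
        lower-reward entry if one exists, and keep the table reward-sorted."""
        if reward <= 0:
            return
        for i, (s, a, r) in enumerate(self.m_table):
            if self._similarity(state, s) >= self.similarity_threshold:
                if reward > r:                      # keep only the better of the two
                    self.m_table[i] = (state, action, reward)
                break
        else:
            self.m_table.append((state, action, reward))
        self.m_table.sort(key=lambda e: e[2], reverse=True)
        del self.m_table[self.capacity:]            # drop the lowest-reward overflow

    def select_action(self, state, q_values, epsilon):
        """MEG action selection: consult the M table with probability eta,
        otherwise fall back to epsilon-greedy over the Q-values."""
        if self.m_table and random.random() < self.eta:
            best = max(self.m_table, key=lambda e: self._similarity(state, e[0]))
            if self._similarity(state, best[0]) >= self.similarity_threshold:
                return best[1]                      # reuse the remembered best action
        if random.random() < epsilon:
            return random.randrange(len(q_values))  # random exploration
        return int(np.argmax(q_values))             # greedy exploitation
```

In a DQN-style training loop, update() would be called after each observed transition and select_action() would replace the plain ε-greedy choice; the exact probability split and similarity measure would need to follow the paper.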

Key words: reinforcement learning, excessive exploration, M-Epsilon-Greedy (MEG) exploration, similarity, best sub-strategy
