
Computer Engineering ›› 2023, Vol. 49 ›› Issue (7): 118-124. doi: 10.19678/j.issn.1000-3428.0065296

• Artificial Intelligence and Pattern Recognition •

Reinforcement Exploration Method to Keep Away from Old Areas and Avoid Loops

Lijiao CAI1, Jin QIN1,*, Shuang CHEN2

  1. State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
    2. Guizhou Door To Time Science and Technology Co., Ltd., Guiyang 550025, China
  • Received: 2022-07-20 Online: 2023-07-15 Published: 2023-07-14
  • Corresponding author: Jin QIN
  • About the authors:

    CAI Lijiao (1995-), female, M.S. candidate; her main research interest is reinforcement learning.

    CHEN Shuang, senior engineer, M.S.

  • Funding:
    Science and Technology Program of Guizhou Province (黔科合基础[2020]1Y275); Science and Technology Program of Guizhou Province (黔科合支撑[2020]3Y004)



Abstract:

In intrinsic-motivation-oriented exploratory reinforcement learning, intrinsic rewards are typically generated from the agent's familiarity with states. A suitable approximate measure of familiarity is difficult to obtain, and such long-term cumulative measures ignore the role a state plays within its own episode. The Anchor method replaces the subgoals of hierarchical reinforcement learning with anchors and encourages the agent to explore areas far from the anchors. Inspired by this, an intrinsic reward function is designed from the distance between the next state and the historical states of the same episode, and a reinforcement exploration method to keep Away from old Areas and Avoid Loops (AAAL) is proposed. A subset of the historical states of the current episode is treated as an area, which is periodically refreshed to the set of most recently visited states; the agent receives an intrinsic reward based on the minimum distance between the next state and this area, so that it keeps away from the most recently visited old area. The consecutive predecessor states of the next state form a window of fixed size, and an intrinsic reward is allocated according to the length of the shortest loop within the window that ends at the next state, so that the agent avoids walking in loops. Experimental results in the classic sparse-reward MiniGrid environments show that the AAAL method requires no measure of familiarity with states, explores the environment with the episode as its cycle, and effectively improves the exploration ability of the agent.
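The abstract describes the two intrinsic reward terms only in words; the minimal Python sketch below shows one way the episode-level bookkeeping could be organized. It is not the authors' implementation: the class name AAALIntrinsicReward, the Euclidean distance on state encodings, the hyperparameters (area_size, update_period, window_size, the two coefficients), and the reciprocal loop penalty are all illustrative assumptions. Only the overall structure follows the description above: a periodically refreshed set of recently visited states rewarded by minimum distance, and a penalty for the shortest loop closed inside a fixed window of predecessor states.

```python
import numpy as np


class AAALIntrinsicReward:
    """Sketch of an episode-level intrinsic reward in the spirit of AAAL.

    Two bonus terms per transition:
      1. "away from old areas": minimum distance between the next state and a
         periodically refreshed set ("area") of recently visited states.
      2. "avoid loops": penalty when the next state closes a short loop inside
         a window of its most recent predecessor states.
    """

    def __init__(self, area_size=16, update_period=16, window_size=8,
                 area_coef=1.0, loop_coef=1.0):
        # Hyperparameter values are illustrative; the paper's settings may differ.
        self.area_size = area_size          # number of states kept in the old area
        self.update_period = update_period  # steps between area refreshes
        self.window_size = window_size      # predecessor window for loop detection
        self.area_coef = area_coef
        self.loop_coef = loop_coef
        self.reset()

    def reset(self):
        """Call at the start of every episode."""
        self.history = []                   # states visited in the current episode
        self.area = []                      # most recently refreshed old area
        self.steps = 0

    @staticmethod
    def _distance(s1, s2):
        # Assumed metric: Euclidean distance between state encodings
        # (e.g. (x, y) positions in MiniGrid); the paper may use another metric.
        return float(np.linalg.norm(np.asarray(s1, dtype=float)
                                    - np.asarray(s2, dtype=float)))

    def _area_bonus(self, next_state):
        if not self.area:
            return 0.0
        # Reward the minimum distance to the old area: larger when the agent
        # has moved far away from the recently visited states.
        return self.area_coef * min(self._distance(next_state, s) for s in self.area)

    def _loop_penalty(self, next_state):
        # Only the last `window_size` predecessor states are inspected.
        window = self.history[-self.window_size:]
        loop_lengths = [len(window) - i for i, s in enumerate(window)
                        if np.array_equal(np.asarray(s), np.asarray(next_state))]
        if not loop_lengths:
            return 0.0
        # The shorter the loop closed at next_state, the larger the penalty
        # (reciprocal form is an assumption).
        return -self.loop_coef / min(loop_lengths)

    def step(self, next_state):
        """Return the intrinsic reward for the transition into `next_state`."""
        bonus = self._area_bonus(next_state) + self._loop_penalty(next_state)
        self.history.append(next_state)
        self.steps += 1
        # Periodically refresh the area with the most recently visited states.
        if self.steps % self.update_period == 0:
            self.area = list(self.history[-self.area_size:])
        return bonus
```

In use, the value returned by step would be scaled and added to the environment reward before the policy update, and reset would be called at every episode boundary so that both the area and the loop window are confined to a single episode, matching the episode-level design described in the abstract.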

Key words: deep reinforcement learning, sparse reward task, intrinsic reward, old area, loop