
Computer Engineering ›› 2023, Vol. 49 ›› Issue (7): 118-124. doi: 10.19678/j.issn.1000-3428.0065296

• Artificial Intelligence and Pattern Recognition •

Reinforcement Exploration Method to Keep Away from Old Areas and Avoid Loops

Lijiao CAI1, Jin QIN1,*, Shuang CHEN2

  1. State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
    2. Guizhou Door To Time Science and Technology Co., Ltd., Guiyang 550025, China
  • Received: 2022-07-20 Online: 2023-07-15 Published: 2023-07-14
  • Corresponding author: Jin QIN
  • About the authors:

    CAI Lijiao (1995-), female, M.S. candidate; her main research interest is reinforcement learning.

    CHEN Shuang, senior engineer, M.S.

  • Funding:
    Science and Technology Program of Guizhou Province (黔科合基础[2020]1Y275); Science and Technology Program of Guizhou Province (黔科合支撑[2020]3Y004)



Abstract:

In intrinsic-motivation-oriented exploratory reinforcement learning, intrinsic rewards are typically generated from the agent's familiarity with states. A suitable approximate measure of familiarity is difficult to obtain, and such long-term cumulative measures ignore the role a state plays within its own episode. The Anchor method replaces the subgoals of hierarchical reinforcement learning with anchors and encourages the agent to explore areas far from the anchors. Inspired by this, an intrinsic reward function is designed from the distance between the next state and the historical states of the same episode, and a reinforcement exploration method to keep Away from old Areas and Avoid Loops (AAAL) is proposed. A subset of the historical states of the current episode is treated as an area, which is periodically refreshed to the set of most recently visited states; the agent receives an intrinsic reward based on the minimum distance between the next state and this area, so that it keeps away from the most recently visited old area. The consecutive predecessor states of the next state form a window of fixed size, and an intrinsic reward is allocated according to the length of the shortest loop within the window that ends at the next state, so that the agent avoids walking in loops. Experimental results in the classic sparse-reward MiniGrid environments show that the AAAL method requires no measure of familiarity with states, explores the environment with the episode as its cycle, and effectively improves the exploration ability of the agent.
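The abstract describes the two intrinsic reward terms only in words; the minimal Python sketch below shows one way the episode-level bookkeeping could be organized. It is not the authors' implementation: the class name AAALIntrinsicReward, the Euclidean distance on state encodings, the hyperparameters (area_size, update_period, window_size, the two coefficients), and the reciprocal loop penalty are all illustrative assumptions. Only the overall structure follows the description above: a periodically refreshed set of recently visited states rewarded by minimum distance, and a penalty for the shortest loop closed inside a fixed window of predecessor states.

```python
import numpy as np


class AAALIntrinsicReward:
    """Sketch of an episode-level intrinsic reward in the spirit of AAAL.

    Two bonus terms per transition:
      1. "away from old areas": minimum distance between the next state and a
         periodically refreshed set ("area") of recently visited states.
      2. "avoid loops": penalty when the next state closes a short loop inside
         a window of its most recent predecessor states.
    """

    def __init__(self, area_size=16, update_period=16, window_size=8,
                 area_coef=1.0, loop_coef=1.0):
        # Hyperparameter values are illustrative; the paper's settings may differ.
        self.area_size = area_size          # number of states kept in the old area
        self.update_period = update_period  # steps between area refreshes
        self.window_size = window_size      # predecessor window for loop detection
        self.area_coef = area_coef
        self.loop_coef = loop_coef
        self.reset()

    def reset(self):
        """Call at the start of every episode."""
        self.history = []                   # states visited in the current episode
        self.area = []                      # most recently refreshed old area
        self.steps = 0

    @staticmethod
    def _distance(s1, s2):
        # Assumed metric: Euclidean distance between state encodings
        # (e.g. (x, y) positions in MiniGrid); the paper may use another metric.
        return float(np.linalg.norm(np.asarray(s1, dtype=float)
                                    - np.asarray(s2, dtype=float)))

    def _area_bonus(self, next_state):
        if not self.area:
            return 0.0
        # Reward the minimum distance to the old area: larger when the agent
        # has moved far away from the recently visited states.
        return self.area_coef * min(self._distance(next_state, s) for s in self.area)

    def _loop_penalty(self, next_state):
        # Only the last `window_size` predecessor states are inspected.
        window = self.history[-self.window_size:]
        loop_lengths = [len(window) - i for i, s in enumerate(window)
                        if np.array_equal(np.asarray(s), np.asarray(next_state))]
        if not loop_lengths:
            return 0.0
        # The shorter the loop closed at next_state, the larger the penalty
        # (reciprocal form is an assumption).
        return -self.loop_coef / min(loop_lengths)

    def step(self, next_state):
        """Return the intrinsic reward for the transition into `next_state`."""
        bonus = self._area_bonus(next_state) + self._loop_penalty(next_state)
        self.history.append(next_state)
        self.steps += 1
        # Periodically refresh the area with the most recently visited states.
        if self.steps % self.update_period == 0:
            self.area = list(self.history[-self.area_size:])
        return bonus
```

In use, the value returned by step would be scaled and added to the environment reward before the policy update, and reset would be called at every episode boundary so that both the area and the loop window are confined to a single episode, matching the episode-level design described in the abstract.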

Key words: deep reinforcement learning, sparse reward task, intrinsic reward, old area, loop