
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 302-309. doi: 10.19678/j.issn.1000-3428.0064365

• Development Research and Engineering Application •

Multi-Agent Reinforcement Learning Based on Rational Curiosity in Sparse Scenarios

JIN Zhijun1,2, WANG Hao1,2, FANG Baofu1,2   

  1. School of Computer and Information, Hefei University of Technology, Hefei 230009, China;
    2. Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Hefei University of Technology, Hefei 230009, China
  • Received: 2022-04-02  Revised: 2022-05-10  Published: 2022-05-26

  • About the authors: JIN Zhijun (born 1993), male, M.S., main research interest: reinforcement learning; WANG Hao, professor, Ph.D.; FANG Baofu, associate professor, Ph.D.
  • Supported by: National Natural Science Foundation of China (61872327); Natural Science Foundation of Anhui Province (1708085MF146); Open Fund of the Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China (FZ2020KF07).

Abstract: Reinforcement Learning (RL) is increasingly applied to Multi-Agent Systems (MAS). In RL, the reward signal guides the agent's learning. However, MAS tasks are highly complex, and feedback from the environment may be obtained only upon task completion, resulting in sparse rewards that significantly reduce the convergence speed and efficiency of the algorithm. To address the sparse reward problem, this paper proposes a multi-agent RL method based on rational curiosity. First, inspired by the theory of intrinsic motivation, the idea of curiosity is extended to MAS and a rational curiosity reward mechanism is proposed. This mechanism uses a decompose-and-sum network structure to encode joint states of different permutations into the same feature representation, thereby reducing the exploration space of the joint state, and uses the prediction error of the network as an intrinsic reward that guides the agents toward novel and useful states. On this basis, a double value-function network is introduced to evaluate the Q value, with the target value computed by a minimization operator to alleviate over-estimation bias and variance, and a mean-optimization strategy is adopted to improve sample utilization. Experimental evaluation is performed on a pursuit task and a cooperative navigation task. The results show that, compared with the baseline algorithms on the most difficult pursuit task, the proposed method achieves a win rate roughly 15% higher and requires about 20% fewer time steps, and it also converges faster on the cooperative navigation task.
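The abstract describes two mechanisms: a permutation-invariant decompose-and-sum encoder whose forward-model prediction error serves as an intrinsic curiosity reward, and a double value-function network whose target is computed with a minimization operator. The following PyTorch sketch illustrates these ideas under stated assumptions; the class and function names, network sizes, and hyper-parameters are illustrative choices, not the authors' published implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract.
# All module names, layer sizes, and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RationalCuriosityReward(nn.Module):
    """Permutation-invariant curiosity bonus: a shared per-agent encoder
    (decompose) whose outputs are summed over agents (sum), so joint states
    that differ only by agent ordering map to the same feature; the
    forward-model prediction error is used as the intrinsic reward."""

    def __init__(self, obs_dim: int, n_agents: int, act_dim: int, feat_dim: int = 64):
        super().__init__()
        # Shared encoder applied to each agent's observation.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )
        # Forward model: predict the next joint-state feature from the
        # current feature and the flattened joint action.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_agents * act_dim, 128),
            nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def encode(self, joint_obs: torch.Tensor) -> torch.Tensor:
        # joint_obs: (batch, n_agents, obs_dim) -> (batch, feat_dim)
        # Summing over the agent axis makes the encoding order-invariant.
        return self.encoder(joint_obs).sum(dim=1)

    def intrinsic_reward(self, joint_obs, joint_act, next_joint_obs):
        # joint_act: (batch, n_agents, act_dim)
        phi = self.encode(joint_obs)
        phi_next = self.encode(next_joint_obs).detach()
        pred = self.forward_model(torch.cat([phi, joint_act.flatten(1)], dim=-1))
        # Per-sample prediction error serves as the curiosity bonus.
        return F.mse_loss(pred, phi_next, reduction="none").mean(dim=-1)


def double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """Double value-function target: take the element-wise minimum of the two
    target estimates, as the abstract describes, to curb over-estimation."""
    return r + gamma * (1.0 - done) * torch.min(q1_next, q2_next)
```

In use, the curiosity bonus would be added to the sparse extrinsic reward before computing the Q-learning target, so that exploration is driven even when the environment returns no feedback until task completion.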

Key words: sparse reward, Multi-Agent Systems(MAS), Reinforcement Learning(RL), intrinsic motivation, curiosity


CLC Number: