
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 302-309. doi: 10.19678/j.issn.1000-3428.0064365

• Development Research and Engineering Application •

Multi-Agent Reinforcement Learning Based on Rational Curiosity in Sparse Scenarios

JIN Zhijun1,2, WANG Hao1,2, FANG Baofu1,2   

  1. School of Computer and Information, Hefei University of Technology, Hefei 230009, China;
    2. Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Hefei University of Technology, Hefei 230009, China
  • Received: 2022-04-02  Revised: 2022-05-10  Published: 2022-05-26

  • About the authors: JIN Zhijun (born 1993), male, M.S., main research interest: reinforcement learning; WANG Hao, professor, Ph.D.; FANG Baofu, associate professor, Ph.D.
  • Supported by: National Natural Science Foundation of China (61872327); Natural Science Foundation of Anhui Province (1708085MF146); Open Fund of the Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China (FZ2020KF07).

Abstract: Reinforcement Learning (RL) is increasingly applied to Multi-Agent Systems (MAS). In RL, the reward signal guides the agent's learning. However, MAS tasks are highly complex, and feedback from the environment may be obtained only upon task completion, resulting in sparse rewards that significantly reduce the convergence speed and efficiency of the algorithm. To address the sparse reward problem, this paper proposes a multi-agent RL method based on rational curiosity. First, inspired by the theory of intrinsic motivation, the idea of curiosity is extended to MAS and a rational curiosity reward mechanism is proposed. This mechanism uses a decompose-and-sum network structure to encode joint states of different permutations into the same feature representation, thereby reducing the exploration space of the joint state, and uses the prediction error of the network as an intrinsic reward that guides the agents toward novel and useful states. On this basis, a double value-function network is introduced to evaluate the Q value, with the target value computed by a minimization operator to alleviate over-estimation bias and variance, and a mean-optimization strategy is adopted to improve sample utilization. Experimental evaluation is performed on a pursuit task and a cooperative navigation task. The results show that, compared with the baseline algorithms on the most difficult pursuit task, the proposed method achieves a win rate roughly 15% higher and requires about 20% fewer time steps, and it also converges faster on the cooperative navigation task.
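The abstract describes two mechanisms: a permutation-invariant decompose-and-sum encoder whose forward-model prediction error serves as an intrinsic curiosity reward, and a double value-function network whose target is computed with a minimization operator. The following PyTorch sketch illustrates these ideas under stated assumptions; the class and function names, network sizes, and hyper-parameters are illustrative choices, not the authors' published implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract.
# All module names, layer sizes, and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RationalCuriosityReward(nn.Module):
    """Permutation-invariant curiosity bonus: a shared per-agent encoder
    (decompose) whose outputs are summed over agents (sum), so joint states
    that differ only by agent ordering map to the same feature; the
    forward-model prediction error is used as the intrinsic reward."""

    def __init__(self, obs_dim: int, n_agents: int, act_dim: int, feat_dim: int = 64):
        super().__init__()
        # Shared encoder applied to each agent's observation.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )
        # Forward model: predict the next joint-state feature from the
        # current feature and the flattened joint action.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_agents * act_dim, 128),
            nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def encode(self, joint_obs: torch.Tensor) -> torch.Tensor:
        # joint_obs: (batch, n_agents, obs_dim) -> (batch, feat_dim)
        # Summing over the agent axis makes the encoding order-invariant.
        return self.encoder(joint_obs).sum(dim=1)

    def intrinsic_reward(self, joint_obs, joint_act, next_joint_obs):
        # joint_act: (batch, n_agents, act_dim)
        phi = self.encode(joint_obs)
        phi_next = self.encode(next_joint_obs).detach()
        pred = self.forward_model(torch.cat([phi, joint_act.flatten(1)], dim=-1))
        # Per-sample prediction error serves as the curiosity bonus.
        return F.mse_loss(pred, phi_next, reduction="none").mean(dim=-1)


def double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """Double value-function target: take the element-wise minimum of the two
    target estimates, as the abstract describes, to curb over-estimation."""
    return r + gamma * (1.0 - done) * torch.min(q1_next, q2_next)
```

In use, the curiosity bonus would be added to the sparse extrinsic reward before computing the Q-learning target, so that exploration is driven even when the environment returns no feedback until task completion.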

Key words: sparse reward, Multi-Agent Systems(MAS), Reinforcement Learning(RL), intrinsic motivation, curiosity


CLC Number: