
Computer Engineering, 2022, Vol. 48, Issue (5): 74-81. doi: 10.19678/j.issn.1000-3428.0061437

• Artificial Intelligence and Pattern Recognition •

A Method for Multi-Agent Cooperation Based on Multi-Step Dueling Network

LI Zifan, WANG Hao, FANG Baofu

  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Received: 2021-04-25  Revised: 2021-05-31  Published: 2021-06-02
  • About the authors: LI Zifan (born 1996), male, M.S. candidate, whose main research interest is multi-agent deep reinforcement learning; WANG Hao, Ph.D., professor and doctoral supervisor; FANG Baofu, Ph.D., associate professor.
  • Funding:
    National Natural Science Foundation of China (61876206); Fundamental Research Funds for the Central Universities (ACAIM190102); Natural Science Foundation of Anhui Province (1708085MF146); Open Fund of the Key Laboratory of Flight Techniques and Flight Safety, CAAC (FZ2020KF15).

Abstract: Efficient multi-agent cooperation is an important goal of Multi-Agent Deep Reinforcement Learning (MADRL); however, environmental non-stationarity and the curse of dimensionality in multi-agent decision-making systems make this goal difficult to achieve. Existing value-decomposition methods strike a good balance between environmental stationarity and agent scalability; nevertheless, some of them disregard the importance of the agent policy network and do not fully exploit the complete historical trajectories stored in the experience replay buffer when learning the joint action-value function. Hence, a method for multi-agent cooperation based on a Multi-agent Multi-step Dueling Network (MMDN) is proposed. First, during training, action estimation and state estimation are decoupled through an agent network and a value network, and the temporal-difference target is estimated by multi-step learning over the entire historical trajectory. Second, decentralized multi-agent cooperation policies are trained centrally and end-to-end by optimizing a mixing network that approximates the joint action-value function. Experimental results show that the average win rate of this method in six scenarios is higher than that of multi-agent cooperation methods based on the Value-Decomposition Network (VDN), QMIX, QTRAN, and the Counterfactual Multi-Agent (COMA) policy gradient, and that it converges faster and is more stable.
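
The abstract outlines three technical components (a dueling split of each agent's utility, a multi-step temporal-difference target computed along the stored trajectory, and a mixing network over per-agent utilities) without stating their equations. A minimal sketch of the conventional forms these components usually take is given below, assuming the standard Dueling-DQN aggregation, n-step return, and VDN/QMIX-style mixing; the symbols Q_i, V_i, A_i, y_t^{(n)}, f_mix, and \theta^- are introduced here for illustration and need not match the paper's exact notation or formulation.

  % Dueling aggregation for agent i over its action-observation history \tau_i (standard Dueling-DQN form)
  Q_i(\tau_i, a_i) = V_i(\tau_i) + A_i(\tau_i, a_i) - \frac{1}{|\mathcal{A}|} \sum_{a'} A_i(\tau_i, a')

  % n-step temporal-difference target built from rewards stored along the trajectory (\theta^- are target-network parameters)
  y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{\mathbf{a}} Q_{tot}\bigl(\boldsymbol{\tau}_{t+n}, \mathbf{a}; \theta^-\bigr)

  % Joint action value as a state-conditioned mixture of per-agent utilities; VDN uses a plain sum, QMIX a monotonic mixing network
  Q_{tot}(\boldsymbol{\tau}, \mathbf{a}; \theta) = f_{mix}\bigl(Q_1(\tau_1, a_1), \ldots, Q_N(\tau_N, a_N); s\bigr)

Under these conventions the training loss would be the squared error between Q_tot(\tau_t, \mathbf{a}_t; \theta) and y_t^{(n)}, optimized end-to-end through the mixer, so that at execution time each agent can act greedily on its own Q_i without access to the global state.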

Key words: multi-agent cooperation, Deep Reinforcement Learning (DRL), value decomposition, multi-step dueling network, action-value function

CLC number: