
Computer Engineering, 2022, Vol. 48, Issue (5): 74-81. doi: 10.19678/j.issn.1000-3428.0061437

• Artificial Intelligence and Pattern Recognition •

A Method for Multi-Agent Cooperation Based on Multi-Step Dueling Network

LI Zifan, WANG Hao, FANG Baofu   

  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Received: 2021-04-25   Revised: 2021-05-31   Published: 2021-06-02

  • About the authors: LI Zifan (born 1996), male, M.S. candidate; his main research interest is multi-agent deep reinforcement learning. WANG Hao, Ph.D., professor and doctoral supervisor. FANG Baofu, Ph.D., associate professor.
  • Supported by: National Natural Science Foundation of China (61876206); Fundamental Research Funds for the Central Universities (ACAIM190102); Natural Science Foundation of Anhui Province (1708085MF146); Open Fund of the Key Laboratory of Flight Techniques and Flight Safety, CAAC (FZ2020KF15).

Abstract: Efficient multi-agent cooperation is an important goal of Multi-Agent Deep Reinforcement Learning (MADRL); however, environmental non-stationarity and the curse of dimensionality in multi-agent decision-making systems make this goal difficult to achieve. Existing value-decomposition methods strike a good balance between environment stationarity and agent scalability. Nevertheless, some of these methods disregard the importance of the agent policy network and do not fully exploit the complete historical trajectories stored in the experience replay buffer when learning the joint action-value function. Hence, a method for multi-agent cooperation based on a Multi-agent Multi-step Dueling Network (MMDN) is proposed. First, action estimation and state estimation are decoupled during training through an independent agent network and a value network, and the temporal-difference target is estimated via multi-step learning over the entire historical trajectory. Second, decentralized multi-agent cooperation policies are trained in a centralized, end-to-end manner by optimizing a mixing network that approximates the joint action-value function. Experimental results show that the average win rate of this method in six scenarios exceeds those of multi-agent cooperation methods based on the Value-Decomposition Network (VDN), QMIX, QTRAN, and Counterfactual Multi-Agent (COMA) policy gradient, while also achieving faster convergence and better stability.
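For orientation, the following PyTorch sketch illustrates the three ingredients named in the abstract: a dueling per-agent network that decouples the state (value) estimate from the action (advantage) estimate, a multi-step temporal-difference target, and a mixing network that combines per-agent utilities into a joint action value trained end-to-end. The class names, the additive mixer, and all hyperparameters are illustrative assumptions for a minimal sketch, not the authors' MMDN implementation.

# A minimal sketch of a dueling agent network, an n-step TD target, and a
# simple (VDN-style) mixing network; sizes and names are illustrative only.
import torch
import torch.nn as nn


class DuelingAgentNet(nn.Module):
    """Per-agent network: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # state estimate V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # action estimate A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.encoder(obs)
        v = self.value_head(h)
        a = self.advantage_head(h)
        return v + a - a.mean(dim=-1, keepdim=True)


class AdditiveMixer(nn.Module):
    """Toy mixing network: the joint Q value is a learned non-negative
    weighted sum of per-agent Q values (a stand-in for the paper's mixer)."""

    def __init__(self, n_agents: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_agents))

    def forward(self, agent_qs: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) values of the chosen actions
        return (agent_qs * torch.relu(self.weights)).sum(dim=-1)


def n_step_td_target(rewards: torch.Tensor, bootstrap_q: torch.Tensor,
                     gamma: float = 0.99) -> torch.Tensor:
    """Multi-step target: sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n Q(s_{t+n})."""
    n = rewards.shape[-1]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    return (rewards * discounts).sum(dim=-1) + (gamma ** n) * bootstrap_q


if __name__ == "__main__":
    batch, n_agents, obs_dim, n_actions, n_steps = 4, 3, 10, 5, 3
    agents = [DuelingAgentNet(obs_dim, n_actions) for _ in range(n_agents)]
    mixer = AdditiveMixer(n_agents)

    obs = torch.randn(batch, n_agents, obs_dim)
    actions = torch.randint(0, n_actions, (batch, n_agents))
    # Per-agent Q values of the chosen actions, combined by the mixer.
    agent_qs = torch.stack(
        [agents[i](obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
         for i in range(n_agents)], dim=1)
    q_joint = mixer(agent_qs)                        # centralized joint value

    rewards = torch.randn(batch, n_steps)            # shared team reward over n steps
    target = n_step_td_target(rewards, bootstrap_q=torch.zeros(batch))
    loss = nn.functional.mse_loss(q_joint, target)   # end-to-end TD loss
    loss.backward()
    print(f"joint Q shape: {tuple(q_joint.shape)}, loss: {loss.item():.4f}")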

Key words: multi-agent cooperation, Deep Reinforcement Learning (DRL), value decomposition, multi-step dueling network, action-value function

