
Computer Engineering, 2023, Vol. 49, Issue (1): 31-40. doi: 10.19678/j.issn.1000-3428.0063303

• Hot Topics and Reviews •

Multi-Goal Multi-Agent Deep Reinforcement Learning Method Based on Value Decomposition

SONG Jian, WANG Zilei   

  1. Department of Automation, University of Science and Technology of China, Hefei 230027, China
  • Received: 2021-11-21  Revised: 2022-02-12  Published: 2022-07-04
  • About the authors: SONG Jian (born 1996), male, master's student; his main research interest is multi-agent reinforcement learning. WANG Zilei, associate professor.
  • Supported by:
    Key Program of the National Natural Science Foundation of China, "Video Semantic Analysis Models and Learning Methods Guided by Spatio-Temporal Structural Knowledge" (No. 62176246).

Abstract: Multi-agent deep reinforcement learning methods can be applied to real-world scenarios that require cooperation among multiple parties, and they are a research hotspot in the field of reinforcement learning. In multi-goal multi-agent cooperation scenarios, the agents are linked by a complex mixture of cooperation and competition. The performance of a multi-agent reinforcement learning method in such scenarios depends on whether it can adequately assess the relationships among agents and distinguish cooperative from competitive actions, while practical difficulties such as high-dimensional data processing and algorithmic efficiency must also be addressed. For multi-goal multi-agent cooperation scenarios, this paper proposes a goal-based value-decomposition deep reinforcement learning method built on the QMIX model. The method uses an attention mechanism to measure the group influence among agents and exploits the agents' goal information to perform a two-stage value decomposition, which improves the characterization of complex agent relationships and thus the performance of reinforcement learning in these scenarios. Experimental results show that, compared with QMIX, the proposed method matches its score on the StarCraft II micromanagement platform, scores 4.9 points higher on average in the checker board game, and scores 25 and 280.4 points higher on average in the merge and cross multi-particle environments, respectively; it also achieves higher scores and better overall performance than other mainstream deep reinforcement learning methods.
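To make the core idea concrete, the following is a minimal PyTorch sketch of a QMIX-style mixing network in which attention over per-agent goal embeddings reweights the agents' utilities before the monotonic mix. It is an illustrative reconstruction from the abstract only, not the authors' implementation: the class name GoalAttentionMixer, the goal_dim and embed_dim parameters, and the exact placement of the attention stage are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalAttentionMixer(nn.Module):
    """Hypothetical sketch: attention over agent goal embeddings (stage 1)
    weights per-agent utilities before a QMIX-style monotonic mix (stage 2)."""

    def __init__(self, n_agents, state_dim, goal_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Stage 1: attention over goal embeddings scores inter-agent influence.
        self.q_proj = nn.Linear(goal_dim, embed_dim)
        self.k_proj = nn.Linear(goal_dim, embed_dim)
        # Stage 2: QMIX-style hypernetworks conditioned on the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state, goals):
        # agent_qs: (B, n_agents)           per-agent utilities Q_i
        # state:    (B, state_dim)          global state
        # goals:    (B, n_agents, goal_dim) per-agent goal features
        q = self.q_proj(goals)                                   # (B, N, E)
        k = self.k_proj(goals)                                   # (B, N, E)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.embed_dim ** 0.5
        attn = F.softmax(scores, dim=-1)                         # goal affinity
        # Stage 1: reweight utilities by aggregated goal influence.
        agent_qs = torch.bmm(attn, agent_qs.unsqueeze(-1)).squeeze(-1)
        # Stage 2: monotonic mix into Q_tot; abs() keeps the mixing weights
        # non-negative, preserving monotonicity in each agent's utility.
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (B, 1, E)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1, 1)            # Q_tot

The abs() on the hypernetwork outputs follows QMIX's monotonicity constraint, which ensures that greedy per-agent action selection is consistent with maximizing the joint value Q_tot.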

Key words: deep reinforcement learning, multi-agent, multi-goal, value decomposition, attention mechanism

CLC Number: