
Computer Engineering (计算机工程) ›› 2023, Vol. 49 ›› Issue (9): 303-312. doi: 10.19678/j.issn.1000-3428.0067067

• Development Research and Engineering Application •

Intelligent Wargame Deduction Decision Method Based on Deep Reinforcement Learning

Shui HU

  1. Army Command Academy of People's Liberation Army, Nanjing 210000, China
  • Received: 2023-03-01  Online: 2023-09-15  Published: 2023-07-28
  • About the author:

    Shui HU (born 1983), male, associate professor, Ph.D.; his main research interest is intelligent wargame deduction.

Abstract:

Wargame deduction is an important method for cultivating modern military commanders, and introducing artificial intelligence technology into wargame deduction can simplify organizational processes and improve deduction efficiency. Because situational information is highly complex and the information available during deduction is incomplete, intelligent wargames based on machine learning often suffer from reduced sample efficiency in their autonomous decision-making models. This paper proposes an intelligent wargame deduction decision-making method based on deep reinforcement learning. To address the efficiency problem of combat decision-making in intelligent wargame deduction, a baseline is introduced into the policy network to accelerate its training; a derivation and proof are then given, a method for updating the policy network parameters after adding the baseline is proposed, and the process of introducing the state-value function of the wargame environment into the model is analyzed. A Low Advantage Policy-Value Network (LAPVN) model and its training framework are constructed on the basis of the traditional policy-value network for wargame deduction, and the model is built in combination with battlefield situation awareness methods. Experimental results show that, in a wargame combat environment that approximately conforms to military operational rules, when the traditional policy-value network and the LAPVN are trained for comparison over 400 self-play training games, the loss of the LAPVN model decreases from 5.3 to 2.3, it converges faster than the traditional policy-value network, and its KL divergence approaches zero during training.
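As a point of reference only (not the paper's exact derivation), the baseline idea summarized in the abstract is usually formalized as a policy-gradient update in which a state-value estimate is subtracted from the return; the notation below (policy \pi_\theta, value estimate V_\phi, return G_t, advantage A_t, learning rate \alpha) is generic and assumed here rather than taken from the paper:

% Standard policy gradient with a state-value baseline (generic sketch, hedged)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      \bigl(G_t - V_\phi(s_t)\bigr)\right],
\qquad A_t = G_t - V_\phi(s_t)

% Corresponding parameter update after introducing the baseline
\theta \leftarrow \theta + \alpha\, A_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Subtracting the baseline V_\phi(s_t) leaves the gradient estimate unbiased while reducing its variance, which is the general mechanism behind the faster policy-network training reported in the abstract.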

Key words: wargame, situation awareness, deep reinforcement learning, Convolutional Neural Network (CNN), actor-critic method