Intelligent Wargame Deduction Decision Method Based on Deep Reinforcement Learning

doi:10.19678/j.issn.1000-3428.0067067

Abstract

Wargame deduction is an important method for training modern military commanders. Introducing artificial intelligence into wargame deduction can simplify its organization and improve its efficiency. Because situational information is complex and the information available during deduction is incomplete, machine-learning-based intelligent wargames often suffer from low sample efficiency in their autonomous decision-making models. This paper proposes an intelligent wargame deduction decision-making method based on deep reinforcement learning. To address the efficiency of combat decision-making in intelligent wargame deduction, a baseline is introduced into the policy network, which accelerates its training. The parameter update rule for the policy network with the baseline is then derived and proved, and the process of introducing the state-value function of the wargame deduction environment into the model is analyzed. Building on the traditional policy-value network, a Low Advantage Policy-Value Network (LAPVN) model and its training framework are constructed for wargame deduction, with the model built using battlefield situational awareness methods. In a wargame combat experimental environment that approximately conforms to military operational rules, the traditional policy-value network and LAPVN are trained for comparison. Over 400 self-play training sessions, the loss value of the LAPVN model decreases from 5.3 to 2.3, and LAPVN converges faster than the traditional policy-value network. The KL divergence of the LAPVN model stays close to zero throughout training.

Key words: wargame, situation awareness, deep reinforcement learning, Convolutional Neural Network (CNN), actor-critic method
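
The baseline idea summarized in the abstract can be illustrated with a small numerical sketch. This is not the LAPVN model or the authors' update rule; it assumes a toy softmax policy over three actions with made-up action values, and it only shows the general property that subtracting a state-value baseline leaves the expected policy gradient unchanged while shrinking the spread of the sampled gradient. All names (`softmax`, `score`, `gradient_stats`, `q_values`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(action, probs):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def gradient_stats(probs, q_values, baseline):
    """Exact mean and second moment of the sampled policy-gradient term.

    The sampled term for action a is (q_values[a] - baseline) * score(a).
    """
    mean = [0.0] * len(probs)
    second_moment = 0.0
    for a, p in enumerate(probs):
        g = [(q_values[a] - baseline) * s for s in score(a, probs)]
        mean = [m + p * gi for m, gi in zip(mean, g)]
        second_moment += p * sum(gi * gi for gi in g)
    return mean, second_moment


# Toy setup (assumed values, not taken from the paper).
probs = softmax([0.0, 0.0, 0.0])
q_values = [5.0, 6.0, 7.0]

# State value under the current policy, used here as the baseline.
state_value = sum(p * q for p, q in zip(probs, q_values))

mean_plain, spread_plain = gradient_stats(probs, q_values, baseline=0.0)
mean_base, spread_base = gradient_stats(probs, q_values, baseline=state_value)

print("expected gradient without baseline:", mean_plain)
print("expected gradient with baseline:   ", mean_base)
print("second moment without / with baseline:", spread_plain, spread_base)
```

With the baseline set to the state value, the weight on each sampled action is its advantage, which is the quantity the term "low advantage" in the model name refers to.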

Shui HU. Intelligent Wargame Deduction Decision Method Based on Deep Reinforcement Learning[J]. Computer Engineering, 2023, 49(9): 303-312.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067067

https://www.ecice06.com/EN/Y2023/V49/I9/303

Figures/Tables 12

Fig.1 Framework of intelligent wargame deduction decision based on deep reinforcement learning

Fig.2 Training framework of low advantage policy-value network

Fig.3 Input matrix of low advantage policy-value network

Fig.4 Schematic diagram of battlefield environment

Fig.5 Decrease trend of loss values between low advantage policy-value network and traditional policy-value network

Fig.6 Trend of KL divergence

Fig.7 Change trend of win-loss ratio of MCTS

Fig.8 Game adversaries among different networks
