
Computer Engineering (计算机工程) ›› 2023, Vol. 49 ›› Issue (9): 303-312. doi: 10.19678/j.issn.1000-3428.0067067

• Development Research and Engineering Application •

Intelligent Wargame Deduction Decision Method Based on Deep Reinforcement Learning

Shui HU

  1. Army Command Academy of People's Liberation Army, Nanjing 210000, China
  • Received: 2023-03-01  Online: 2023-09-15  Published: 2023-07-28
  • About the author:

    Shui HU (born 1983), male, associate professor, Ph.D.; his main research interest is intelligent wargame deduction.

Abstract:

Wargame deduction is an important method for cultivating modern military commanders, and introducing artificial intelligence technology into wargame deduction can simplify organizational processes and improve deduction efficiency. Because situational information is highly complex and the information available during deduction is incomplete, intelligent wargames based on machine learning often suffer from reduced sample efficiency in their autonomous decision-making models. This paper proposes an intelligent wargame deduction decision-making method based on deep reinforcement learning. To address the efficiency problem of combat decision-making in intelligent wargame deduction, a baseline is introduced into the policy network to accelerate its training; a derivation and proof are then given, a method for updating the policy network parameters after adding the baseline is proposed, and the process of introducing the state-value function of the wargame environment into the model is analyzed. A Low Advantage Policy-Value Network (LAPVN) model and its training framework are constructed on the basis of the traditional policy-value network for wargame deduction, and the model is built in combination with battlefield situation awareness methods. Experimental results show that, in a wargame combat environment that approximately conforms to military operational rules, when the traditional policy-value network and the LAPVN are trained for comparison over 400 self-play training games, the loss of the LAPVN model decreases from 5.3 to 2.3, it converges faster than the traditional policy-value network, and its KL divergence approaches zero during training.
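As a point of reference only (not the paper's exact derivation), the baseline idea summarized in the abstract is usually formalized as a policy-gradient update in which a state-value estimate is subtracted from the return; the notation below (policy \pi_\theta, value estimate V_\phi, return G_t, advantage A_t, learning rate \alpha) is generic and assumed here rather than taken from the paper:

% Standard policy gradient with a state-value baseline (generic sketch, hedged)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      \bigl(G_t - V_\phi(s_t)\bigr)\right],
\qquad A_t = G_t - V_\phi(s_t)

% Corresponding parameter update after introducing the baseline
\theta \leftarrow \theta + \alpha\, A_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Subtracting the baseline V_\phi(s_t) leaves the gradient estimate unbiased while reducing its variance, which is the general mechanism behind the faster policy-network training reported in the abstract.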

Key words: wargame, situation awareness, deep reinforcement learning, Convolutional Neural Network (CNN), actor-critic method