计算机工程 (Computer Engineering)

A 6-DOF Air Combat Decision-Making Method Based on Progressive Deep Reinforcement Learning

Published: 2025-08-27

Abstract: Six-degree-of-freedom (6-DOF) unmanned aerial vehicle (UAV) air combat is a highly challenging scenario, involving high-dimensional continuous state and action spaces as well as nonlinear dynamics. To address decision-making in such scenarios, this paper proposes a Progressive Multi-objective Strategy Optimization (PMSO) algorithm, which improves policy learning by dynamically adjusting the granularity of the action space and incorporating multi-objective reward functions. To overcome the difficulty of decision-making, and even the failure to learn an effective policy, caused by the high dimensionality of the continuous action space and the resulting overly large search space, a progressive discretization mechanism is designed. In the initial stage, coarse-grained discrete action commands are adopted to explore the policy space quickly, exploiting the local similarity in the control effects of neighboring action commands to shrink the action search space. As training iterations progress and task difficulty increases, the discretization granularity is gradually refined, thereby preserving the control precision of the action commands. To address the sparse-reward problem prevalent in air combat tasks, multi-objective reward functions covering angle, distance, and altitude are designed; the coordination of these rewards guides the algorithm to better understand the impact of current action commands on the overall air combat task and accelerates convergence. In simulation experiments on randomized air combat scenarios covering advantageous, neutral, and disadvantageous initial situations, the proposed PMSO algorithm converges rapidly and learns effective air combat policies, outperforming existing air combat algorithms in both convergence speed and the effectiveness of the learned policies.
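The abstract provides no code, but the progressive discretization mechanism it describes can be sketched as a coarse-to-fine schedule over a discretized action grid. The Python sketch below is a minimal illustration under stated assumptions: the class name `ProgressiveActionGrid`, the bin-count schedule, and the progress-based staging are hypothetical choices, not the authors' implementation.

```python
import numpy as np

class ProgressiveActionGrid:
    """Illustrative coarse-to-fine discretization of continuous controls.

    Each control channel is split into n evenly spaced levels; n grows
    with training progress, so early training searches a small action
    set while later training recovers fine control precision.
    """

    def __init__(self, low, high, bins_schedule=(5, 9, 17, 33)):
        self.low = np.asarray(low, dtype=np.float64)    # per-channel lower bounds
        self.high = np.asarray(high, dtype=np.float64)  # per-channel upper bounds
        self.bins_schedule = bins_schedule              # coarse -> fine (assumed values)

    def n_bins(self, progress):
        """Bin count for a training-progress value in [0, 1]."""
        stage = min(int(progress * len(self.bins_schedule)),
                    len(self.bins_schedule) - 1)
        return self.bins_schedule[stage]

    def decode(self, indices, progress):
        """Map per-channel bin indices to continuous action commands."""
        n = self.n_bins(progress)
        frac = np.asarray(indices, dtype=np.float64) / (n - 1)
        return self.low + frac * (self.high - self.low)


# Example: 4 channels (aileron, elevator, rudder, throttle); bounds assumed.
grid = ProgressiveActionGrid(low=[-1, -1, -1, 0], high=[1, 1, 1, 1])
print(grid.n_bins(0.1), grid.n_bins(0.9))       # 5 levels early, 33 levels late
print(grid.decode([2, 2, 2, 4], progress=0.1))  # centered stick, full throttle
```

Because the policy only ever chooses among the currently active levels, the early search space stays small, while refining the grid later narrows the gap to fully continuous control.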
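Likewise, the multi-objective reward can be illustrated as a weighted sum of dense angle, distance, and altitude terms that replaces a sparse win/loss signal. The following sketch assumes a particular form: the geometry conventions (antenna train angle and aspect angle), the engagement-distance band, the altitude scale, and the weights are all illustrative and are not taken from the paper.

```python
import numpy as np

def shaped_reward(ata_deg, aa_deg, dist_m, delta_alt_m,
                  weights=(0.4, 0.4, 0.2),
                  dist_band=(1000.0, 3000.0), alt_scale=1000.0):
    """Hypothetical dense shaping reward: angle + distance + altitude.

    ata_deg:     antenna train angle, own nose to opponent, in [0, 180]
    aa_deg:      aspect angle, opponent tail to own aircraft, in [0, 180]
    dist_m:      slant range to the opponent in meters
    delta_alt_m: own altitude minus opponent altitude in meters
    """
    # Angle term: 1 when pointing directly at the opponent's tail,
    # 0 when the geometry is fully reversed.
    r_angle = 1.0 - (ata_deg + aa_deg) / 360.0

    # Distance term: flat inside the preferred engagement band,
    # decaying exponentially outside it.
    d_lo, d_hi = dist_band
    if dist_m < d_lo:
        r_dist = float(np.exp(-(d_lo - dist_m) / d_lo))
    elif dist_m > d_hi:
        r_dist = float(np.exp(-(dist_m - d_hi) / d_hi))
    else:
        r_dist = 1.0

    # Altitude term: bounded advantage for holding a height (energy) edge.
    r_alt = float(np.clip(delta_alt_m / alt_scale, -1.0, 1.0))

    w_a, w_d, w_h = weights
    return w_a * r_angle + w_d * r_dist + w_h * r_alt


# Example: good pointing geometry, inside the band, 500 m above the opponent.
print(shaped_reward(ata_deg=20, aa_deg=30, dist_m=2000, delta_alt_m=500))
```

The weighted-sum form makes each component's contribution easy to inspect in isolation, which is one common way such angle/distance/altitude rewards are coordinated to accelerate convergence.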