
计算机工程 ›› 2025, Vol. 51 ›› Issue (12): 324-336. doi: 10.19678/j.issn.1000-3428.0069621

• 开发研究与工程应用 •

基于深度强化学习的无人机空战机动决策方法

张祥瑞1, 谭泰1, 李辉1,2,*, 张建伟1,2, 黎博文1

  1. 四川大学计算机学院, 四川 成都 610065
    2. 四川大学视觉合成图形图像技术国防重点学科实验室, 四川 成都 610065
  • 收稿日期:2024-03-19 修回日期:2024-06-24 出版日期:2025-12-15 发布日期:2024-08-20
  • 通讯作者: 李辉
  • 基金资助:
    国家自然科学基金联合基金项目(U20A20161)

Aerial Combat Maneuver Decision Method for Unmanned Aerial Vehicles Based on Deep Reinforcement Learning

ZHANG Xiangrui1, TAN Tai1, LI Hui1,2,*, ZHANG Jianwei1,2, LI Bowen1

  1. College of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China
    2. National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, Sichuan, China
  • Received:2024-03-19 Revised:2024-06-24 Online:2025-12-15 Published:2024-08-20
  • Contact: LI Hui

摘要:

无人机(UAV)近距空战环境复杂, 敌机机动变化迅速, 针对该环境下六自由度无人机空战自主机动决策困难的问题, 提出一种分层框架下基于双重奖励的近端策略优化(DR-PPO)无人机自主引导算法。针对传统深度强化学习方法在解决六自由度无人机空战任务时, 因动作空间维度高、探索空间大而导致算法收敛速度慢甚至难以学习到有效决策的问题, 设计无人机空战机动决策分层框架, 将空战任务分为决策与控制两个子问题: DR-PPO算法作为决策层生成高层决策, 通过双重奖励引导无人机更好地理解正确的空战行为, 解决空战任务中奖励稀疏、难以收敛的问题; 比例积分微分(PID)算法作为控制层, 生成无人机基本控制律, 将高层决策转换为原始控制指令并输出, 使DR-PPO算法更专注于无人机机动决策层面, 缩短飞行控制的探索时间, 加快算法的收敛速度。仿真结果表明, 在典型的空战实验场景中, 分层框架下的DR-PPO算法能够缩短探索时间, 避免陷入局部最优, 有效引导无人机在不同态势下自主学习机动决策并快速到达优势位置, 完成空战任务, 其收敛效果与机动决策表现均显著优于传统深度强化学习方法下的DR-PPO算法及PPO算法, 有效提高了无人机作战能力, 并通过复杂多场景测试验证了该算法具有良好的泛化性。
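To illustrate the dual-reward idea mentioned in the abstract, the following is a minimal, hypothetical Python sketch that combines a sparse win/lose event reward with a dense situational shaping reward built from relative angle and distance advantage; the specific geometric terms, range band, and weights are illustrative assumptions, not the paper's actual reward design.

```python
import math

def dual_reward(rel_angle_deg, aspect_angle_deg, distance_m, event=None,
                opt_range=(300.0, 1000.0), w_dense=0.01):
    """Hypothetical dual (dense + sparse) reward for close-range air combat.

    rel_angle_deg:    angle between own velocity vector and the line of sight
                      to the enemy (0 deg = enemy directly ahead).
    aspect_angle_deg: angle between the enemy's tail direction and the line of
                      sight (0 deg = we are on the enemy's tail).
    distance_m:       current separation in metres.
    event:            'win', 'lose', or None; drives the sparse terminal reward.
    """
    # Dense situational term: favour pointing at the enemy while staying behind it,
    # and staying inside an assumed weapon-employment range band.
    angle_adv = (1.0 - rel_angle_deg / 180.0) * (1.0 - aspect_angle_deg / 180.0)
    lo, hi = opt_range
    if lo <= distance_m <= hi:
        range_adv = 1.0
    else:
        gap = min(abs(distance_m - lo), abs(distance_m - hi))
        range_adv = math.exp(-gap / 1000.0)  # smooth decay outside the band
    dense = w_dense * angle_adv * range_adv

    # Sparse event term: large terminal signal that is too rare to learn from alone.
    sparse = {"win": 10.0, "lose": -10.0}.get(event, 0.0)
    return dense + sparse
```

Under a scheme of this kind the agent receives informative feedback at every step, while the sparse terminal term still dominates the overall objective, which is how a dual reward can mitigate the sparse-reward convergence problem described above.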

关键词: 无人机, 近端策略优化算法, 六自由度, 双重奖励, 分层框架

Abstract:

Unmanned Aerial Vehicle (UAV) close-range air combat environments are complex, with rapid changes in enemy aircraft maneuvers. To address the difficulty of autonomous maneuvering decision-making for UAVs with six Degrees of Freedom (6-DOF) in such environments, a hierarchical framework-based Proximal Policy Optimization with Dual Reward (DR-PPO) UAV autonomous guidance algorithm is proposed. When performing 6-DOF UAV air combat tasks, traditional deep reinforcement learning methods suffer from slow convergence and may even fail to learn effective decisions, owing to the high dimensionality of the action space and the large exploration space. To address this issue, this study designs a hierarchical decision-making framework for UAV air combat maneuvers and divides the air combat task into two sub-problems: decision and control. The DR-PPO algorithm serves as the decision layer and generates high-level decisions; the dual reward guides the UAV to better understand correct air combat behavior and alleviates the sparse-reward problem that otherwise hinders convergence in air combat missions. As the control layer, the Proportional-Integral-Derivative (PID) algorithm generates the basic control laws of the UAV and converts the high-level decisions into raw control commands, allowing the DR-PPO algorithm to focus on the maneuvering decision level, thereby shortening the exploration time spent on flight control and accelerating convergence. Simulation results in typical air combat scenarios show that the DR-PPO algorithm under the hierarchical framework shortens the exploration time, avoids falling into local optima, effectively guides the UAV to autonomously learn maneuvering decisions under different situations, quickly reaches an advantageous position, and completes the air combat task. Its convergence and maneuver decision-making performance are significantly better than those of the DR-PPO and PPO algorithms trained under the traditional (non-hierarchical) deep reinforcement learning approach, effectively improving the combat capability of the UAV. Complex multi-scenario tests verify that the algorithm generalizes well.
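To make the decision-control split concrete, here is a minimal Python sketch of how such a hierarchical loop is commonly wired, assuming the decision layer (the DR-PPO policy in the paper) outputs desired attitude and speed targets at a slow rate while a PID control layer tracks them at the fast control rate; the class names, command set, channels, gains, and rates are illustrative assumptions rather than the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class PID:
    """Single-channel Proportional-Integral-Derivative controller."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        # One PID update: returns the actuator command for this channel.
        error = target - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def control_layer(cmd: dict, state: dict, pids: dict, dt: float = 0.02) -> dict:
    """Control layer: track the decision layer's high-level command
    (desired roll/pitch/yaw angles and speed) with per-channel PIDs,
    producing raw control inputs for the 6-DOF model."""
    return {
        "aileron":  pids["roll"].step(cmd["roll"], state["roll"], dt),
        "elevator": pids["pitch"].step(cmd["pitch"], state["pitch"], dt),
        "rudder":   pids["yaw"].step(cmd["yaw"], state["yaw"], dt),
        "throttle": pids["speed"].step(cmd["speed"], state["speed"], dt),
    }

# The decision layer (DR-PPO in the paper) would be queried at a slower rate, e.g.:
#   cmd = policy.act(observation)            # hypothetical PPO policy interface
#   for _ in range(inner_steps):             # PID runs at the fast control rate
#       controls = control_layer(cmd, state, pids)
#       state = six_dof_model.step(controls) # hypothetical 6-DOF simulator
```

Separating the layers this way keeps the learned policy's action space down to a handful of target values instead of raw 6-DOF surface deflections, which is the source of the shorter exploration time and faster convergence described in the abstract.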

Key words: Unmanned Aerial Vehicle (UAV), Proximal Policy Optimization (PPO) algorithm, six degrees of freedom, Dual Reward (DR), hierarchical framework