
Computer Engineering ›› 2025, Vol. 51 ›› Issue (4): 66-74. doi: 10.19678/j.issn.1000-3428.0069097

• Artificial Intelligence and Pattern Recognition •

Causal Reinforcement Learning Algorithm Based on Causal Mask

HUANG Siyang1, CAI Ruichu1,*, QIAO Jie1, HAO Zhifeng2   

  1. School of Computing, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
    2. School of Science, Shantou University, Shantou 515063, Guangdong, China
  • Received:2023-12-26 Online:2025-04-15 Published:2024-05-22
  • Contact: CAI Ruichu
  • Supported by: the National Natural Science Foundation of China (61876043, 61976052, 62206064); the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0111501); the National Science Fund for Excellent Young Scholars (62122022)

Abstract:

Reinforcement Learning (RL) has become an important approach to sequential decision-making problems such as root cause localization of fault alarms; however, existing RL methods suffer from low sample efficiency and high exploration costs, which hinder their wide application. Studies have shown that introducing causal knowledge offers great potential for improving the decision interpretability and sample efficiency of RL agents. However, most existing methods only model the causal relationships of the environment implicitly and fail to directly exploit knowledge of the causal structure. Therefore, this study proposes a two-stage causal RL algorithm: the first stage explicitly models the environment variables with a causal model learned from observational data, and the second stage constructs a causal mask from the learned causal structure to augment the policy, which helps narrow the decision space and reduce exploration risk. Given the lack of public benchmark environments that allow direct causal reasoning, this study designs a root cause localization task in a simulated fault alarm environment and demonstrates the effectiveness and robustness of the proposed algorithm through comparative experiments in environments of different dimensions. The experimental results show that, compared with the mainstream Soft Actor-Critic (SAC) RL algorithm, the proposed algorithm improves the cumulative reward by 13% in the low-dimensional environment and by 79% in the high-dimensional environment, and the policy converges after only a small number of exploration steps, with sample efficiency improved by 27% and 52% in the low- and high-dimensional environments, respectively.
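
As a rough illustration of the two-stage idea summarized above, the following Python sketch shows one plausible way a causal mask could be derived from a learned causal graph and applied to a policy's action logits. The adjacency matrix causal_adj, the alarm indicator active_alarms, and the ancestor-based masking rule are illustrative assumptions for a root-cause-localization setting, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's implementation) of the
# causal-mask idea: stage 1 is assumed to yield a binary causal adjacency
# matrix over alarm variables; stage 2 keeps only actions (candidate root
# causes) that are causal ancestors of currently active alarms.
import numpy as np
import torch


def causal_mask(causal_adj: np.ndarray, active_alarms: np.ndarray) -> torch.Tensor:
    """Return a 0/1 mask over candidate root-cause actions.

    causal_adj[i, j] = 1 means variable i directly causes variable j;
    active_alarms is a 0/1 vector marking currently alarmed variables.
    """
    n = causal_adj.shape[0]
    # Reachability (transitive closure) by repeated boolean squaring.
    reach = np.eye(n, dtype=bool) | causal_adj.astype(bool)
    for _ in range(int(np.ceil(np.log2(max(n, 2))))):
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    # Action i stays available if it can reach at least one active alarm.
    allowed = reach[:, active_alarms.astype(bool)].any(axis=1)
    return torch.from_numpy(allowed.astype(np.float32))


def masked_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Apply the causal mask to policy logits before sampling an action."""
    return logits.masked_fill(mask == 0, float("-inf"))
```

In a discrete-action agent (for example, a discrete SAC actor), masked_logits would be applied to the actor's output before sampling, so exploration is confined to causally plausible root causes rather than the full action space.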

Key words: Reinforcement Learning (RL), causal discovery, causal RL, causal mask, policy learning