
Computer Engineering ›› 2025, Vol. 51 ›› Issue (4): 66-74. doi: 10.19678/j.issn.1000-3428.0069097

• Artificial Intelligence and Pattern Recognition •

Causal Reinforcement Learning Algorithm Based on Causal Mask

HUANG Siyang1, CAI Ruichu1,*, QIAO Jie1, HAO Zhifeng2   

  1. School of Computing, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
    2. School of Science, Shantou University, Shantou 515063, Guangdong, China
  • Received:2023-12-26 Online:2025-04-15 Published:2024-05-22
  • Contact: CAI Ruichu
  • Supported by: the National Natural Science Foundation of China (61876043, 61976052, 62206064); the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0111501); the National Science Fund for Excellent Young Scholars (62122022)

Abstract:

Reinforcement Learning (RL) has become an important approach to sequential decision-making problems such as root cause localization of fault alarms; however, existing RL methods suffer from low sample efficiency and high exploration costs, which hinder their wide application. Studies have shown that introducing causal knowledge offers great potential for improving the decision interpretability and sample efficiency of RL agents. However, most existing methods only model the causal relationships of the environment implicitly and fail to directly exploit knowledge of the causal structure. Therefore, this study proposes a two-stage causal RL algorithm: the first stage explicitly models the environment variables with a causal model learned from observational data, and the second stage constructs a causal mask from the learned causal structure to augment the policy, which helps narrow the decision space and reduce exploration risk. Given the lack of public benchmark environments that allow direct causal reasoning, this study designs a root cause localization task in a simulated fault alarm environment and demonstrates the effectiveness and robustness of the proposed algorithm through comparative experiments in environments of different dimensions. The experimental results show that, compared with the mainstream Soft Actor-Critic (SAC) RL algorithm, the proposed algorithm improves the cumulative reward by 13% in the low-dimensional environment and by 79% in the high-dimensional environment, and the policy converges after only a small number of exploration steps, with sample efficiency improved by 27% and 52% in the low- and high-dimensional environments, respectively.
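
As a rough illustration of the two-stage idea summarized above, the following Python sketch shows one plausible way a causal mask could be derived from a learned causal graph and applied to a policy's action logits. The adjacency matrix causal_adj, the alarm indicator active_alarms, and the ancestor-based masking rule are illustrative assumptions for a root-cause-localization setting, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's implementation) of the
# causal-mask idea: stage 1 is assumed to yield a binary causal adjacency
# matrix over alarm variables; stage 2 keeps only actions (candidate root
# causes) that are causal ancestors of currently active alarms.
import numpy as np
import torch


def causal_mask(causal_adj: np.ndarray, active_alarms: np.ndarray) -> torch.Tensor:
    """Return a 0/1 mask over candidate root-cause actions.

    causal_adj[i, j] = 1 means variable i directly causes variable j;
    active_alarms is a 0/1 vector marking currently alarmed variables.
    """
    n = causal_adj.shape[0]
    # Reachability (transitive closure) by repeated boolean squaring.
    reach = np.eye(n, dtype=bool) | causal_adj.astype(bool)
    for _ in range(int(np.ceil(np.log2(max(n, 2))))):
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    # Action i stays available if it can reach at least one active alarm.
    allowed = reach[:, active_alarms.astype(bool)].any(axis=1)
    return torch.from_numpy(allowed.astype(np.float32))


def masked_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Apply the causal mask to policy logits before sampling an action."""
    return logits.masked_fill(mask == 0, float("-inf"))
```

In a discrete-action agent (for example, a discrete SAC actor), masked_logits would be applied to the actor's output before sampling, so exploration is confined to causally plausible root causes rather than the full action space.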

Key words: Reinforcement Learning (RL), causal discovery, causal RL, causal mask, policy learning