
Computer Engineering


An Online Reinforcement Learning Task Scheduling Algorithm Based on Policy Entropy Supervision


  • Published: 2026-03-17


Abstract: In cloud computing environments, workloads and resource states change continuously over time, which often causes reinforcement-learning-based scheduling policies to suffer from unstable randomness during online execution, leading to increased energy consumption or degraded response time. Conventional Soft Actor–Critic (SAC) mainly relies on temperature tuning during training to control policy randomness, and thus struggles to adapt promptly to non-stationary workloads in real systems. To address this issue, this paper proposes an entropy-supervised Soft Actor–Critic algorithm for online cloud task scheduling, referred to as ESAC. Without altering the original training structure, ESAC introduces a policy entropy supervision mechanism during inference to monitor policy randomness in real time and triggers lightweight entropy feedback fine-tuning when the entropy deviates from a stable range, enabling fast correction with constant computational cost. In addition, sliding-window reward normalization and periodic incremental updates are employed to alleviate numerical instability caused by reward scale drift under dynamic workloads. Experiments based on dynamic workload simulations constructed from the Alibaba Cluster Trace 2018 demonstrate that ESAC consistently outperforms several representative scheduling algorithms under different load intensities and burst scenarios, reducing the average energy consumption per task by about 1.8% and the average response time by up to 3.01%. Compared with the A2C baseline, ESAC achieves improvements of 70.7%, 76.0%, and 76.2% in the composite performance metric under three load scenarios, while maintaining acceptable online scheduling overhead. These results verify the effectiveness of the proposed method in enhancing the stability and adaptability of online scheduling in non-stationary cloud environments.
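The abstract describes two inference-time mechanisms: monitoring policy entropy against a stable band to trigger lightweight fine-tuning, and sliding-window reward normalization to counter reward-scale drift. The following is a minimal sketch of those two ideas only; the class names, thresholds, and window sizes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Shannon entropy of a discrete action distribution."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    p = p / p.sum()  # renormalize after clipping
    return float(-(p * np.log(p)).sum())

class EntropyMonitor:
    """Tracks policy entropy at inference time and flags when its
    windowed mean drifts out of a stable band [low, high], which
    would signal a lightweight entropy-feedback fine-tune.
    (Band and window are assumed values for illustration.)"""
    def __init__(self, low, high, window=50):
        self.low, self.high = low, high
        self.window = window
        self.history = []

    def update(self, probs):
        self.history.append(policy_entropy(probs))
        if len(self.history) > self.window:
            self.history.pop(0)
        mean_h = sum(self.history) / len(self.history)
        # True => entropy left the stable band; trigger fine-tuning.
        return not (self.low <= mean_h <= self.high)

class RewardNormalizer:
    """Sliding-window reward normalization, intended to mitigate
    reward-scale drift under non-stationary workloads."""
    def __init__(self, window=200):
        self.window = window
        self.buf = []

    def normalize(self, r):
        self.buf.append(float(r))
        if len(self.buf) > self.window:
            self.buf.pop(0)
        mu = np.mean(self.buf)
        sigma = np.std(self.buf) + 1e-8  # avoid division by zero
        return (r - mu) / sigma
```

Both operations run in constant time per scheduling decision (bounded-size buffers), which is consistent with the constant-cost online overhead the abstract claims; the fine-tuning step itself is not shown here.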
