
Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 396-403. doi: 10.19678/j.issn.1000-3428.0069472

• New-Generation Networks and Edge Computing •

Fault-tolerant Scheduling Method for Cloud Computing Based on Reliability Decomposition

YIN Chao, SHI Xuhua*

  1. School of Information Science and Engineering, Ningbo University, Ningbo 315211, Zhejiang, China
  • Received: 2024-03-04 Revised: 2024-09-29 Online: 2024-12-11 Published: 2026-05-15
  • Contact: SHI Xuhua
  • About the authors:

    YIN Chao, male, master's degree; main research interest: scheduling optimization

    SHI Xuhua (corresponding author), professor

  • Funding:
    National Natural Science Foundation of China (61773225); Ningbo Key Research and Development Program and "Unveiling and Leading" (揭榜挂帅) Project (2023Z067)

Abstract:

Workflows are a commonly adopted execution paradigm in cloud computing environments, and reliability is a crucial Quality of Service (QoS) metric when executing cloud workflow tasks. Few existing methods satisfy the reliability requirements of workflow computation while also optimizing completion time and cost. Scheduling algorithms based on neural networks and similar models need substantial time to search for optimized parameters when workflows are large, and the decomposition strategies of existing reliability decomposition-based scheduling algorithms leave room for improvement. To address these issues, this paper proposes a reliability decomposition-based fault-tolerant scheduling method. This heuristic method consists of four steps: calculating task-scheduling priorities, determining reliability allocation weights, performing an initial decomposition of the overall reliability requirement, and selecting Virtual Machines (VMs) for task replicas. The method mainly improves two strategies, reliability decomposition and VM selection. The reliability decomposition strategy is based on the computational size of workflow tasks and their predecessor-successor dependencies, while the VM selection strategy is based on a weighted combination of relative task completion time and relative execution cost. Experiments are conducted on workflows of various types and scales under different reliability requirements. The results indicate that the proposed method satisfies the specified reliability requirements and achieves better overall performance in completion time and cost than three baseline algorithms: QFEC, QFEC+, and C_GM. This paper provides new solutions and insights for research on reliability decomposition and fault-tolerant scheduling in cloud workflow execution.

Key words: cloud computing, workflow, fault-tolerance, reliability, scheduling
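
The abstract names the steps of the method but gives no formulas, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that the workflow succeeds only if every task succeeds, that the reliability allocation weight of a task is simply its size, that VM failures follow a Poisson model, and that replicas fail independently. The names `VM`, `decompose`, `select_replicas`, and `alpha` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    speed: float         # work units processed per second (assumed model)
    price: float         # cost per second of use (assumed model)
    failure_rate: float  # failures per second, Poisson model (assumed)


def decompose(r_req: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a workflow reliability goal into per-task goals.

    Assumption: all tasks must succeed, so the task goals must multiply
    back to r_req. Exponents w / sum(w) guarantee that, and give
    higher-weight tasks a looser (smaller) goal.
    """
    total = sum(weights.values())
    return {task: r_req ** (w / total) for task, w in weights.items()}


def select_replicas(size: float, r_goal: float, vms: list[VM], alpha: float = 0.5):
    """Greedily place replicas on distinct VMs until r_goal is reached.

    VMs are ranked by a weighted sum of relative completion time and
    relative cost (lower is better). Returns the chosen VM names and the
    reliability achieved, which may fall short if the VMs run out.
    """
    cand = []
    for vm in vms:
        t = size / vm.speed
        cand.append((vm, t, t * vm.price, math.exp(-vm.failure_rate * t)))
    t_max = max(c[1] for c in cand)
    c_max = max(c[2] for c in cand)
    cand.sort(key=lambda c: alpha * c[1] / t_max + (1 - alpha) * c[2] / c_max)

    chosen, fail_all = [], 1.0  # fail_all: probability that every replica fails
    for vm, _, _, rel in cand:
        if 1.0 - fail_all >= r_goal:
            break
        chosen.append(vm.name)
        fail_all *= 1.0 - rel
    return chosen, 1.0 - fail_all


# Illustrative data only; none of these numbers come from the paper.
vms = [
    VM("fast", speed=4.0, price=1.0, failure_rate=1e-3),
    VM("mid", speed=2.0, price=0.3, failure_rate=2e-3),
    VM("slow", speed=1.0, price=0.1, failure_rate=4e-3),
]
sizes = {"t1": 10.0, "t2": 30.0}
goals = decompose(0.99, sizes)  # task size used as the allocation weight
plan = {t: select_replicas(sizes[t], goals[t], vms) for t in sizes}
for task, (names, achieved) in plan.items():
    print(task, names, round(achieved, 6), ">=", round(goals[task], 6))
```

Raising `alpha` toward 1 favors faster VMs, while lowering it toward 0 favors cheaper ones. A fuller version would also derive the weights from predecessor-successor relations and schedule tasks in priority order, which the abstract mentions but does not specify.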