Fault-tolerant Scheduling Method for Cloud Computing Based on Reliability Decomposition

doi:10.19678/j.issn.1000-3428.0069472

Abstract:

Workflows are a commonly adopted execution paradigm in cloud computing environments, and reliability is a key Quality of Service (QoS) metric when executing cloud workflow tasks. Few existing methods satisfy the reliability requirements of workflow computation while also optimizing completion time and cost. Scheduling algorithms based on neural networks and similar techniques need substantial time to search for optimized parameter models when workflows are large, and the decomposition strategies of existing reliability decomposition-based scheduling algorithms leave room for improvement. To address these issues, this paper proposes a fault-tolerant scheduling method based on reliability decomposition. The method is heuristic and consists of four steps: calculating task scheduling priorities, calculating reliability allocation weights, performing an initial decomposition of the overall reliability requirement, and selecting Virtual Machines (VMs) for task replicas. Its main contributions are an improved reliability decomposition strategy and an improved VM selection strategy. The reliability decomposition strategy is based on the size of each workflow task and its predecessor-successor relationships, while the VM selection strategy is based on a weighted combination of relative completion time and cost. Experiments are conducted on workflows of different types and scales under different reliability requirements. The results indicate that the proposed method satisfies the reliability requirements and achieves better combined performance in completion time and cost than three baseline algorithms: QFEC, QFEC+, and C_GM. This paper provides a new solution and new ideas for research on reliability decomposition and fault-tolerant scheduling in cloud workflow execution.

Key words: cloud computing, workflow, fault-tolerance, reliability, scheduling
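
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no formulas, so everything here is an assumption made for illustration. The overall requirement is split geometrically so that the per-task targets multiply back to it, VM reliability follows an exponential (Poisson) failure model, and replicas are added in order of a weighted score of relative completion time and cost. The names `decompose_reliability`, `vm_reliability`, `select_replicas`, and the parameter `alpha` are hypothetical.

```python
import math


def decompose_reliability(r_req, weights):
    """Split an overall requirement r_req (0 < r_req < 1) into per-task targets.

    Assumption: task i receives r_req ** (w_i / sum(w)), so the product of
    all targets equals r_req. A larger weight means a looser target.
    """
    total = sum(weights.values())
    return {task: r_req ** (w / total) for task, w in weights.items()}


def vm_reliability(failure_rate, exec_time):
    """Assumed Poisson failure model: probability of no failure in exec_time."""
    return math.exp(-failure_rate * exec_time)


def select_replicas(target, vms, alpha=0.5):
    """Add replicas on the best-scoring VMs until the task target is met.

    vms: list of dicts with keys "name", "time", "price", "rate" (hypothetical).
    Score (lower is better) = alpha * relative time + (1 - alpha) * relative cost.
    If the target cannot be met, all VMs are returned with the achieved value.
    """
    t_max = max(v["time"] for v in vms)
    c_max = max(v["time"] * v["price"] for v in vms)

    def score(v):
        rel_time = v["time"] / t_max
        rel_cost = v["time"] * v["price"] / c_max
        return alpha * rel_time + (1 - alpha) * rel_cost

    chosen, fail_prob = [], 1.0
    for v in sorted(vms, key=score):
        chosen.append(v["name"])
        # The task fails only if every replica fails.
        fail_prob *= 1.0 - vm_reliability(v["rate"], v["time"])
        if 1.0 - fail_prob >= target:
            break
    return chosen, 1.0 - fail_prob


if __name__ == "__main__":
    targets = decompose_reliability(0.9, {"t1": 1.0, "t2": 3.0})
    vms = [
        {"name": "vm1", "time": 10.0, "price": 1.0, "rate": 0.01},
        {"name": "vm2", "time": 4.0, "price": 4.0, "rate": 0.02},
    ]
    for task, target in targets.items():
        print(task, round(target, 4), select_replicas(target, vms))
```

In this sketch, raising `alpha` favors faster VMs and lowering it favors cheaper ones; the paper's actual weighting and its use of task size and predecessor-successor relationships to derive the weights are not reproduced here.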

YIN Chao, SHI Xuhua. Fault-tolerant Scheduling Method for Cloud Computing Based on Reliability Decomposition[J]. Computer Engineering, 2026, 52(5): 396-403.

https://www.ecice06.com/CN/Y2026/V52/I5/396

Figures/Tables: 10

Fig.1 Example of workflow model

Fig.2 Experimental results of completion time for linear algebra workflow

Fig.3 Experimental results of cost for linear algebra workflow

Fig.4 Experimental results of completion time for Gaussian elimination workflow

Fig.5 Experimental results of cost for Gaussian elimination workflow

Fig.6 Experimental results of completion time for fast Fourier transform workflow

Fig.7 Experimental results of cost for fast Fourier transform workflow
