
Computer Engineering


A new cost-aware Spark job scheduling algorithm


  • Published: 2026-03-17


Abstract: Big data processing frameworks such as Apache Spark have attracted significant attention due to their widespread use in large-scale data analysis. However, it is difficult to balance computing cost and runtime performance by relying on a single deployment mode alone (e.g., on-premises or cloud-based), especially for data-intensive tasks. Hybrid cloud deployment combines local resources with public cloud resources, offering a flexible and efficient solution that balances cost and performance. However, job scheduling in hybrid cloud environments faces numerous challenges, including optimizing resource utilization and job execution cost. Existing scheduling algorithms often fail to fully account for the directed acyclic graph (DAG) structure of Spark jobs and the characteristics of multi-stage scheduling, which leads to prolonged job execution times in parallel-job scenarios and an inability to reduce costs effectively. To address these issues, this paper proposes a cost-aware particle swarm optimization (CA-PSO) scheduling algorithm for Spark jobs. By incorporating a cost model, the algorithm includes the rental cost of virtual machine (VM) instances in its optimization objective and dynamically adjusts the resource allocation strategy to minimize resource usage while meeting performance requirements, thereby reducing cluster operating costs. In addition, the algorithm leverages the DAG dependency structure of Spark jobs and introduces a multi-job, multi-stage scheduling mechanism that optimizes both the resource allocation strategy and the stage execution order. This approach not only reduces cluster cost effectively but also significantly improves the overall performance of multi-job scheduling in a hybrid cloud environment.
Simulation and real-cluster experiments demonstrate that, compared with existing scheduling algorithms, CA-PSO exhibits good scalability, adapts to different VM pricing models and various Spark job types, and reduces the usage cost of hybrid clusters.
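To make the idea of cost-aware PSO scheduling concrete, the sketch below shows a minimal particle swarm optimizer that chooses how many cloud VMs of each type to rent, minimizing a fitness that combines VM rental cost with a soft-deadline penalty on runtime. The VM catalogue, workload, deadline, and all weights are illustrative assumptions, not values or models from the paper, and the sketch omits the paper's DAG-aware multi-job, multi-stage mechanism.

```python
import random

# Hypothetical VM catalogue: (name, hourly price in $, relative speed).
# These numbers are illustrative assumptions, not from the paper.
VM_TYPES = [("small", 0.10, 1.0), ("medium", 0.19, 2.0), ("large", 0.36, 4.0)]
WORKLOAD = 40.0   # total work of all Spark stages, in "small-VM hours" (assumed)
DEADLINE = 4.0    # hours within which the jobs must finish (assumed)

def fitness(counts):
    """Cost model: VM rental cost plus a penalty when the deadline is missed."""
    counts = [max(0, round(c)) for c in counts]
    speed = sum(n * s for (_, _, s), n in zip(VM_TYPES, counts))
    if speed == 0:
        return float("inf")          # no VMs rented: infeasible
    runtime = WORKLOAD / speed       # hours to finish with this allocation
    rent = sum(n * p for (_, p, _), n in zip(VM_TYPES, counts)) * runtime
    penalty = 100.0 * max(0.0, runtime - DEADLINE)  # soft deadline constraint
    return rent + penalty

def pso(n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Standard PSO over continuous VM counts, rounded to integers in fitness()."""
    rng = random.Random(seed)
    dim = len(VM_TYPES)
    xs = [[rng.uniform(0, 10) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(x) for x in xs]    # per-particle best positions
    gbest = min(pbest, key=fitness)  # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] = min(10.0, max(0.0, xs[i][d] + vs[i][d]))
            if fitness(xs[i]) < fitness(pbest[i]):
                pbest[i] = list(xs[i])
        gbest = min(pbest + [gbest], key=fitness)
    return [max(0, round(c)) for c in gbest], fitness(gbest)

plan, cost = pso()
print("VM counts:", plan, "total cost:", round(cost, 2))
```

In this toy model the rental cost of a single-type allocation is `WORKLOAD * price / speed` once the deadline is met, so the swarm is driven toward the VM type with the best price-to-speed ratio rented in sufficient quantity; the actual CA-PSO algorithm would additionally encode stage-to-resource assignments and stage ordering derived from the job DAGs.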
