
Computer Engineering


A new cost-aware Spark job scheduling algorithm


  • Published: 2026-03-17


Abstract: Big data processing frameworks such as Apache Spark have attracted significant attention due to their widespread use in large-scale data analysis. However, it is difficult to balance computing cost and runtime performance by relying on a single deployment mode alone (e.g., on-premises or cloud-based), especially for data-intensive tasks. Hybrid cloud deployment combines local resources with public cloud resources, offering a flexible and efficient solution that balances cost and performance. However, job scheduling in hybrid cloud environments faces numerous challenges, including optimizing resource utilization and job execution cost. Existing scheduling algorithms often fail to fully account for the directed acyclic graph (DAG) structure of Spark jobs and the characteristics of multi-stage scheduling, which leads to prolonged job execution times in parallel-job scenarios and an inability to reduce costs effectively. To address these issues, this paper proposes a cost-aware particle swarm optimization (CA-PSO) scheduling algorithm for Spark jobs. By incorporating a cost model, the algorithm includes the rental cost of virtual machine (VM) instances in its optimization objective and dynamically adjusts the resource allocation strategy to minimize resource usage while meeting performance requirements, thereby reducing cluster operating costs. In addition, the algorithm leverages the DAG dependency structure of Spark jobs and introduces a multi-job, multi-stage scheduling mechanism that optimizes both the resource allocation strategy and the stage execution order. This approach not only reduces cluster cost effectively but also significantly improves the overall performance of multi-job scheduling in a hybrid cloud environment.
Simulation and real-cluster experiments demonstrate that, compared with existing scheduling algorithms, CA-PSO exhibits good scalability, adapts to different VM pricing models and various Spark job types, and reduces the usage cost of hybrid clusters.
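To make the idea of cost-aware PSO scheduling concrete, the sketch below shows a minimal particle swarm optimizer that chooses how many cloud VMs of each type to rent, minimizing a fitness that combines VM rental cost with a soft-deadline penalty on runtime. The VM catalogue, workload, deadline, and all weights are illustrative assumptions, not values or models from the paper, and the sketch omits the paper's DAG-aware multi-job, multi-stage mechanism.

```python
import random

# Hypothetical VM catalogue: (name, hourly price in $, relative speed).
# These numbers are illustrative assumptions, not from the paper.
VM_TYPES = [("small", 0.10, 1.0), ("medium", 0.19, 2.0), ("large", 0.36, 4.0)]
WORKLOAD = 40.0   # total work of all Spark stages, in "small-VM hours" (assumed)
DEADLINE = 4.0    # hours within which the jobs must finish (assumed)

def fitness(counts):
    """Cost model: VM rental cost plus a penalty when the deadline is missed."""
    counts = [max(0, round(c)) for c in counts]
    speed = sum(n * s for (_, _, s), n in zip(VM_TYPES, counts))
    if speed == 0:
        return float("inf")          # no VMs rented: infeasible
    runtime = WORKLOAD / speed       # hours to finish with this allocation
    rent = sum(n * p for (_, p, _), n in zip(VM_TYPES, counts)) * runtime
    penalty = 100.0 * max(0.0, runtime - DEADLINE)  # soft deadline constraint
    return rent + penalty

def pso(n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Standard PSO over continuous VM counts, rounded to integers in fitness()."""
    rng = random.Random(seed)
    dim = len(VM_TYPES)
    xs = [[rng.uniform(0, 10) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(x) for x in xs]    # per-particle best positions
    gbest = min(pbest, key=fitness)  # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] = min(10.0, max(0.0, xs[i][d] + vs[i][d]))
            if fitness(xs[i]) < fitness(pbest[i]):
                pbest[i] = list(xs[i])
        gbest = min(pbest + [gbest], key=fitness)
    return [max(0, round(c)) for c in gbest], fitness(gbest)

plan, cost = pso()
print("VM counts:", plan, "total cost:", round(cost, 2))
```

In this toy model the rental cost of a single-type allocation is `WORKLOAD * price / speed` once the deadline is met, so the swarm is driven toward the VM type with the best price-to-speed ratio rented in sufficient quantity; the actual CA-PSO algorithm would additionally encode stage-to-resource assignments and stage ordering derived from the job DAGs.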
