Computer Engineering ›› 2020, Vol. 46 ›› Issue (1): 187-195. doi: 10.19678/j.issn.1000-3428.0053582

Dual Layer Job Scheduling System for Large Scale Heterogeneous Computing Clusters

SUN Zhenyu1,2, SHI Jingyan1, SUN Gongxing1, DU Ran1, JIANG Xiaowei1, ZOU Jiaheng1, TAN Hongnan1,2   

    1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China;
    2. School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2019-01-07 Revised: 2019-04-09 Online: 2020-01-15 Published: 2019-05-22

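  • About the authors: SUN Zhenyu (born 1990), male, Ph.D. candidate, main research interest: job scheduling for computing clusters; SHI Jingyan, associate research fellow, Ph.D.; SUN Gongxing, research fellow, doctoral supervisor; DU Ran, assistant research fellow, Ph.D.; JIANG Xiaowei, engineer, master's degree; ZOU Jiaheng, associate research fellow, Ph.D.; TAN Hongnan, master's degree.
  • Supported by:
    National Natural Science Foundation of China (11475210); Young Scientists Fund of the National Natural Science Foundation of China (11805225).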

Abstract: The HTCondor and SLURM computing clusters in high-energy physics computing platforms provide data processing services for many high-energy physics experiments. However, HTCondor schedules parallel jobs inefficiently, SLURM struggles to manage massive numbers of serial jobs, and the platform-wide resource management and scheduling strategies are too simple. To meet the demands of high-energy physics computing clusters running under heavy load, this paper designs a dual layer job scheduling system that adds a job management layer on top of the existing job scheduler. The system is designed to schedule serial and parallel jobs efficiently, ensure fair use of resources among experiment groups, and give users fine-grained control over their jobs. Test results show that the dual layer job scheduling system supports rapid submission of massive numbers of high-energy physics jobs, makes full use of the computing platform's resources, and achieves good job scheduling performance.

Key words: computing cluster management, job scheduler, High Throughput Computing (HTC), High Performance Computing (HPC), high energy physics computing
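
The abstract describes a job management layer placed above an existing scheduler, handling both serial and parallel jobs while keeping resource use fair among experiment groups. The following is only a minimal sketch of that general idea, not the system from the paper: the class name `JobManagementLayer`, the usage-to-share fairness rule, the backend labels, and the rule that a single core means a serial job are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    group: str   # experiment group that owns the job
    cores: int   # assumption: 1 core = serial job, more = parallel job


class JobManagementLayer:
    """Hypothetical upper layer: queues jobs per group, then picks a backend."""

    def __init__(self, shares):
        self.shares = shares              # group -> relative share weight
        self.queues = defaultdict(deque)  # group -> pending jobs (FIFO)
        self.used = defaultdict(int)      # group -> cores dispatched so far

    def submit(self, job):
        self.queues[job.group].append(job)

    def _next_group(self):
        # Assumed fairness rule: serve the group with pending jobs whose
        # usage-to-share ratio is lowest; ties are broken by group name.
        pending = [g for g, q in self.queues.items() if q]
        if not pending:
            return None
        return min(pending,
                   key=lambda g: (self.used[g] / self.shares.get(g, 1.0), g))

    def dispatch(self, free_cores):
        """Return (backend, job) pairs that fit into free_cores."""
        plan = []
        while True:
            group = self._next_group()
            if group is None:
                break
            job = self.queues[group][0]
            if job.cores > free_cores:
                break  # simplification: stop once the fairest job does not fit
            self.queues[group].popleft()
            free_cores -= job.cores
            self.used[group] += job.cores
            backend = "serial_backend" if job.cores == 1 else "parallel_backend"
            plan.append((backend, job))
        return plan


layer = JobManagementLayer(shares={"A": 1.0, "B": 1.0})
layer.submit(Job(1, "A", 1))
layer.submit(Job(2, "A", 1))
layer.submit(Job(3, "B", 4))
plan = layer.dispatch(free_cores=5)
```

With five free cores, group A is served first on the tie, group B next because its usage is lower, and the second job from group A stays queued until cores are released. A real lower layer would hand each pair to the underlying scheduler; that step is left out here because the abstract does not describe it.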

