
Computer Engineering (计算机工程), 2025, Vol. 51, Issue 7: 59-67. doi: 10.19678/j.issn.1000-3428.0069035

• 热点与综述 •

一种集成于超算作业调度系统应用的并行参数优化方法

张文帅1, 李会民1,*, 李京2, 潘必才3

    1. 中国科学技术大学网络信息中心超级计算中心, 安徽 合肥 230026
    2. 中国科学技术大学计算机科学与技术学院, 安徽 合肥 230026
    3. 中国科学技术大学物理系, 安徽 合肥 230026
  • 收稿日期:2023-12-18 出版日期:2025-07-15 发布日期:2024-06-18
  • 通讯作者: 李会民
  • 基金资助:
    中国科学院A类先导科技专项(XDA19020102)

A Parallel Parameter Optimization Method Integrated with Job Scheduling System for Supercomputing Applications

ZHANG Wenshuai1, LI Huimin1,*, LI Jing2, PAN Bicai3

    1. Supercomputing Center, Network Information Center, University of Science and Technology of China, Hefei 230026, Anhui, China
    2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
    3. Department of Physics, University of Science and Technology of China, Hefei 230026, Anhui, China
  • Received: 2023-12-18 Online: 2024-06-18 Published: 2025-07-15
  • Contact: LI Huimin

摘要:

随着高性能计算体系结构的发展,软件与硬件都具有多层的并行结构。当不同纵向层级与横向分组的计算任务被划分到不同节点的不同处理器时,存在非常多的分配方式。这些分配方式一般在运行时由用户输入的多个并行参数来确定,并对计算效率影响很大。随着计算规模与复杂度的提升,多个并行参数的可配置空间越来越大,用户越来越难以确定最佳的并行参数值。这类运行时优化问题在科学计算应用中较为普遍,但相关的研究与解决方法比较少见。以VASP(Vienna Ab initio Simulation Package)应用为例,首先分析了该应用的多层并行结构,展示了不同并行参数配置引发的巨大运行速度差异。然后提出了一个基于约化并行效率指标的全自动运行优化方法,其不仅可以帮助用户简单快捷地确定最佳应用并行参数,而且可以帮助用户确定最佳的计算资源使用量,使应用可以高效率地扩展到大规模的并行计算中。最后将该优化方法与计算集群作业调度系统相融合应用于用户提交的真实VASP计算作业。统计结果表明,该方法显著提升了作业运行速度与超算资源的使用效率,具有很好的工程应用前景。

关键词: 并行计算, 作业调度, 运行时优化, 超级计算, VASP应用

Abstract:

As high-performance computing architectures have evolved, both software and hardware have acquired multilayer parallel structures. When computational tasks from different vertical tiers and horizontal groups are mapped onto different processors on different nodes, a very large number of allocation schemes is possible. These schemes, typically determined at runtime by several user-supplied parallel parameters, significantly affect computational efficiency. As computational scale and complexity increase, the configurable space of these parameters expands, making it increasingly difficult for users to identify the optimal settings. Although such runtime optimization problems are prevalent in scientific computing applications, related research and effective solutions remain scarce. Using the Vienna Ab initio Simulation Package (VASP) as a case study, this study first analyzes the application's multilayer parallel structure and shows that different parallel parameter configurations lead to large differences in run speed. It then proposes a fully automated runtime optimization method based on a reduced parallel efficiency metric. The method helps users quickly determine both the optimal parallel parameters and the optimal amount of computing resources, so that the application scales efficiently to large-scale parallel computing. Finally, the method is integrated with a cluster job scheduling system and applied to real VASP jobs submitted by users. Statistical results show that it significantly improves job execution speed and the utilization efficiency of supercomputing resources, indicating good prospects for engineering application.

Key words: parallel computing, job scheduling, runtime optimization, supercomputing, VASP application
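
The kind of search space the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not define the reduced parallel efficiency metric, so the sketch substitutes the ordinary parallel efficiency (reference core-time divided by measured core-time). KPAR and NCORE are real VASP INCAR tags, but the divisibility constraints used here are a simplifying assumption, and the function names, the timing data, and the 0.5 efficiency threshold are all hypothetical.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def candidate_configs(total_ranks, n_kpoints):
    """Enumerate (KPAR, NCORE) pairs for a given number of MPI ranks.

    Assumption for this sketch: KPAR must divide total_ranks and not exceed
    the number of k-points, and NCORE must divide the ranks in each k-point group.
    """
    configs = []
    for kpar in divisors(total_ranks):
        if kpar > n_kpoints:
            continue
        ranks_per_group = total_ranks // kpar
        for ncore in divisors(ranks_per_group):
            configs.append({"KPAR": kpar, "NCORE": ncore})
    return configs


def parallel_efficiency(t_ref, ranks_ref, t, ranks):
    """Ordinary parallel efficiency relative to a reference run.

    This is a stand-in for the paper's reduced metric, which is not
    specified in the abstract.
    """
    return (t_ref * ranks_ref) / (t * ranks)


def pick_best(timings, t_ref, ranks_ref, min_efficiency=0.5):
    """Pick the fastest run whose efficiency meets the threshold.

    timings is a list of (ranks, config, seconds) tuples from short trial runs.
    Returns None when no run is efficient enough.
    """
    acceptable = [
        run for run in timings
        if parallel_efficiency(t_ref, ranks_ref, run[2], run[0]) >= min_efficiency
    ]
    if not acceptable:
        return None
    return min(acceptable, key=lambda run: run[2])


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    trial_runs = [
        (4, {"KPAR": 1, "NCORE": 4}, 30.0),
        (8, {"KPAR": 2, "NCORE": 4}, 28.0),
        (16, {"KPAR": 2, "NCORE": 8}, 27.0),
    ]
    print(len(candidate_configs(16, 4)), "candidate configurations for 16 ranks")
    print(pick_best(trial_runs, t_ref=100.0, ranks_ref=1))
```

Choosing the fastest run subject to an efficiency floor reflects the trade-off the abstract mentions between job speed and resource usage: in the made-up data, the 8-rank and 16-rank runs are slightly faster but fall below the threshold, so the 4-rank configuration is selected.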