
Computer Engineering ›› 2023, Vol. 49 ›› Issue (4): 43-51. doi: 10.19678/j.issn.1000-3428.0066025

• Hotspots and Reviews •

  • Author biographies: XIA Libin (1995-), male, Ph.D. candidate; research interest: distributed computing. LIU Xiaoyu, Ph.D. candidate. JIANG Xiaowei, engineer and Ph.D. candidate. SUN Gongxing, research professor, Ph.D.
  • Funding: National Natural Science Foundation of China (12275295).

Memory Optimization Method for Parallel Computing Framework Based on Distributed Dataset

XIA Libin1,2, LIU Xiaoyu1,2, JIANG Xiaowei1,2, SUN Gongxing1   

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-10-18  Revised: 2022-12-12  Published: 2023-01-12


Abstract: With the rapid development of scientific computing and artificial intelligence, parallel computing in distributed environments has become an important means of solving large-scale theoretical computing and data processing problems. Growing memory capacity and the wide application of iterative algorithms have made in-memory computing technologies, represented by Spark, increasingly mature. However, current mainstream distributed memory models and computing frameworks struggle to balance ease of use with computing performance, and they exhibit deficiencies in data format definition, memory allocation, and memory utilization efficiency. A parallel computing method based on distributed datasets is proposed, which optimizes in-memory computing from two perspectives: model theory and system overhead. Theoretically, the computation process is modeled and analyzed to address Spark's limited expressive ability in scientific computing environments, and an overhead model of the computing framework is given to support subsequent performance optimization. At the system level, a framework-level memory optimization method is proposed, whose main modules cover the reconstruction of cross-language distributed in-memory datasets, the management of distributed shared memory, and the optimization of the message passing process. Experimental results show that the parallel computing framework implemented with this optimization method significantly improves the memory allocation efficiency of datasets, reduces serialization/deserialization overhead, and alleviates memory usage pressure; the execution time of application tests is reduced by 69%-92% compared with Spark.
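The serialization/deserialization savings the abstract describes can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): a Spark-style exchange must serialize a dataset partition before handing it to another process, whereas a shared-memory design lets a reader attach to the same in-memory partition by name with no copy. The sketch below uses Python's standard-library `multiprocessing.shared_memory` purely as a stand-in for the paper's distributed shared memory management.

```python
import pickle
from multiprocessing import shared_memory

# A large in-memory "dataset partition" (10 MB of raw bytes).
data = bytes(10 * 1024 * 1024)

# Baseline path: a Spark-style exchange serializes the partition
# before transfer, paying a copy plus (de)serialization cost.
blob = pickle.dumps(data)

# Shared-memory path: place the partition in OS shared memory once;
# cooperating processes attach to it by name, with no serialization.
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[:len(data)] = data

# A reader (here, a second handle; in practice another process)
# attaches by name and sees the same bytes directly.
reader = shared_memory.SharedMemory(name=shm.name)
assert bytes(reader.buf[:4]) == data[:4]

# Release the shared segment.
reader.close()
shm.close()
shm.unlink()
```

The point of the comparison is that the serialized path materializes a second full copy of the partition (`blob`), while the shared-memory path exposes the original buffer to readers in place, which is the kind of overhead the proposed framework-level optimization targets.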

Key words: memory computing, parallel computing, memory optimization, Spark framework, Message Passing Interface(MPI)
