
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 46-54. doi: 10.19678/j.issn.1000-3428.0066860

• Frontier Technology of Computer Systems •

Design and Implementation of Data-Flow Runtime System on Shenwei Processor

Pengfei ZHANG, Junshi CHEN, Zhong ZHENG, Peiqi SHEN, Hong AN, Le XU

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Received: 2023-02-06 Online: 2023-12-15 Published: 2023-12-14
  • About the authors:

    Pengfei ZHANG (born 1997), male, master's student; main research interest: high-performance computing

    Junshi CHEN, Ph.D., associate research fellow (special appointment)

    Zhong ZHENG, master's student

    Peiqi SHEN, master's student

    Hong AN, professor, Ph.D., doctoral supervisor

    Le XU, master's student

  • Funding:
    National Natural Science Foundation of China (62102389)

Abstract:

China's independently developed new-generation Sunway heterogeneous many-core computing platform is programmed mainly with the athread heterogeneous programming method. athread follows a bulk synchronous parallel model, which makes it difficult to fully exploit the fine-grained parallelism in a program, and its synchronization scheme makes it difficult to balance task load across the many cores. The data-flow parallel programming model addresses both problems well because of its inherent parallelism and point-to-point synchronization. Based on the Codelet program execution model and the master-slave core architecture of the Shenwei processor, this paper designs and implements swTasklet, a data-flow runtime system for the Shenwei processor. By further refining the functions of Codelets and mapping the Codelet machine model onto the master and slave cores, swTasklet avoids synchronization operations on the slave core array and reduces synchronization overhead. The master core schedules and assigns computing tasks to the slave cores, which separates computation from synchronization and allows the runtime system to be used together with slave-core computing libraries. NPB LU and vector-vector addition are used as test cases, and the swTasklet and athread implementations are parallelized with the same optimization methods. The experimental results show that, at large problem scales, the swTasklet version of LU is 16% faster than the athread version, and the swTasklet version of vector-vector addition is twice as fast as the athread version. Compared with the master-core version, the swTasklet version of LU achieves an average speedup of more than 8 times, and the swTasklet version of vector-vector addition achieves a speedup of about 30 times.

Key words: Shenwei heterogeneous processor, data-flow runtime system, Codelet program execution model, parallel programming model, many-core acceleration
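
The scheduling idea summarized in the abstract (tasks that fire once their inputs are ready, with a master that dispatches them and handles all synchronization) can be illustrated with a small sketch. This is not the swTasklet API, which the abstract does not describe: the names `Codelet`, `run_graph`, and `vector_add_demo` are hypothetical, and a Python thread pool merely stands in for the slave core array. The sketch assumes each task carries a dependency counter that the master decrements when a predecessor finishes.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


class Codelet:
    """Hypothetical task that becomes ready once all its dependencies finish."""

    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.pending = len(deps)   # dependency counter: signals still awaited
        self.successors = []       # tasks to signal when this one completes
        for dep in deps:
            dep.successors.append(self)


def run_graph(codelets, num_workers=4):
    """Master loop: dispatch ready tasks to workers, then signal successors.

    Workers only compute; every counter update happens here on the master,
    so the workers never synchronize with each other.
    """
    ready = deque(c for c in codelets if c.pending == 0)
    in_flight = {}                 # future -> codelet
    order = []                     # completion order, for inspection
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while ready or in_flight:
            while ready:
                c = ready.popleft()
                in_flight[pool.submit(c.fn)] = c
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for fut in done:
                c = in_flight.pop(fut)
                fut.result()       # re-raise any exception from the worker
                order.append(c.name)
                for s in c.successors:
                    s.pending -= 1
                    if s.pending == 0:
                        ready.append(s)
    return order


def vector_add_demo(n=1000, chunk=256):
    """Chunked vector-vector addition followed by a dependent checksum task."""
    a = list(range(n))
    b = [2 * x for x in a]
    c = [0] * n
    total = []

    def make_add(lo, hi):
        def add():
            for i in range(lo, hi):   # each task writes a disjoint slice of c
                c[i] = a[i] + b[i]
        return add

    adds = [Codelet(f"add@{lo}", make_add(lo, min(lo + chunk, n)))
            for lo in range(0, n, chunk)]
    check = Codelet("sum", lambda: total.append(sum(c)), deps=adds)
    order = run_graph(adds + [check])
    return c, total[0], order
```

Calling `vector_add_demo()` runs the chunked additions in whatever order the workers finish them, while the `sum` task runs last because its counter reaches zero only after every chunk has completed. No barrier across the workers is needed, which is the point-to-point style of synchronization the abstract contrasts with a bulk synchronous model.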