Design and Implementation of Data-Flow Runtime System on Shenwei Processor

doi:10.19678/j.issn.1000-3428.0066860

Abstract

The domestic Sunway heterogeneous many-core computing platform mainly uses the athread heterogeneous programming method. athread follows the bulk synchronous parallel model, which makes it hard to exploit fine-grained parallelism, and its synchronization method struggles to balance the workload across tasks and computing cores. The data-flow parallel programming model addresses these problems well because of its natural parallelism and point-to-point synchronization. swTasklet, a data-flow runtime system for the Shenwei processor, is designed and implemented based on the Codelet program execution model and the Shenwei master-slave core architecture. By further refining Codelet functions and mapping the Codelet abstract machine model onto the master and slave cores, it avoids synchronization operations on the slave core array and reduces synchronization overhead. Task scheduling and dispatch for the slave cores are handled by the master core, which separates computation from synchronization and lets the runtime be used together with slave-core computing libraries. To evaluate swTasklet, NPB LU and vector-vector addition are used as case studies, with the swTasklet and athread implementations parallelized using the same optimization methods. Compared with the master-core versions, the swTasklet implementation of LU achieves an average speedup of more than 8, and vector-vector addition achieves a speedup of about 30. At large problem scales, the swTasklet implementation of LU is 16% faster than the athread version, and vector-vector addition is twice as fast as the athread version.

Keywords: Shenwei heterogeneous processor, data-flow runtime system, Codelet program execution model, parallel programming model, many-core acceleration

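Abstract (translated from the Chinese):

The new generation of the domestically developed Sunway heterogeneous many-core computing platform mainly uses the athread heterogeneous programming method. athread programming belongs to the bulk synchronous parallel model, which makes it difficult to fully exploit the fine-grained parallelism in a program, and its synchronization method makes it difficult to balance task load across the many cores. The data-flow parallel programming model can solve these problems well because of its natural parallelism and point-to-point synchronization. Based on the Codelet program execution model and the characteristics of the Shenwei master-slave core architecture, swTasklet, a data-flow runtime system for the Shenwei processor, is designed and implemented. By further refining Codelet functions and mapping the Codelet machine model onto the master and slave cores, it avoids synchronization operations on the slave core array and reduces synchronization overhead. The master core schedules and dispatches the computing tasks of the slave cores, separating computation from synchronization and ensuring that the runtime system can be used together with slave-core computing libraries. The experiments use the NPB LU program and vector-vector addition as test cases, and parallelize the swTasklet and athread implementations with the same optimization methods. The results show that, at large problem scales, the swTasklet version of LU is 16% faster than the athread version, and the swTasklet version of vector-vector addition is twice as fast as the athread version. The swTasklet parallel version of LU achieves an average speedup of more than 8 over the master-core version, and the swTasklet version of vector-vector addition achieves a speedup of about 30 over the master-core version.

Keywords: Shenwei heterogeneous processor, data-flow runtime system, Codelet program execution model, parallel programming model, many-core acceleration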
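
The point-to-point synchronization described in the abstract can be illustrated with a small sketch: each task carries a counter of unmet dependencies, a finished task signals only its own successors, and a task fires when its counter reaches zero, so no global barrier is needed. This is not swTasklet code. The names `Codelet`, `Scheduler`, `depend`, and `vector_add_demo` are hypothetical, and everything runs in one thread for clarity, with the `Scheduler` loop standing in for the master core and the task bodies standing in for work that would be dispatched to slave cores.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical names throughout; an illustration, not the swTasklet API.
struct Codelet {
    std::function<void()> body;           // computation (slave-core work in a real runtime)
    int pending = 0;                      // number of unmet dependencies
    std::vector<std::size_t> successors;  // codelets to signal on completion
};

// Stands in for the master core: it owns all dependency bookkeeping,
// so the task bodies never synchronize with each other.
class Scheduler {
public:
    std::size_t add(std::function<void()> body) {
        Codelet c;
        c.body = std::move(body);
        codelets_.push_back(std::move(c));
        return codelets_.size() - 1;
    }

    // Declare that codelet `to` must wait for codelet `from`.
    void depend(std::size_t from, std::size_t to) {
        codelets_[from].successors.push_back(to);
        ++codelets_[to].pending;
    }

    // Fire codelets as their counters reach zero; returns how many fired.
    // Counters are consumed, so a Scheduler is meant to be run once.
    std::size_t run() {
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < codelets_.size(); ++i)
            if (codelets_[i].pending == 0) ready.push(i);

        std::size_t fired = 0;
        while (!ready.empty()) {
            std::size_t id = ready.front();
            ready.pop();
            codelets_[id].body();
            ++fired;
            // Point-to-point signalling: only direct successors are touched.
            for (std::size_t s : codelets_[id].successors)
                if (--codelets_[s].pending == 0) ready.push(s);
        }
        return fired;
    }

private:
    std::vector<Codelet> codelets_;
};

// Splits c = a + b into `tasks` chunk codelets plus one codelet that sums c
// after every chunk has signalled it. Returns the sum of c.
double vector_add_demo(std::size_t n, std::size_t tasks) {
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    double total = 0.0;
    Scheduler master;

    std::size_t sum_id = master.add([&] {
        for (double v : c) total += v;
    });
    for (std::size_t t = 0; t < tasks; ++t) {
        std::size_t lo = t * n / tasks;
        std::size_t hi = (t + 1) * n / tasks;
        std::size_t id = master.add([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) c[i] = a[i] + b[i];
        });
        master.depend(id, sum_id);
    }

    std::size_t fired = master.run();
    assert(fired == tasks + 1);  // every chunk plus the final sum
    return total;
}
```

Calling `vector_add_demo(1000, 64)` splits the addition into 64 chunk tasks and returns 3000.0, since every element of `c` ends up as 3.0. In this sketch the summing codelet waits only on the chunks it depends on, which is the contrast with a fork-join barrier that the data-flow model draws.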

Pengfei ZHANG, Junshi CHEN, Zhong ZHENG, Peiqi SHEN, Hong AN, Le XU. Design and Implementation of Data-Flow Runtime System on Shenwei Processor[J]. Computer Engineering, 2023, 49(12): 46-54.

张鹏飞, 陈俊仕, 郑重, 沈沛祺, 安虹, 许乐. 申威处理器上数据流运行时系统的设计与实现[J]. 计算机工程, 2023, 49(12): 46-54.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0066860

http://www.ecice06.com/EN/Y2023/V49/I12/46

Figures/Tables 14

Fig.1 Shenwei heterogeneous many-core processor architecture

Fig.2 Shenwei heterogeneous acceleration programming

Fig.3 Codelet program execution model

Fig.4 Codelet abstract machine model

Fig.5 Abstract machine model mapping

Fig.6 Main classes and their programming interfaces in swTasklet

Fig.7 Master core run mode in swTasklet

Fig.8 Master-slave coordination run mode in swTasklet

Fig.9 Speedup of vector-vector addition at different problem scales with the task number fixed at 64

Fig.10 Speedup of vector-vector addition with different task numbers

Fig.11 Data dependencies in the upper and lower triangular systems

Fig.12 LU task dependencies in fork-join model

Fig.13 LU task dependencies in dataflow model

Fig.14 Speedup of LU at different problem sizes

