Vectorization Optimization of LDS Memory Access for DCU

doi:10.19678/j.issn.1000-3428.0067210

Abstract

In the Deep Computing Unit (DCU), a domestic general-purpose accelerator, Local Data Shared (LDS) is a key storage component with lower latency and higher bandwidth than global memory. As heterogeneous programs use LDS more frequently, low LDS access efficiency has become an important factor limiting their performance. In addition, because LDS accesses are subject to bank conflicts, LDS must be accessed according to certain principles to be used efficiently: when the data accessed by different threads overlap, vectorized access instructions incur delays. To address this problem, an LDS memory access vectorization optimization method for the DCU is proposed. The method vectorizes accesses to contiguous data, which reduces the number of LDS accesses and the time spent on memory access, thereby improving the memory access efficiency of programs. On this basis, a method for determining memory access characteristics is designed, and an LDS access vectorization method that effectively handles overlapping data is proposed, yielding an efficient LDS memory access technique for domestic general-purpose accelerators and ensuring that vectorization actually improves memory access efficiency. Experimental results show that, for heterogeneous programs that use LDS, program performance improves by an average of 22.6% after LDS access vectorization is applied, which verifies the effectiveness of the proposed method. In addition, the vectorization method optimizes the case in which memory access data overlap between LDS threads, improving the performance of heterogeneous programs by an average of 30%.

Key words: Deep Computing Unit (DCU), Local Data Shared (LDS), memory access vectorization, memory access characteristic, bank conflict
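
The idea summarized in the abstract, merging a thread's contiguous LDS reads into fewer, wider reads and checking whether neighbouring threads touch overlapping data, can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the paper's compiler pass: the 4-byte scalar width, the 8- and 16-byte vector widths, the natural-alignment rule, and the names `vectorize_reads` and `threads_overlap` are all hypothetical choices made for illustration.

```python
# Illustrative model only. The widths, the alignment rule, and all names
# below are assumptions for this sketch, not details taken from the paper.
WORD_BYTES = 4               # assumed width of one scalar LDS read
VECTOR_WIDTHS = (16, 8, 4)   # assumed read widths in bytes, widest first


def vectorize_reads(offsets):
    """Greedily merge one thread's contiguous 4-byte reads into wider reads.

    offsets: byte offsets of the thread's 4-byte reads (multiples of 4).
    Returns a list of (start_offset, width_in_bytes) tuples.
    """
    if any(off % WORD_BYTES for off in offsets):
        raise ValueError("offsets must be multiples of WORD_BYTES")
    offsets = sorted(set(offsets))
    reads = []
    i = 0
    while i < len(offsets):
        # Count how many consecutive words follow offsets[i] without a gap.
        run = 1
        while (i + run < len(offsets)
               and offsets[i + run] == offsets[i] + run * WORD_BYTES):
            run += 1
        # Use the widest read that fits in the run and is naturally aligned
        # (the alignment requirement is an assumption of this model).
        for width in VECTOR_WIDTHS:
            words = width // WORD_BYTES
            if words <= run and offsets[i] % width == 0:
                reads.append((offsets[i], width))
                i += words
                break
    return reads


def threads_overlap(thread_stride, bytes_per_thread):
    """True when adjacent threads read overlapping byte ranges."""
    return thread_stride < bytes_per_thread
```

In this model, four contiguous reads starting at offset 0 collapse into a single 16-byte read, while a 12-byte run starting at offset 4 becomes one 4-byte and one 8-byte read. A case where each thread reads 12 bytes but threads are only 4 bytes apart is reported as overlapping by `threads_overlap(4, 12)`, which is the situation the abstract says needs separate handling.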

Sichi YANG, Rongcai ZHAO, Lin HAN, Hongsheng WANG. Vectorization Optimization of LDS Memory Access for DCU[J]. Computer Engineering, 2024, 50(2): 206-213.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067210

http://www.ecice06.com/EN/Y2024/V50/I2/206

Figures/Tables 15

Fig.1 DCU memory structure

Fig.2 Concrete implementation of vectorization

Fig.3 Schematic diagram of reading 4-byte data without bank conflict

Fig.4 Schematic diagram of reading 8-byte data without bank conflict

Fig.5 Overlapping access to 12-byte data

Fig.6 Example of HIP program

Fig.7 Procedure of LDS memory access instruction vectorization algorithm

Fig.8 Saved instructions on vectorizable instruction set

Fig.9 Final implementation result of vectorizable instruction set

Fig.10 %add3.i184 instruction address calculation tree

Fig.11 Procedure of vectorizable condition judgment algorithm

Fig.12 Comparison of IR instructions before and after vectorization
