Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 206-213. doi: 10.19678/j.issn.1000-3428.0067210

• Computer Architecture and Software Technology •

Vectorization Optimization of LDS Memory Access for DCU

Sichi YANG1,*, Rongcai ZHAO1,2, Lin HAN1,2, Hongsheng WANG2

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, Henan, China
  2. National Supercomputing Center in Zhengzhou, Zhengzhou 450000, Henan, China
  • Received: 2023-03-20 Online: 2024-02-15 Published: 2023-06-26
  • Contact: Sichi YANG

  • Supported by:
    Major Science and Technology Project of Henan Province (221100210600)

Abstract:

In the Deep Computing Unit (DCU), a domestic general-purpose accelerator, the Local Data Shared (LDS) is a key storage component with lower latency and higher bandwidth than global memory. As heterogeneous programs use the LDS more frequently, its low memory access efficiency has become an important factor limiting their performance. In addition, because LDS accesses are subject to bank conflicts, the LDS must be accessed according to certain principles to be used efficiently; when the data accessed by different threads overlap, vectorized access instructions incur delays. To address this problem, an LDS memory access vectorization optimization method for the DCU is proposed. By vectorizing accesses to contiguous data, the method reduces the number of LDS accesses and the time spent on memory access, thereby improving the memory access efficiency of programs. On this basis, a method for determining memory access characteristics is designed, and an LDS access vectorization method that effectively handles overlapping data is proposed, realizing an efficient LDS memory access technique for domestic general-purpose accelerators and ensuring that vectorization effectively improves memory access efficiency. The experimental results show that, for heterogeneous programs that use the LDS, program performance improves by an average of 22.6% after LDS access vectorization is implemented, which verifies the effectiveness of the proposed method. In addition, the vectorization method optimizes the case in which memory access data overlap between LDS threads, improving the performance of heterogeneous programs by an average of 30%.

Key words: Deep Computing Unit (DCU), Local Data Shared (LDS), memory access vectorization, memory access characteristic, bank conflict
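
As a rough illustration of the ideas summarized in the abstract (vectorizing contiguous accesses to reduce the number of LDS accesses, determining whether the data accessed by different threads overlap, and bank conflicts), the following Python sketch may help. It is a simplified model and not the paper's implementation: the function names, the assumption that thread t reads a contiguous range starting at element t × stride, and the layout of 32 banks with consecutive words mapped to consecutive banks are all hypothetical choices made for illustration.

```python
"""Toy model of LDS access behavior; illustrative only, not the paper's method."""


def load_instructions(elems_per_thread: int, vector_width: int = 1) -> int:
    """Loads one thread issues to read `elems_per_thread` contiguous elements
    when each load instruction fetches `vector_width` elements."""
    if elems_per_thread < 0 or vector_width < 1:
        raise ValueError("need elems_per_thread >= 0 and vector_width >= 1")
    # Ceiling division: a partially filled final vector still costs one load.
    return (elems_per_thread + vector_width - 1) // vector_width


def ranges_overlap(stride: int, elems_per_thread: int) -> bool:
    """True when adjacent threads read overlapping element ranges.

    Assumption: thread t reads [t*stride, t*stride + elems_per_thread).
    """
    if stride < 0 or elems_per_thread < 0:
        raise ValueError("stride and elems_per_thread must be non-negative")
    # Thread t+1 starts before thread t's range ends iff stride < range length.
    return stride < elems_per_thread


def conflict_degree(word_addrs, num_banks: int = 32) -> int:
    """Largest number of distinct words that map to one bank in a single access.

    Assumption: word address w lives in bank w % num_banks. Threads reading
    the same word are counted once (modeled here as a broadcast).
    """
    words_per_bank = {}
    for addr in word_addrs:
        words_per_bank.setdefault(addr % num_banks, set()).add(addr)
    return max((len(words) for words in words_per_bank.values()), default=0)


if __name__ == "__main__":
    print(load_instructions(8, 1))           # 8 scalar loads
    print(load_instructions(8, 4))           # 2 four-wide vector loads
    print(ranges_overlap(1, 4))              # True: stencil-like access
    print(ranges_overlap(4, 4))              # False: disjoint ranges
    print(conflict_degree(range(0, 64, 2)))  # 2: stride-2 words share banks
```

In this model, widening each load from one element to four cuts the per-thread load count from 8 to 2, while a stride smaller than the per-thread range marks the overlapping pattern that the abstract singles out for separate treatment. How real hardware schedules wide loads across banks is more involved than this sketch suggests.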
