A Novel Interleaved Mapping Data Layout in Scratch Pad Memory

Computer Engineering, 2024, Vol. 50, Issue (5): 33-40. doi: 10.19678/j.issn.1000-3428.0067271

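ZENG Lingling¹, ZHANG Dunbo¹, SHEN Li¹, DOU Qiang²

1. School of Computer, National University of Defense Technology, Changsha 410073, Hunan, China;
2. Phytium Technology Co., Ltd., Tianjin 300457, China

Received: 2023-03-27 Revised: 2023-07-23 Publication date: 2024-05-15 Release date: 2023-09-05
Corresponding author: ZENG Lingling, E-mail: lling_zeng@nudt.edu.cn
Funding:
General Program of the National Natural Science Foundation of China (61972407).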

Abstract

Abstract: Modern computers still use the conventional linear data layout, which gives efficient row-major access to Two-Dimensional (2D) matrices stored in row-major order but makes efficient column-major access difficult, so column-major access has poor spatial locality. A common remedy is to transpose the original matrix in advance, concentrating the cost of column-major access in a single transposition. However, matrix transposition introduces extra data transfers and consumes extra storage for the transposed matrix. To make row-major and column-major access equally efficient without this overhead, a novel Interleaved Mapping (IM) data layout is proposed. Without changing the internal structure of the Scratch Pad Memory (SPM), two new components, a cyclic shift unit and a decoder unit, are added at the Input and Output (I/O) interface of the SPM to implement the layout, and customized memory access instructions are provided so that programmers can take full advantage of it. Experimental results show that an SPM using the IM data layout achieves a 1.4x speedup at the cost of only 1.73% additional area.

Key words: matrix transposition, Single Instruction Multiple Data (SIMD), Scratch Pad Memory (SPM), data layout, Static Random Access Memory (SRAM)
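
The abstract does not spell out the IM mapping itself, so the following is only a minimal sketch of the general idea, assuming a classic skewed (diagonal) bank mapping in which row r is cyclically shifted by r positions before it is written. The names `N_BANKS`, `linear_bank`, `interleaved_bank`, and `cyclic_shift_right` are hypothetical and chosen for illustration; the paper's actual mapping and hardware units may differ. The sketch shows why a column read conflicts on a single bank under the linear layout but spreads across all banks under a skewed one.

```python
# Illustrative sketch only: a classic skewed (diagonal) bank mapping is
# ASSUMED here; it is not taken from the paper and may differ from its IM layout.

N_BANKS = 4  # hypothetical number of SRAM banks (also the matrix width here)


def linear_bank(row, col):
    """Conventional row-major layout: the bank depends only on the column."""
    return col % N_BANKS


def interleaved_bank(row, col):
    """Assumed skewed layout: each row is cyclically shifted by its row index."""
    return (row + col) % N_BANKS


def banks_for_row(mapping, row):
    """Banks touched when reading one full row of N_BANKS elements."""
    return [mapping(row, c) for c in range(N_BANKS)]


def banks_for_column(mapping, col):
    """Banks touched when reading one full column of N_BANKS elements."""
    return [mapping(r, col) for r in range(N_BANKS)]


def conflict_free(banks):
    """An access is conflict-free when every element sits in a distinct bank."""
    return len(set(banks)) == len(banks)


def cyclic_shift_right(values, shift):
    """Rotate a vector right by `shift` positions, as a shift unit might on write."""
    shift %= len(values)
    return values[-shift:] + values[:-shift] if shift else list(values)


if __name__ == "__main__":
    for name, mapping in (("linear", linear_bank), ("interleaved", interleaved_bank)):
        row_ok = conflict_free(banks_for_row(mapping, 2))
        col_ok = conflict_free(banks_for_column(mapping, 1))
        print(f"{name}: row conflict-free={row_ok}, column conflict-free={col_ok}")
```

Under these assumptions, both row and column reads touch every bank exactly once in the skewed layout, which is the property that lets both access directions proceed at full width; a read path would then need the inverse rotation to restore element order.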

CLC Number: TP302.7

ZENG Lingling, ZHANG Dunbo, SHEN Li, DOU Qiang. A Novel Interleaved Mapping Data Layout in Scratch Pad Memory[J]. Computer Engineering, 2024, 50(5): 33-40.

https://www.ecice06.com/CN/Y2024/V50/I5/33
