[1] WINDLEY P F. Transposing matrices in a digital computer[J]. The Computer Journal, 1959, 2(1):47-48.
[2] GODARD P, LOECHNER V, BASTOUL C. Efficient out-of-core and out-of-place rectangular matrix transposition and rotation[J]. IEEE Transactions on Computers, 2021, 70(11):1942-1948.
[3] EKLUNDH J O. A fast computer method for matrix transposing[J]. IEEE Transactions on Computers, 1972, 21(7):801-803.
[4] KAUSHIK S D, HUANG C H, JOHNSON R W, et al. Efficient transposition algorithms for large matrices[C]//Proceedings of 1993 ACM/IEEE conference on Supercomputing. New York, USA: ACM Press, 1993:656-665.
[5] ZEKRI A S. Restructuring and implementations of 2D matrix transpose algorithm using SSE4 vector instructions[C]//Proceedings of International Conference on Applied Research in Computer Science and Engineering. Washington D.C., USA: IEEE Press, 2015:1-7.
[6] GUSTAVSON F, KARLSSON L, KÅGSTRÖM B. Parallel and cache-efficient in-place matrix storage format conversion[J]. ACM Transactions on Mathematical Software, 2012, 38(3):17.
[7] CATANZARO B, KELLER A, GARLAND M. A decomposition for in-place matrix transposition[J]. ACM SIGPLAN Notices, 2014, 49(8):193-206.
[8] GOMEZ-LUNA J, SUNG I J, CHANG L W, et al. In-place matrix transposition on GPUs[J]. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(3):776-788.
[9] ZHANG B, MA Z G, LUO W. Parallel pipelined architecture and algorithm for matrix transposition using registers[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(3):1627-1631.
[10] BANAKAR R, STEINKE S, LEE B S, et al. Scratchpad memory: a design alternative for cache on-chip memory in embedded systems[C]//Proceedings of the 10th International Symposium on Hardware/Software Codesign. New York, USA: ACM Press, 2002:73-78.
[11] CHEN S M, WANG Y H, LIU S, et al. FT-Matrix: a coordination-aware architecture for signal processing[J]. IEEE Micro, 2014, 34(6):64-73.
[12] LIAO H, TU J J, XIA J, et al. DaVinci: a scalable architecture for neural network computing[C]//Proceedings of IEEE Hot Chips 31 Symposium (HCS). Washington D.C., USA: IEEE Press, 2019:1-44.
[13] BERMAN M F. A method for transposing a matrix[J]. Journal of the ACM, 1958, 5(4):383-384.
[14] DOW M. Transposing a matrix on a vector computer[J]. Parallel Computing, 1995, 21(12):1997-2005.
[15] CHATTERJEE S, SEN S. Cache-efficient matrix transposition[C]//Proceedings of the 6th International Symposium on High-Performance Computer Architecture. Washington D.C., USA: IEEE Press, 2000:195-205.
[16] RUETSCH G, MICIKEVICIUS P. Optimizing matrix transpose in CUDA[EB/OL]. [2023-02-11]. https://dmacssite.github.io/materials/MatrixTranspose.pdf.
[17] AGGARWAL A, VITTER J S. The input/output complexity of sorting and related problems[J]. Communications of the ACM, 1988, 31(9):1116-1127.
[18] CHEN P M, LEE E K, GIBSON G A, et al. RAID: high-performance, reliable secondary storage[J]. ACM Computing Surveys, 1994, 26(2):145-185.
[19] WANG Y H, LI C, LIU C, et al. Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions[J]. CCF Transactions on High Performance Computing, 2021, 3(1):114-125.
[20] BLACKFORD L S, PETITET A, POZO R, et al. An updated set of Basic Linear Algebra Subprograms (BLAS)[J]. ACM Transactions on Mathematical Software, 2002, 28(2):135-151.
[21] LEI Y W, CHEN X W, PENG Y X. A high energy efficiency FFT accelerator on DSP chip[J]. Journal of Computer Research and Development, 2016, 53(7):1438-1446. (in Chinese)
[22] VAN LOAN C F. Computational frameworks for the fast Fourier transform[M]. Philadelphia, USA: Society for Industrial and Applied Mathematics, 1992.
[23] SPINEAN B, GAYDADJIEV G. Implementation study of FFT on multi-lane vector processors[C]//Proceedings of the 15th Euromicro Conference on Digital System Design. Washington D.C., USA: IEEE Press, 2012:815-822.
[24] RANEY R K, RUNGE H, BAMLER R, et al. Precision SAR processing using chirp scaling[J]. IEEE Transactions on Geoscience and Remote Sensing, 1994, 32(4):786-799.
[25] LIN T, XIE Y Z, LIU W. Implementation of matrix in-place transpose for real-time SAR imaging system[J]. Computer Engineering, 2013, 39(6):319-321. (in Chinese)