[1] Barker K J, Davis K, Hoisie A, et al. Entering the Petaflop Era: The Architecture and Performance of Roadrunner[C]//Proc. of the ACM/IEEE Conference on Supercomputing. Piscataway, USA: IEEE Press, 2008.
[2] Wulf W A, McKee S A. Hitting the Memory Wall: Implications of the Obvious[J]. Computer Architecture News, 1995, 23(1): 20-24.
[3] Margolus N. An Embedded DRAM Architecture for Large-scale Spatial-lattice Computations[C]//Proc. of the 27th Annual International Symposium on Computer Architecture. New York, USA: ACM Press, 2000: 149-160.
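[4] Zhang Ying, Yang Xuejun, Tang Yuhua, et al. PIM: A Technique That Can Effectively Alleviate the Memory Wall Problem[J]. Journal of Computer Research and Development, 2004, 41(Suppl.): 347-351. (in Chinese)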
[5] Wu Dan, Dai Kui, Zou Xuecheng, et al. A High Efficient On-chip Interconnection Network in SIMD CMPs[C]//Proc. of the 10th International Conference on Algorithms and Architecture for Parallel Processing. Busan, Korea: [s. n.], 2010: 149-162.
[6] Chen Pan, Dai Kui, Wu Dan, et al. The Parallel Algorithm Implementation of Matrix Multiplication Based on ESCA[C]//Proc. of the IEEE Asia Pacific Conference on Circuits and Systems. [S. l.]: IEEE Press, 2010.
[7] Huang Anwen, Gao Jun, Zhang Minxuan. Research on On-chip Memory Systems of Multi-core Processors[J]. Computer Engineering, 2010, 36(4): 4-6. (in Chinese)
[8] Rixner S. Stream Processor Architecture[M]. Norwell, USA: Kluwer Academic Publishers, 2001.
[9] Lin Haibo, Xie Haibo, Shao Ling, et al. Cell BE Processor Programming Guide[M]. Beijing: Publishing House of Electronics Industry, 2008: 102-104. (in Chinese)
[10] Lee Hyuk-Jae, Robertson J P. Generalized Cannon’s Algorithm for Parallel Matrix Multiplication[C]//Proc. of the 11th International Conference on Supercomputing. New York, USA: [s. n.], 1997: 44-51.