Computer Engineering ›› 2019, Vol. 45 ›› Issue (5): 35-45. doi: 10.19678/j.issn.1000-3428.0050249

• Architecture and Software Technology •

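Analysis of Supercomputer Memory Faults Based on Statistical Data

LIU Ruitao 1, CHEN Zuoning 2

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214215, China; 2. National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
  • Received: 2018-01-23 Online: 2019-05-15 Published: 2019-05-15
  • About the authors: LIU Ruitao (born 1977), male, Ph.D. candidate; main research interests: high-performance computing, parallel operating systems, and big data technology. CHEN Zuoning, academician of the Chinese Academy of Engineering and doctoral supervisor.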

Abstract:

Using a large volume of memory fault statistics from the Sunway TaihuLight and Sunway BlueLight supercomputers, this paper builds a memory failure time model for petascale supercomputers. Sequential rule mining is applied to analyze the sequential patterns of memory failures, yielding the correlation between memory failure sequences on CPU nodes and the memory failures that follow them. A co-analysis method is then used to study the characteristics of memory faults and memory failures under parallel applications. The results show that compute-, memory-access- and I/O-intensive applications have a large impact on memory faults, whereas application type has only a limited impact on memory failures, which may instead be related to the reliability of the memory chips themselves.

Key words: supercomputer, memory fault, memory failure, statistical data, failure model, correlation relationship, co-analysis
