
Computer Engineering ›› 2018, Vol. 44 ›› Issue (12): 46-55. doi: 10.19678/j.issn.1000-3428.0050299

• Advanced Computing and Data Processing •

Study of Data-driven Self-adaptive Fault-tolerant Technology

LIU Ruitao 1, CHEN Zuoning 2

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214215, China; 2. National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
  • Received: 2018-01-26 Online: 2018-12-15 Published: 2018-12-15
  • About the authors: LIU Ruitao (born 1977), male, Ph.D. candidate; research interests include high-performance computing, parallel systems, and fault-tolerant technology. CHEN Zuoning, doctoral supervisor, academician of the Chinese Academy of Engineering.
  • Supported by:

    National Key Research and Development Program of China (2016YFB0200502)

Abstract:

Building a hierarchical failure model from system fault data helps optimize checkpointing, improve system availability, and address the reliability challenges of future exascale computing. Taking the Sunway TaihuLight system as the research subject, this paper introduces its fault collection, classification, and handling mechanisms. Based on actual fault data, it establishes a multi-layer failure model, comprising a fine-grained failure distribution model and an application-level failure model, to quantify the reliability of the application runtime environment. Building on this model, it analyzes an adaptive checkpoint fault-tolerance optimization model that provides a theoretical and engineering basis for checkpoint optimization. A checkpoint optimization analysis on Sunway TaihuLight shows that the data-driven adaptive fault-tolerance model can effectively reduce system checkpoint overhead.
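The abstract does not spell out the optimization model itself, so the following is only a minimal sketch of the general idea of deriving a checkpoint interval from failure data. It uses Young's classic first-order approximation rather than the authors' model, and it assumes independent, exponentially distributed node failures. All function names and numeric values are hypothetical and chosen for illustration only.

```python
import math


def app_mtbf(node_mtbf_s: float, nodes: int) -> float:
    """Application-level MTBF for a job spanning `nodes` nodes.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, so failure rates simply add up.
    """
    if node_mtbf_s <= 0 or nodes <= 0:
        raise ValueError("node_mtbf_s and nodes must be positive")
    return node_mtbf_s / nodes


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint_cost_s and mtbf_s must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> float:
    """First-order wasted-time fraction: C / tau + tau / (2 * M).

    The first term is time spent writing checkpoints; the second is the
    expected rework after a failure (half an interval on average).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from Sunway TaihuLight.
    mtbf = app_mtbf(node_mtbf_s=1_000_000.0, nodes=1000)
    cost = 5.0
    tau = young_interval(cost, mtbf)
    print(f"application MTBF: {mtbf:.0f} s")
    print(f"adaptive interval: {tau:.0f} s, "
          f"overhead {overhead_fraction(tau, cost, mtbf):.3f}")
    print(f"fixed 400 s interval: "
          f"overhead {overhead_fraction(400.0, cost, mtbf):.3f}")
```

In this sketch the interval adapts to the job size: a job on more nodes has a shorter application-level MTBF and therefore a shorter recommended interval. The approximation is only reasonable when the checkpoint cost is much smaller than the MTBF.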

Key words: supercomputer, failure model, data-driven, self-adaptive, fault-tolerant technology

CLC Number: