
Computer Engineering, 2022, Vol. 48, Issue (12): 24-37. doi: 10.19678/j.issn.1000-3428.0065704

• Advanced Computing Technology •

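Research on Maintenance Fault Diagnosis System for E-class High-Performance Computer

JIAN Lantao1, REN Xiujiang2, ZHANG Zhen1, SHI Song1, HUANG Yiming1, ZHANG Chunlin1

  1. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China;
  2. National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
  • Received: 2022-09-07 Revised: 2022-11-08 Published: 2022-12-07
  • About the authors: JIAN Lantao (born 1978), female, associate research fellow, M.S., main research interests: high-performance computer architecture and fault diagnosis; REN Xiujiang, engineer, Ph.D.; ZHANG Zhen, associate research fellow, M.S.; SHI Song, assistant research fellow, M.S.; HUANG Yiming, associate research fellow; ZHANG Chunlin, assistant research fellow, M.S.
  • Funding:
    National Key Research and Development Program of China under the 14th Five-Year Plan (2021YFB0300900, 2021YFB0301000).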



Abstract: E-class (exascale) computer systems are enormous in scale, so the total number of anomalies and faults rises accordingly and faults become harder to detect and diagnose. There is therefore an urgent need for a more accurate and efficient real-time maintenance fault diagnosis system that performs comprehensive real-time detection of anomaly and fault information, fault diagnosis, and fault prediction for the hardware system. When diagnosing systems with tens of thousands of nodes, traditional fault diagnosis systems suffer from low execution efficiency and a high false-positive rate in anomaly detection, and their coverage of anomaly detection and fault diagnosis is insufficient. This study examines technologies for anomaly and fault detection, fault diagnosis, and fault prediction, analyzes their principles and applicability, and, based on the actual engineering requirements of E-class high-performance computers, designs a maintenance fault diagnosis system that meets those requirements. A scalable edge diagnosis architecture is designed around the structural composition of the maintenance system; knowledge of the high-performance computer system and expert knowledge are combined with mathematical statistics and machine learning to produce fault detection, diagnosis, and prediction algorithms; and prediction models are built for specific scenarios. Experimental results show that the system scales well and can complete fault diagnosis of a system with 100 000 nodes within 10 s. Compared with the traditional fault diagnosis system, the false-positive rate of one specific anomaly detection indicator falls from 3.3% to almost zero, hardware fault detection coverage rises from 90.2% to more than 96%, and hardware fault diagnosis coverage rises from 71% to about 94%. The system can also predict faults fairly accurately in several important application scenarios.
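One way to read the reported coverage figures is through the share of faults that remain uncovered. The short sketch below is illustrative arithmetic only, not part of the authors' system: the function names are hypothetical, 96% is treated as a lower bound ("more than 96%"), and 94% is approximate, so the derived reductions are rough estimates.

```python
def residual_rate(coverage_pct: float) -> float:
    """Share of faults NOT covered, in percent (hypothetical helper)."""
    return 100.0 - coverage_pct


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fractional drop in the uncovered share when coverage rises."""
    before = residual_rate(before_pct)
    after = residual_rate(after_pct)
    return (before - after) / before


# Coverage figures quoted in the abstract.
# Assumption: 96.0 stands in for "more than 96%", 94.0 for "about 94%".
detection = relative_reduction(90.2, 96.0)   # hardware fault detection
diagnosis = relative_reduction(71.0, 94.0)   # hardware fault diagnosis

print(f"undetected faults:  {residual_rate(90.2):.1f}% -> {residual_rate(96.0):.1f}% "
      f"(about {detection:.0%} fewer)")
print(f"undiagnosed faults: {residual_rate(71.0):.1f}% -> {residual_rate(94.0):.1f}% "
      f"(about {diagnosis:.0%} fewer)")
```

Under these assumptions, the undetected share falls by roughly three fifths and the undiagnosed share by roughly four fifths, which is a more telling comparison than the raw percentage-point gains.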

Key words: high-performance computing, maintenance system, anomaly detection, fault diagnosis, fault prediction

CLC number: