Computer Engineering ›› 2007, Vol. 33 ›› Issue (05): 217-219. doi: 10.3969/j.issn.1000-3428.2007.05.077

• Engineer Application Technology and Realization •

A Parallel Checkpointing System Based on Communication System Support

HUO Zhigang1, MA Jie2, SUN Ninghui2   

  (1. Graduate School of Chinese Academy of Sciences, Beijing 100080; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080)
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-03-05 Published:2007-03-05

一个基于通信系统支持的并行检查点系统

霍志刚1,马 捷2,孙凝晖2   

  (1. 中国科学院研究生院,北京 100080;2. 中国科学院计算技术研究所,北京 100080)

Abstract: Checkpointing and recovery systems are growing in importance in large-scale clusters. A non-blocking coordinated checkpointing and recovery system is proposed in which the reliable communication mechanisms of the cluster communication system are used to eliminate the overhead of global synchronization. It is shown that a parallel checkpointing system can use support embedded in the low-level communication system both to simplify its implementation and to improve its performance.

Key words: Cluster communication system, Parallel checkpointing, Fault tolerance

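摘要 (Abstract, translated from the Chinese): In large-scale cluster environments, checkpointing and recovery is an indispensable fault-tolerance technique. This paper proposes a method that relies on the reliability mechanisms of the cluster communication system to capture the global state of the communication system without performing global synchronization, and uses this method to implement a parallel checkpointing system that is transparent to applications. With support from the underlying communication system, the system reduces both the implementation complexity and the execution overhead of parallel checkpointing, and it is suitable for large-scale cluster applications.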

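关键词 (Key words, translated from the Chinese): Cluster communication system, Parallel checkpointing, Fault-tolerance techniques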
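
The abstract does not describe the protocol itself, so the following is only an illustrative sketch of the general idea of non-blocking coordinated checkpointing over reliable FIFO channels. It uses the classic Chandy-Lamport marker algorithm as a stand-in, not the authors' method. All names here (`Process`, `Network`, `run`, the money-transfer workload) are hypothetical and chosen for the example. Application messages keep flowing while the checkpoint is taken, and no global barrier is used.

```python
import random
from collections import deque

MARKER = "MARKER"  # control message separating pre- and post-checkpoint traffic


class Network:
    """Reliable FIFO channels, one queue per (src, dst) pair (an assumption)."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver_one(self, procs, rng):
        # Deliver the head of one randomly chosen non-empty channel.
        pending = [key for key, queue in self.channels.items() if queue]
        if not pending:
            return False
        src, dst = rng.choice(pending)
        procs[dst].on_message(src, self.channels[(src, dst)].popleft(), self)
        return True


class Process:
    def __init__(self, pid, all_pids, balance):
        self.pid = pid
        self.peers = [p for p in all_pids if p != pid]
        self.balance = balance
        self.saved_balance = None  # local checkpoint, None until taken
        self.recording = {}        # src -> messages seen while still recording
        self.channel_state = {}    # src -> in-flight messages in the checkpoint

    def start_checkpoint(self, net):
        # Save local state, record every incoming channel, notify all peers.
        self.saved_balance = self.balance
        self.recording = {src: [] for src in self.peers}
        for dst in self.peers:
            net.send(self.pid, dst, MARKER)

    def on_message(self, src, msg, net):
        if msg == MARKER:
            if self.saved_balance is None:
                self.start_checkpoint(net)
            # A marker closes the recording of the channel it arrived on.
            self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg
            if src in self.recording:
                self.recording[src].append(msg)

    def transfer(self, dst, amount, net):
        self.balance -= amount
        net.send(self.pid, dst, amount)


def run(seed=0, n=4, start=100):
    rng = random.Random(seed)
    net = Network()
    procs = {p: Process(p, range(n), start) for p in range(n)}
    for step in range(200):
        if step == 20:
            procs[0].start_checkpoint(net)  # process 0 initiates
        src, dst = rng.sample(range(n), 2)
        procs[src].transfer(dst, rng.randint(1, 5), net)  # never paused
        net.deliver_one(procs, rng)
    while net.deliver_one(procs, rng):  # drain remaining messages
        pass
    saved = sum(p.saved_balance for p in procs.values())
    in_flight = sum(sum(msgs) for p in procs.values()
                    for msgs in p.channel_state.values())
    return saved + in_flight, n * start


if __name__ == "__main__":
    total, expected = run()
    print(total == expected)
```

In this sketch the checkpoint is consistent when the saved balances plus the recorded in-flight transfers add up to the initial total, which is what `run` returns for comparison. The FIFO, lossless channel assumption is what lets the marker separate pre-checkpoint from post-checkpoint messages without stopping the application.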