
Computer Engineering (计算机工程) ›› 2007, Vol. 33 ›› Issue (05): 217-219. doi: 10.3969/j.issn.1000-3428.2007.05.077

• Engineering Application Technology and Implementation •

A Parallel Checkpointing System Based on Communication System Support
(Chinese title: 一个基于通信系统支持的并行检查点系统)

HUO Zhigang1, MA Jie2, SUN Ninghui2 (霍志刚, 马捷, 孙凝晖)

  1. Graduate School of the Chinese Academy of Sciences, Beijing 100080; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
  • Online: 2007-03-05 Published: 2007-03-05

Abstract: In large-scale cluster environments, checkpointing and recovery is an indispensable fault-tolerance technique. This paper proposes a method that uses the reliability mechanisms of the cluster communication system to capture the global state of the communication system without global synchronization, and uses this method to implement a non-blocking coordinated parallel checkpointing system that is transparent to applications. With support from the low-level communication system, the system reduces both the implementation complexity and the execution overhead of parallel checkpointing, which makes it suitable for large-scale cluster applications.

Key words: cluster communication system, parallel checkpointing, fault tolerance
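
The abstract does not describe the protocol itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a reliable channel built from sequence numbers, cumulative acknowledgements, and a retransmit buffer. Under that assumption, each endpoint can save its channel state locally while a message is still in flight, with no global synchronization to drain the network. After a restart, unacknowledged messages are resent and duplicates are dropped by sequence number. The names `Endpoint`, `checkpoint`, `retransmit`, and `demo` are hypothetical and chosen for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical endpoint of a reliable channel (illustrative assumption)."""
    next_send_seq: int = 0
    next_recv_seq: int = 0
    unacked: dict = field(default_factory=dict)    # seq -> payload, kept until acked
    delivered: list = field(default_factory=list)  # payloads handed to the application

    def send(self, payload):
        """Buffer the payload for possible retransmission and return the packet."""
        seq = self.next_send_seq
        self.unacked[seq] = payload
        self.next_send_seq += 1
        return (seq, payload)

    def receive(self, packet):
        """Deliver only the expected packet; return a cumulative ack."""
        seq, payload = packet
        if seq == self.next_recv_seq:
            self.delivered.append(payload)
            self.next_recv_seq += 1
        # Duplicates and out-of-order packets are dropped silently.
        return self.next_recv_seq

    def on_ack(self, ack):
        """Release every buffered message with a sequence number below the ack."""
        for seq in [s for s in self.unacked if s < ack]:
            del self.unacked[seq]

    def checkpoint(self):
        """Save local channel state; no coordination with the peer is needed."""
        return copy.deepcopy(self)

    def retransmit(self):
        """Packets that must be resent after a restart, in sequence order."""
        return sorted(self.unacked.items())


def demo():
    a, b = Endpoint(), Endpoint()
    a.on_ack(b.receive(a.send("m0")))   # m0 is delivered and acknowledged
    a.send("m1")                        # m1 is in flight and never reaches b
    ckpt_a, ckpt_b = a.checkpoint(), b.checkpoint()  # taken without draining m1

    a, b = ckpt_a, ckpt_b               # simulate a failure and a restart
    for packet in a.retransmit() * 2:   # resend twice to show duplicate filtering
        a.on_ack(b.receive(packet))
    return b.delivered, a.unacked
```

In this sketch the in-flight message `m1` survives the restart because it is part of the sender's saved retransmit buffer, and the second copy is discarded by the receiver. A full system would also have to save process state and handle messages sent after one side's checkpoint but received before the other's, which this sketch does not cover.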