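Abstract: In large-scale cluster environments, checkpointing and recovery is an indispensable fault-tolerance technique. This paper proposes a method that builds on the reliability mechanism of the cluster communication system to capture the global state of the communication system without global synchronization, and uses this method to implement a parallel checkpointing system that is transparent to application programs. With support from the underlying communication system, the system reduces the implementation complexity and execution overhead of parallel checkpointing and is suitable for large-scale cluster applications.
Keywords:
Cluster communication system,
Parallel checkpointing,
Fault tolerance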
Abstract: Checkpointing and recovery systems are growing in importance in large-scale clusters. A non-blocking coordinated checkpointing and recovery system is proposed in which reliable communication mechanisms are used to eliminate the overhead of global synchronization. It is shown that support embedded in the low-level communication system can simplify the implementation of a parallel checkpointing system and improve its performance.
Key words:
Cluster communication system,
Parallel checkpointing,
Fault tolerance
霍志刚;马 捷;孙凝晖. 一个基于通信系统支持的并行检查点系统[J]. 计算机工程, 2007, 33(05): 217-219.
HUO Zhigang; MA Jie; SUN Ninghui. A Parallel Checkpointing System Based on Communication System Support[J]. Computer Engineering, 2007, 33(05): 217-219.