
Computer Engineering ›› 2007, Vol. 33 ›› Issue (12): 84-86. doi: 10.3969/j.issn.1000-3428.2007.12.030

• Software Technology and Database •

Method for Reducing Checkpoint Overhead of Parallel Program

ZHOU Xiaocheng1,2, SUN Ninghui2, HUO Zhigang1,2, MA Jie2   

  1. Graduate School of Chinese Academy of Sciences, Beijing 100080; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
  • Online: 2007-06-20 Published: 2007-06-20


Abstract: Checkpointing and rollback recovery is an effective way to improve system reliability and implement fault-tolerant computing. Its performance is usually evaluated by the overhead ratio, which is primarily affected by checkpoint overhead. Because parallel programs currently spend considerable time blocked on communication while running, this paper proposes a method, built on copy-on-write checkpoint buffering, to further reduce checkpoint overhead. By controlling the scheduling of the checkpointing (state-saving) thread and selecting a suitable state-saving granularity, the method uses communication blocking time to hide the overhead incurred by the checkpointing thread, and thus further reduces the overhead ratio.

Key words: Checkpointing and rollback recovery, Checkpoint overhead, Communication blocking time
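
The idea summarized in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation; it only assumes that buffered checkpoint pages are written out in chunks of a chosen granularity, that the checkpointing thread is allowed to start a chunk only while the application is blocked on communication, and that any part of a chunk running past the end of a blocking window delays the application. The function name `save_in_blocking_windows`, its parameters, and the cost model are all hypothetical.

```python
import math


def save_in_blocking_windows(blocking_windows, pages, page_cost,
                             granularity, chunk_setup=0):
    """Toy model (an assumption, not the paper's algorithm).

    blocking_windows: lengths of successive communication-blocking periods
    pages:            number of buffered checkpoint pages still to be saved
    page_cost:        time to save one page
    granularity:      pages saved per chunk by the checkpointing thread
    chunk_setup:      fixed cost paid once per chunk

    Returns (hidden, exposed): save time overlapped with blocking, and
    save time that delays the application.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least one page")
    remaining = pages
    hidden = 0.0
    exposed = 0.0
    for window in blocking_windows:
        free = window
        # The checkpointing thread starts a chunk only while the
        # application is still blocked on communication.
        while remaining > 0 and free > 0:
            chunk = min(granularity, remaining)
            cost = chunk_setup + chunk * page_cost
            hidden += min(cost, free)
            # A chunk that outlasts the window spills into compute time.
            exposed += max(0, cost - free)
            free -= cost
            remaining -= chunk
    if remaining > 0:
        # Pages left after the last window are saved with no overlap.
        chunks = math.ceil(remaining / granularity)
        exposed += chunks * chunk_setup + remaining * page_cost
    return hidden, exposed


if __name__ == "__main__":
    for g in (4, 1):
        print(g, save_in_blocking_windows([5, 5], pages=8, page_cost=1,
                                          granularity=g))
```

In this model, with two blocking windows of length 5 and eight pages of unit cost, a granularity of 4 leaves 3 time units exposed, while a granularity of 1 hides all 8. Setting a nonzero `chunk_setup` makes very small chunks expensive again, which is one way to picture why a suitable granularity has to be selected rather than simply minimized.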
