摘要: 检查点设置和卷回恢复是提高系统可靠性和实现容错计算的有效途径,其性能通常用开销率来评价,而检查点开销是影响开销率的主要因素。针对目前并行程序运行时存在较多通信阻塞时间的现状,该文在写时复制检查点缓存的基础上提出了一种进一步降低检查点开销的方法。通过控制状态保存线程的调度和选择合适的状态保存粒度,该方法能很好地利用通信阻塞时间隐藏状态保存线程运行时带来的开销,从而能进一步降低开销率。
关键词: 检查点设置和卷回恢复; 检查点开销; 通信阻塞时间
Abstract: Checkpointing and rollback recovery is an effective way to improve system reliability and implement fault-tolerant computing. Its performance is usually evaluated by the overhead ratio, which is primarily affected by checkpoint overhead. Because parallel programs currently spend considerable time blocked on communication, this paper proposes a method, built on copy-on-write checkpoint buffering, to further reduce checkpoint overhead. By controlling the scheduling of the state-saving (checkpointing) thread and selecting a suitable state-saving granularity, the method uses communication blocking time to hide the overhead incurred by the running checkpointing thread, and thus further reduces the overhead ratio.
Key words: checkpointing and rollback recovery; checkpoint overhead; communication blocking time
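
The idea summarized in the abstract, letting a state-saving thread run only while the application is blocked on communication and saving in bounded chunks, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StateSaver`, `blocking_comm`, and `run_demo` are hypothetical, an in-memory byte string stands in for the copy-on-write checkpoint buffer, and `time.sleep` stands in for a blocking receive.

```python
import io
import threading
import time


class StateSaver(threading.Thread):
    """Hypothetical state-saving thread: copies a buffered checkpoint image
    to `sink`, one chunk at a time, only while communication is blocked."""

    def __init__(self, snapshot, sink, granularity):
        super().__init__(daemon=True)
        self.snapshot = snapshot        # stand-in for the copy-on-write buffer
        self.sink = sink                # file-like object (stable storage)
        self.granularity = granularity  # bytes saved per scheduling step
        self.offset = 0                 # touched only by this thread
        self.comm_blocked = threading.Event()

    def run(self):
        while self.offset < len(self.snapshot):
            # Sleep unless the application reports a blocking window.
            self.comm_blocked.wait()
            chunk = self.snapshot[self.offset:self.offset + self.granularity]
            self.sink.write(chunk)
            self.offset += len(chunk)


def blocking_comm(saver, seconds):
    """Assumed wrapper around a blocking receive (simulated with sleep).
    The saver is allowed to run only for the duration of the call."""
    saver.comm_blocked.set()
    try:
        time.sleep(seconds)
    finally:
        saver.comm_blocked.clear()


def run_demo():
    snapshot = bytes(range(256)) * 64          # 16 KiB of fake process state
    sink = io.BytesIO()
    saver = StateSaver(snapshot, sink, granularity=1024)
    saver.start()
    for _ in range(3):
        blocking_comm(saver, 0.01)             # saver may make progress here
        sum(i * i for i in range(10_000))      # compute phase: saver is paused
    saver.comm_blocked.set()                   # flush whatever is left
    saver.join()
    return snapshot, sink.getvalue()


if __name__ == "__main__":
    original, saved = run_demo()
    print("checkpoint complete:", saved == original)
```

In this sketch the granularity bounds how much saving work can spill into a compute phase: the thread checks the flag only between chunks, so at most one chunk is written after a blocking window closes. Smaller chunks reduce that spill at the cost of more frequent checks.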
周小成;孙凝晖;霍志刚;马 捷. 一种降低并行程序检查点开销的方法[J]. 计算机工程, 2007, 33(12): 84-86.
ZHOU Xiaocheng; SUN Ninghui; HUO Zhigang; MA Jie. Method for Reducing Checkpoint Overhead of Parallel Program[J]. Computer Engineering, 2007, 33(12): 84-86.