Computer Engineering ›› 2021, Vol. 47 ›› Issue (4): 173-179. doi: 10.19678/j.issn.1000-3428.0057721

• Architecture and Software Technology •

Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework

ZHANG Ji, XIE Zaipeng, MAO Yingchi, XU Yuanyuan, ZHU Xiaorui, LI Bowen

  1. School of Computer and Information, Hohai University, Nanjing 211100, China
  • Received: 2020-03-13 Revised: 2020-04-28 Published: 2020-05-09
  • About the authors: ZHANG Ji (born 1997), male, master's student, main research interest: distributed computing; XIE Zaipeng, associate professor, Ph.D.; MAO Yingchi, professor, Ph.D.; XU Yuanyuan and ZHU Xiaorui, Ph.D.; LI Bowen, master's student.
  • Funding:
    Key Program of the National Natural Science Foundation of China (61832005); National Key Research and Development Program of China (2016YFC0402710).

Abstract: As distributed systems grow in scale and computational complexity, both the Mean Time to Repair (MTTR) of distributed computing and the communication overhead incurred by fault-tolerant computing keep rising. To address these problems, this paper combines distributed coding computing with replica redundancy and proposes a new fault-tolerant algorithm. Map nodes apply the idea of distributed coding computing: data is redundantly assigned to multiple computing nodes to create coded intermediate results, which reduces the amount of data the computing nodes transmit in the shuffle phase. Reduce nodes decode the coded intermediate results they receive, thereby verifying the correctness of the intermediate results and obtaining the final computing result. Experimental results show that, in a MapReduce-based distributed computing framework, the proposed algorithm completes fault-tolerant computation while reducing communication overhead and MTTR compared with the Triple Modular Redundancy (TMR) and two-stage TMR fault-tolerant algorithms, and it improves the availability and reliability of the distributed system.

Key words: distributed system, distributed computing, fault-tolerant algorithm, distributed coding computing, Triple Modular Redundancy (TMR)
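
The abstract does not spell out the algorithm itself, so the following is only an illustration of the general coded-shuffle idea it refers to, not the authors' method. It is a minimal Python sketch under stated assumptions: three nodes, six toy files each stored on exactly two nodes, integer intermediate values, and XOR as the code. All names (`FILES`, `PLACEMENT`, `map_value`, `coded_message`, `reduce_at`) are hypothetical. The paper's correctness verification and fault handling are not reproduced here; the final comparison against a direct computation is only a sanity check for the demo.

```python
from itertools import combinations

NODES = (0, 1, 2)

# Hypothetical toy input: six "files", each stored on exactly two nodes.
FILES = {
    0: "map reduce shuffle phase",
    1: "coded intermediate results",
    2: "fault tolerant computing",
    3: "replica redundancy and decoding",
    4: "mean time to repair",
    5: "communication overhead in clusters",
}
PLACEMENT = {0: (0, 1), 1: (0, 1), 2: (0, 2), 3: (0, 2), 4: (1, 2), 5: (1, 2)}


def map_value(file_id, reducer):
    """Intermediate value: number of words whose length mod 3 equals the reducer id."""
    return sum(1 for w in FILES[file_id].split() if len(w) % 3 == reducer)


def local_values(node):
    """All intermediate values a node can compute from the files it stores."""
    return {(q, f): map_value(f, q)
            for f, holders in PLACEMENT.items() if node in holders
            for q in NODES}


def coded_message(sender):
    """XOR two values, each needed by one other node, into a single multicast."""
    vals = local_values(sender)
    parts = []
    for pair in combinations(NODES, 2):
        if sender not in pair:
            continue
        shared = sorted(f for f, h in PLACEMENT.items() if h == pair)
        f = shared[pair.index(sender)]            # this holder's share of the pair's files
        target = next(n for n in NODES if n not in pair)  # node lacking file f
        parts.append((target, f))
    code = 0
    for key in parts:
        code ^= vals[key]
    return parts, code


def reduce_at(node, messages):
    """Decode the two coded messages from the other nodes, then reduce (sum)."""
    vals = local_values(node)
    for sender, (parts, code) in messages.items():
        if sender == node:
            continue
        for q, f in parts:
            if q == node:
                wanted = (q, f)                   # the value this node is missing
            else:
                code ^= vals[(q, f)]              # cancel the value it already knows
        vals[wanted] = code
    return sum(vals[(node, f)] for f in FILES)


messages = {k: coded_message(k) for k in NODES}
results = [reduce_at(k, messages) for k in NODES]
expected = [sum(map_value(f, q) for f in FILES) for q in NODES]
uncoded_transfers = sum(1 for k in NODES for h in PLACEMENT.values() if k not in h)
```

In this toy setting each node is missing two files, so an uncoded shuffle would need `uncoded_transfers` point-to-point values, whereas the coded shuffle sends one XOR-ed message per node (`len(messages)`), each useful to both receivers. The storage redundancy that makes the decoding possible is the same redundancy a replica-based fault-tolerance scheme relies on.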

CLC number: