Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework

Computer Engineering ›› 2021, Vol. 47 ›› Issue (4): 173-179. doi: 10.19678/j.issn.1000-3428.0057721

ZHANG Ji, XIE Zaipeng, MAO Yingchi, XU Yuanyuan, ZHU Xiaorui, LI Bowen

School of Computer and Information, Hohai University, Nanjing 211100, China

Received: 2020-03-13 Revised: 2020-04-28 Published: 2020-05-09
About the authors: ZHANG Ji (born 1997), male, master's student, main research interest: distributed computing; XIE Zaipeng, associate professor, Ph.D.; MAO Yingchi, professor, Ph.D.; XU Yuanyuan and ZHU Xiaorui, Ph.D.; LI Bowen, master's student.
Funding:
Key Program of the National Natural Science Foundation of China (61832005); National Key Research and Development Program of China (2016YFC0402710).

Abstract

Abstract: As distributed systems grow in scale and computational complexity, both the Mean Time to Repair (MTTR) of distributed computing and the communication overhead caused by fault-tolerant computing keep rising. To address this, the paper combines distributed coding computing with replica redundancy and proposes a new fault-tolerant algorithm. Map nodes apply the idea of distributed coding computing: data is redundantly assigned to multiple computing nodes, which create coded intermediate results, reducing the amount of data that the computing nodes transmit in the shuffle phase. Reduce nodes decode the coded intermediate results they receive, which both verifies the correctness of the intermediate results and yields the final computing result. Experimental results show that, in a MapReduce-based distributed computing framework, the proposed algorithm completes fault-tolerant computation while reducing communication overhead and MTTR compared with the Triple Modular Redundancy (TMR) and two-stage TMR fault-tolerant algorithms, and that it improves the availability and reliability of the distributed system.

Key words: distributed system, distributed computing, fault-tolerant algorithm, distributed coding computing, Triple Modular Redundancy (TMR)
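The abstract states that redundantly placed data lets map nodes build coded intermediate results that shrink shuffle traffic. The sketch below illustrates only that general coded-shuffle idea, using the standard XOR construction for 3 nodes, 6 files, and each file stored on 2 nodes. It is not the authors' algorithm: the file placement, the `map_value` function, the broadcast schedule, and the assumption that one broadcast reaches both other nodes are all hypothetical choices made for illustration, and the verification step described in the abstract is not modeled.

```python
NODES = (1, 2, 3)          # node q also runs reducer q (assumption)
ALL_FILES = range(1, 7)    # six input files (assumption)

# Hypothetical placement: every file is stored on exactly 2 of the 3 nodes.
FILES_ON_NODE = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {5, 6, 1, 2}}

# Hypothetical broadcast schedule: the sender transmits v[key_a] XOR v[key_b].
# Each key is (reducer q, file n).
BROADCASTS = [
    (1, (2, 1), (3, 3)),
    (2, (3, 4), (1, 5)),
    (3, (1, 6), (2, 2)),
]


def map_value(q: int, n: int) -> int:
    """Stand-in for the intermediate value of reduce function q on file n."""
    return (q * 31 + n * 17) % 251


def local_values(node: int) -> dict:
    """Every intermediate value a node can compute from the files it stores."""
    return {(q, n): map_value(q, n) for q in NODES for n in FILES_ON_NODE[node]}


def coded_shuffle() -> dict:
    """Run the three coded broadcasts and return what each node knows afterwards."""
    known = {node: local_values(node) for node in NODES}
    for sender, key_a, key_b in BROADCASTS:
        packet = known[sender][key_a] ^ known[sender][key_b]
        for receiver in NODES:
            if receiver == sender:
                continue
            have = known[receiver]
            # The receiver already holds one operand, so XOR recovers the other.
            if key_a in have and key_b not in have:
                have[key_b] = packet ^ have[key_a]
            elif key_b in have and key_a not in have:
                have[key_a] = packet ^ have[key_b]
    return known


def uncoded_transmissions() -> int:
    """Values that would be sent one by one if nothing were coded."""
    return sum(1 for q in NODES for n in ALL_FILES if n not in FILES_ON_NODE[q])


if __name__ == "__main__":
    known = coded_shuffle()
    complete = all(
        known[q][(q, n)] == map_value(q, n) for q in NODES for n in ALL_FILES
    )
    print("every reducer has all six inputs:", complete)
    print("uncoded transmissions:", uncoded_transmissions())
    print("coded broadcasts:", len(BROADCASTS))
```

In this toy setting each reducer is missing two intermediate values, so an uncoded shuffle sends six values, while three XOR packets deliver the same information because every packet is useful to two receivers at once.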

CLC number: TP338.8

ZHANG Ji, XIE Zaipeng, MAO Yingchi, XU Yuanyuan, ZHU Xiaorui, LI Bowen. Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework[J]. Computer Engineering, 2021, 47(4): 173-179.

https://www.ecice06.com/CN/Y2021/V47/I4/173

Figures/Tables: 7

