Hadoop Task Selection Strategy Based on Disk I/O Performance

doi:10.3969/j.issn.1000-3428.2016.11.013

Computer Engineering (计算机工程)

LI Qiang^1,2, SUN Zhenyu^1,2, LEI Xiaofeng^1,2, SUN Gongxing^1

(1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China)

Received: 2015-10-10 Online: 2016-11-15 Published: 2016-11-15
About the authors: LI Qiang (born 1988), male, Ph.D. candidate, whose main research interest is distributed computing; SUN Zhenyu and LEI Xiaofeng, Ph.D. candidates; SUN Gongxing, research fellow and doctoral supervisor.
Funding: National Natural Science Foundation of China (11375223, 11375221); Joint Fund for Large-Scale Scientific Facilities of the National Natural Science Foundation of China and the Chinese Academy of Sciences (11179020).

Abstract

Abstract: Making maximum use of local disk I/O resources is key to improving the performance of a computing cluster, but most scheduling algorithms in Hadoop do not take this factor into account. To address this, a new task selection strategy is proposed that introduces disk workload as a trade-off parameter in Map task selection: during task scheduling, the scheduler consults the workload of each disk and chooses a suitable task, so that the workload across the disks of a data node stays relatively balanced. A new task selection module based on this strategy is designed and integrated into the Hadoop scheduler. To further improve the performance of the Hadoop system, nearly fully localized execution of Map jobs is also implemented. Experimental results show that the strategy makes full use of the local disk I/O resources of data nodes, reducing I/O Wait by 5% on average, increasing CPU utilization by 15% on average, and shortening job execution time by 20%.

Key words: Hadoop system, scheduling algorithm, data locality, task selection strategy, disk workload, I/O performance
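
The idea of using disk workload as a selection parameter can be illustrated with a minimal Python sketch. It is not the authors' module and does not use Hadoop's scheduler API: the `MapTask` type, the `replica_disks` mapping, and the load metric (number of tasks currently reading from each disk) are all assumptions made for illustration, since the abstract does not define them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MapTask:
    task_id: str
    # Hypothetical field: node name -> disk holding a replica of the input block.
    replica_disks: Dict[str, str]


def select_map_task(node: str,
                    pending: List[MapTask],
                    disk_load: Dict[str, int]) -> Optional[MapTask]:
    """Pick the node-local pending task whose input sits on the least-loaded disk.

    disk_load counts tasks currently reading from each disk of `node`
    (an assumed load metric; the abstract does not specify one).
    """
    local = [t for t in pending if node in t.replica_disks]
    if not local:
        return None  # no node-local task; a real scheduler would fall back here
    # min() returns the first minimum, so ties keep queue order.
    return min(local, key=lambda t: disk_load.get(t.replica_disks[node], 0))


def assign(node: str,
           pending: List[MapTask],
           disk_load: Dict[str, int]) -> Optional[MapTask]:
    """Select a task, remove it from the queue, and charge its disk."""
    task = select_map_task(node, pending, disk_load)
    if task is not None:
        pending.remove(task)
        disk = task.replica_disks[node]
        disk_load[disk] = disk_load.get(disk, 0) + 1
    return task
```

With two pending tasks on disk `d1` and one on `d2`, successive calls to `assign` alternate between the disks instead of draining `d1` first, which is the balancing effect the abstract describes. A real implementation would also decrement the load when a task finishes and would likely use measured I/O statistics rather than a task count.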

CLC number: TP391

LI Qiang, SUN Zhenyu, LEI Xiaofeng, SUN Gongxing. Hadoop Task Selection Strategy Based on Disk I/O Performance[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2016/V42/I11/76

Editor: JIN Hukao

