MapReduce Parallel Join Algorithm Based on Column-store

doi:10.3969/j.issn.1000-3428.2014.08.014

Computer Engineering (计算机工程)

ZHANG Bin (张滨)^1,2, LE Jia-jin (乐嘉锦)^1

(1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China; 2. Zhejiang University of Finance & Economics, Hangzhou 310018, China)

Received: 2013-06-20 Online: 2014-08-15 Published: 2014-08-15
About the authors: ZHANG Bin (born 1978), male, Ph.D. candidate, main research interest: database technology; LE Jia-jin, professor and doctoral supervisor.
Funding: National Natural Science Foundation of China (61070031, 61070032); Scientific Research Fund of the Zhejiang Provincial Education Department (Y201225326).

Abstract

Big data is characterized by large scale, depth, velocity, commodity hardware, and open source software. Traditional relational databases suffer severe performance degradation, limited gains in computing efficiency, and poor scalability when operating on big data. To address these problems, this paper introduces the MapReduce parallel computing model and proposes a column-store-based MapReduce parallel join algorithm for big data. First, a distributed computing model for big data is designed, including MCF, a column-store file format for the MapReduce distributed environment, and a co-location strategy is adopted to optimize distributed storage. Second, partition aggregation and a sub-join heuristic optimization method are used to implement the parallel join algorithm in the MapReduce distributed environment. Experimental results show that, in big data analysis and processing, the algorithm performs well in execution time and load capacity and scales well.

Key words: big data, column-store, MapReduce model, MCF storage format, parallel join, heuristic optimization method
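The abstract gives no pseudocode, so the following is only a minimal illustrative sketch of the general idea of a join over column-oriented data in a MapReduce style. It is not the paper's algorithm: it does not model the MCF file format, the co-location strategy, partition aggregation, or the sub-join heuristics. The table contents and the function names (`map_phase`, `shuffle`, `reduce_phase`, `column_join`) are hypothetical, and the whole job is simulated in memory in a single process.

```python
from collections import defaultdict

# Hypothetical column-oriented tables: each column is stored as its own list,
# so a join only has to read the columns it actually needs.
orders = {
    "order_id": [1, 2, 3, 4],
    "cust_id": [10, 20, 10, 30],
    "amount": [5.0, 7.5, 2.0, 9.0],
}
customers = {
    "cust_id": [10, 20, 40],
    "name": ["Ann", "Bob", "Eve"],
    "city": ["Shanghai", "Hangzhou", "Beijing"],  # never read by the join below
}


def map_phase(table, tag, key_col, value_cols):
    """Emit (join_key, (tag, projected_values)), touching only the needed columns."""
    columns = [table[c] for c in value_cols]
    for i, key in enumerate(table[key_col]):
        yield key, (tag, tuple(col[i] for col in columns))


def shuffle(pairs):
    """Group mapper output by join key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the left and right records that share one join key."""
    left = [v for tag, v in values if tag == "L"]
    right = [v for tag, v in values if tag == "R"]
    for l_row in left:
        for r_row in right:
            yield (key,) + l_row + r_row


def column_join(left, right, key_col, left_cols, right_cols):
    """Run the simulated map, shuffle, and reduce steps for an equi-join."""
    pairs = list(map_phase(left, "L", key_col, left_cols))
    pairs += list(map_phase(right, "R", key_col, right_cols))
    result = []
    for key, values in sorted(shuffle(pairs).items()):
        result.extend(reduce_phase(key, values))
    return result


joined = column_join(orders, customers, "cust_id", ["order_id", "amount"], ["name"])
for row in joined:
    print(row)
```

In this sketch the `city` column is never loaded by the mappers, which is the basic saving a column layout offers over a row layout when a query touches only a few attributes. In a real cluster, each reducer would handle one partition of the join keys in parallel.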

CLC number: TP181

ZHANG Bin, LE Jia-jin. MapReduce Parallel Join Algorithm Based on Column-store[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.08.014.

http://www.ecice06.com/CN/Y2014/V40/I8/70

Editor: SUO Shuzhi (索书志)

