Parallel Processing Algorithms for Facts Based on Hadoop Platform

doi:10.3969/j.issn.1000-3428.2014.03.012

Computer Engineering

SUN Li, HE Gang, LI Ji-yun

(School of Computer Science and Technology, Donghua University, Shanghai 201620, China)

Received: 2013-09-02 Online: 2014-03-15 Published: 2014-03-13

About the authors: SUN Li (b. 1964), female, associate professor, Ph.D.; main research interests: database technology, object-oriented analysis and design. HE Gang, master's student. LI Ji-yun, associate professor, Ph.D.

Abstract

Abstract: Traditional Extract, Transform, Load (ETL) tools are inefficient when faced with the massive fact data of a data warehouse. To address this, two algorithms for processing facts in parallel are designed and implemented on the Hadoop platform, approaching the problem from two angles: surrogate key lookup for the fact table and pre-aggregation of facts at multiple granularities. The first is a multi-way parallel lookup algorithm over slowly changing dimension tables; the second aggregates fact data at different granularities. The first algorithm handles both slowly changing dimensions and big dimensions. It uses the distributed cache to copy small dimension tables into the memory of every data node, and it partitions the fact data and the big dimension data with the same partition function, which solves the problem of insufficient memory. The multi-way surrogate key lookup is carried out in the Map stage, which avoids the network delay caused by data transmission. The second algorithm adds a Merge stage after the Reduce stage, which effectively solves the problem of aggregating fact data at different granularities. Experimental results show that, compared with the Hive data warehouse, both algorithms process the fact data of a data warehouse in parallel more efficiently.
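The first algorithm can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives no code, so the row layouts, the names `build_scd_index`, `lookup`, `partition` and `run_lookup_job`, the CRC32 partition function, and the sample data are all hypothetical. It shows a small dimension held fully in memory (standing in for the distributed cache), a big dimension partitioned with the same function as the facts, and both surrogate keys resolved inside the map step.

```python
import zlib
from collections import defaultdict

# Hypothetical row layouts (not from the paper):
#   dimension row: (business_key, valid_from, surrogate_key), one row per version
#   fact row:      (product_id, customer_id, date, amount)
PRODUCT_DIM = [("p1", "2013-01-01", 1), ("p1", "2013-06-01", 2), ("p2", "2013-01-01", 3)]
CUSTOMER_DIM = [("c1", "2013-01-01", 100), ("c2", "2013-01-01", 101), ("c2", "2013-07-01", 102)]
FACTS = [("p1", "c1", "2013-03-15", 10.0),
         ("p1", "c2", "2013-08-01", 5.0),
         ("p2", "c2", "2013-02-01", 7.5)]


def build_scd_index(dimension_rows):
    """Group the versions of a slowly changing dimension by business key,
    sorted by the date each version became valid."""
    index = defaultdict(list)
    for business_key, valid_from, surrogate_key in dimension_rows:
        index[business_key].append((valid_from, surrogate_key))
    for versions in index.values():
        versions.sort()
    return index


def lookup(index, business_key, fact_date):
    """Return the surrogate key of the latest version valid on fact_date,
    or None if the key is unknown or no version was valid yet."""
    found = None
    for valid_from, surrogate_key in index.get(business_key, []):
        if valid_from <= fact_date:   # ISO dates compare correctly as strings
            found = surrogate_key
        else:
            break
    return found


def partition(key, num_partitions):
    """One deterministic partition function shared by facts and the big dimension."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_lookup_job(facts, small_dim, big_dim, num_partitions):
    # Small dimension: indexed once and visible to every map task
    # (an in-memory stand-in for the distributed cache).
    small_index = build_scd_index(small_dim)

    # Facts and big dimension rows are co-partitioned on the same key,
    # so each map task only loads its own slice of the big dimension.
    fact_parts, dim_parts = defaultdict(list), defaultdict(list)
    for fact in facts:
        fact_parts[partition(fact[1], num_partitions)].append(fact)
    for row in big_dim:
        dim_parts[partition(row[0], num_partitions)].append(row)

    output = []
    for p in range(num_partitions):          # one iteration = one map task
        big_index = build_scd_index(dim_parts[p])
        for product_id, customer_id, date, amount in fact_parts[p]:
            output.append((lookup(small_index, product_id, date),
                           lookup(big_index, customer_id, date),
                           date, amount))
    return output
```

Because a fact and the matching big dimension rows always hash to the same partition, no lookup in this sketch needs data from another task, which is the property the abstract credits with avoiding network transfer.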

Key words: MapReduce model, dimension, fact, surrogate key, parallel lookup, aggregation
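The second algorithm, a Merge stage placed after Reduce, can be sketched in the same way. The abstract does not say how the Merge stage works, so the following is one plausible reading and an assumption throughout: Reduce aggregates at the finest granularity, and Merge rolls those partial results up to coarser granularities without rescanning the raw facts. The function names, the day/month/year hierarchy, and the fact layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical granularity hierarchy: each entry truncates an ISO date string.
GRANULARITIES = {
    "day": lambda date: date,        # 2013-03-01
    "month": lambda date: date[:7],  # 2013-03
    "year": lambda date: date[:4],   # 2013
}


def map_phase(facts):
    """Emit ((date, product_key), amount) for each fact row."""
    for date, product_key, amount in facts:
        yield (date, product_key), amount


def reduce_phase(pairs):
    """Sum amounts per key at the finest granularity (day)."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


def merge_phase(daily_totals):
    """Roll the reducer output up to every granularity in one pass.
    Works because SUM is distributive; other aggregates may need more state."""
    merged = {name: defaultdict(float) for name in GRANULARITIES}
    for (date, product_key), total in daily_totals.items():
        for name, truncate in GRANULARITIES.items():
            merged[name][(truncate(date), product_key)] += total
    return {name: dict(totals) for name, totals in merged.items()}


def aggregate(facts):
    return merge_phase(reduce_phase(map_phase(facts)))
```

Calling `aggregate` on rows of the form `(date, product_key, amount)` returns one dictionary of totals per granularity. The point of the design in this sketch is that the coarser totals are derived from the already reduced daily totals rather than from a separate job per granularity.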

CLC number: TP311

SUN Li, HE Gang, LI Ji-yun. Parallel Processing Algorithms for Facts Based on Hadoop Platform[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2014.03.012.

http://www.ecice06.com/CN/Y2014/V40/I3/59

