Computer Engineering ›› 2019, Vol. 45 ›› Issue (3): 20-25,31. doi: 10.19678/j.issn.1000-3428.0052626

Special topic: Cloud Computing and Big Data

Research on Join Operation of Temporal Big Data in Distributed Environment

ZHANG Wei 1, WANG Zhijie 2

  1. Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200240, China; 2. School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510006, China
  • Received: 2018-09-11 Online: 2019-03-15 Published: 2019-03-15
  • About the authors: ZHANG Wei (born 1990), male, master's degree, main research interests: big data analysis and distributed computing; WANG Zhijie, associate research fellow, Ph.D.
  • Funding:

    National Natural Science Foundation of China (U1636210, 61729202); Science and Technology Planning Project of Guangdong Province (2015A030401057, 2016B030307002).

Abstract:

Join operations on temporal big data are mostly processed on distributed systems, but existing distributed systems do not natively support temporal join queries and therefore cannot meet the low-latency, high-throughput processing requirements of temporal big data. To address this, a two-level-index, in-memory solution based on Spark is proposed. A global index prunes the distributed partitions, and a local temporal index serves queries within each partition, which improves data retrieval efficiency. A partitioning method designed for temporal data further optimizes global pruning. Experimental results on real and synthetic datasets show that, compared with baseline schemes, the proposed scheme significantly improves the processing efficiency of temporal join operations.

Key words: temporal big data, distributed in-memory computing, temporal join, two-level index, partition method, Spark framework
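
The two-level idea summarized in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation. It is a plain-Python approximation that assumes records are (start, end, payload) tuples with closed intervals, that partitions are formed by sorting on start time, and that the "global index" is simply each partition's time bounds. All names (Partition, build_partitions, probe, temporal_join) are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Partition:
    records: list  # (start, end, payload) tuples, sorted by start
    starts: list   # local index: the sorted start times of `records`
    lo: int        # global index entry: smallest start in the partition
    hi: int        # global index entry: largest end in the partition


def build_partitions(records, num_partitions):
    """Range-partition records by start time and record each partition's bounds."""
    ordered = sorted(records, key=lambda r: r[0])
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    parts = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        parts.append(Partition(
            records=chunk,
            starts=[r[0] for r in chunk],
            lo=chunk[0][0],
            hi=max(r[1] for r in chunk),
        ))
    return parts


def probe(parts, q_start, q_end):
    """Return records whose closed interval overlaps [q_start, q_end]."""
    hits = []
    for p in parts:
        # Global pruning: skip partitions whose bounds cannot overlap the query.
        if p.lo > q_end or p.hi < q_start:
            continue
        # Local index: only records starting at or before q_end can overlap.
        cut = bisect_right(p.starts, q_end)
        hits.extend(r for r in p.records[:cut] if r[1] >= q_start)
    return hits


def temporal_join(left, right, num_partitions=4):
    """Overlap join: pair each left record with every overlapping right record."""
    parts = build_partitions(right, num_partitions)
    return [(l, r) for l in left for r in probe(parts, l[0], l[1])]
```

For example, temporal_join([(1, 5, "a")], [(4, 6, "x"), (7, 9, "y")], num_partitions=2) pairs "a" only with "x", and the partition holding "y" is skipped by the bounds check without scanning its records. In a distributed setting the bounds table would typically be small enough to keep on the driver, while each partition's local index lives with its data. How the paper actually realizes either level is not stated in the abstract.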

CLC number: