Research on Join Operation of Temporal Big Data in Distributed Environment

doi:10.19678/j.issn.1000-3428.0052626

Computer Engineering, 2019, Vol. 45, Issue (3): 20-25,31. doi: 10.19678/j.issn.1000-3428.0052626

Special topic: Cloud Computing and Big Data

ZHANG Wei¹, WANG Zhijie²

1. Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200240, China; 2. School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510006, China

Received: 2018-09-11 Online: 2019-03-15 Published: 2019-03-15
About the authors: ZHANG Wei (born 1990), male, M.S., whose main research interests are big data analysis and distributed computing; WANG Zhijie, associate researcher, Ph.D.
Funding:
National Natural Science Foundation of China (U1636210, 61729202); Science and Technology Planning Project of Guangdong Province (2015A030401057, 2016B030307002).

Abstract:

Join operations on temporal big data are currently processed mostly with distributed systems, but existing distributed systems do not natively support temporal join queries and cannot meet the low-latency, high-throughput processing requirements of temporal big data. To address this, an in-memory two-level index solution based on Spark is proposed. A global index is used to prune distributed partitions, and a local temporal index is used to query within each partition, which improves data retrieval efficiency. A partitioning method is designed for temporal data to optimize global pruning. Experimental results on real and synthetic datasets show that, compared with the baseline scheme, the proposed scheme significantly improves the processing efficiency of temporal join operations.

Key words: temporal big data, distributed in-memory computing, temporal join, two-level index, partitioning method, Spark framework
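The two-level idea described in the abstract (a global index that prunes partitions, plus a local temporal index inside each partition) can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: it is plain Python, all function names are hypothetical, intervals are assumed to be closed `(start, end, payload)` tuples, and range partitioning by start time is an assumption, since the abstract does not specify the partitioning method.

```python
from bisect import bisect_right
from typing import List, Tuple

# Assumed record shape: (start, end, payload) with a closed interval [start, end].
Interval = Tuple[int, int, str]


def partition_by_start(records: List[Interval], num_partitions: int) -> List[List[Interval]]:
    """Range-partition records by start time into roughly equal-sized chunks (assumed scheme)."""
    ordered = sorted(records)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def build_global_index(partitions: List[List[Interval]]) -> List[Tuple[int, int]]:
    """Global index: one (min_start, max_end) bound per partition."""
    return [(min(r[0] for r in p), max(r[1] for r in p)) for p in partitions]


def local_probe(partition: List[Interval], starts: List[int], q_start: int, q_end: int) -> List[Interval]:
    """Local index: the partition is sorted by start, so binary search drops
    every record that starts after the query ends; the rest are filtered by end time."""
    hi = bisect_right(starts, q_end)
    return [r for r in partition[:hi] if r[1] >= q_start]


def temporal_join(left: List[Interval], right: List[Interval], num_partitions: int = 4) -> List[Tuple[str, str]]:
    """Return payload pairs whose intervals overlap."""
    partitions = partition_by_start(right, num_partitions)
    global_index = build_global_index(partitions)
    local_starts = [[r[0] for r in p] for p in partitions]
    result = []
    for l_start, l_end, l_payload in left:
        for pid, (lo, hi) in enumerate(global_index):
            if l_start > hi or l_end < lo:
                continue  # global pruning: this partition cannot contain a match
            for r in local_probe(partitions[pid], local_starts[pid], l_start, l_end):
                result.append((l_payload, r[2]))
    return result


left = [(1, 5, "a"), (10, 12, "b"), (20, 30, "c")]
right = [(0, 2, "x"), (4, 11, "y"), (13, 19, "z"), (25, 26, "w")]
pairs = temporal_join(left, right, num_partitions=2)
```

In a distributed setting the global bounds would typically live on the driver while each partition holds its own local index; the sketch keeps both in one process only to show how the two pruning steps compose.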

CLC number: TP391

ZHANG Wei, WANG Zhijie. Research on Join Operation of Temporal Big Data in Distributed Environment[J]. Computer Engineering, 2019, 45(3): 20-25,31.

https://www.ecice06.com/CN/Y2019/V45/I3/20

