
Computer Engineering

   

Lattice Structure-Based Distributed Table Join Optimization

  

Published: 2025-07-03


Abstract: In distributed computing frameworks, inefficient data transfer during the Shuffle phase has become a key bottleneck for table joins. Existing join methods have notable limitations: both broadcast joins and hash joins in Spark are susceptible to data skew, which leads to load imbalance across nodes. To address this problem, this paper focuses on join-aggregate queries and proposes a table join method based on a lattice structure. Table partition data is precomputed and stored in the form of a lattice, and the convex-set property of equivalence classes is exploited: any data cell that contains the upper bound of an equivalence class and is contained by its lower bound has an aggregate value equal to that of the equivalence class. This allows the query cells generated by mapping a query statement onto the lattice to be matched and computed quickly. Because query cells are a compressed form of the base table data, they are smaller in size and more uniformly distributed; the paper therefore transfers and joins query cells instead of base table data, greatly reducing the volume of data shuffled and the computational complexity. The proposed method has been implemented in Spark, and experiments on the TPC-H dataset show that it reduces Shuffle data by about 45.06% in large-dataset scenarios, balances the workload across nodes better than the baseline methods, and shortens query response time by 14.23% on average.
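The central idea, precomputing partition data as a lattice of aggregate cells and answering join-aggregate queries from matched cells rather than from base rows, can be illustrated with a minimal single-node sketch. All names here are hypothetical and the measure is a simple SUM; the paper's actual Spark implementation, equivalence-class bounds, and convex-set matching are not reproduced:

```python
from collections import defaultdict
from itertools import combinations

def build_lattice(rows, dims, measure):
    """Precompute one aggregate cell per (dimension-subset, value) combination,
    forming the lattice of group-bys over one partition's rows."""
    cells = defaultdict(int)
    for row in rows:
        # Every subset of the dimensions is one level of the lattice.
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                key = tuple((d, row[d]) for d in subset)
                cells[key] += row[measure]
    return dict(cells)

def answer_from_lattice(cells, dims, conditions):
    """Answer an aggregate query by looking up its query cell in the lattice,
    instead of scanning (and shuffling) the base rows."""
    key = tuple((d, conditions[d]) for d in dims if d in conditions)
    return cells.get(key, 0)

# Toy partition: three fact rows.
rows = [
    {"region": "EU", "year": 2024, "sales": 10},
    {"region": "US", "year": 2024, "sales": 5},
    {"region": "EU", "year": 2025, "sales": 7},
]
lattice = build_lattice(rows, ["region", "year"], "sales")
print(answer_from_lattice(lattice, ["region", "year"], {"region": "EU"}))  # 17
print(answer_from_lattice(lattice, ["region", "year"], {}))                # 22
```

In a distributed setting, only these small precomputed cells would need to cross the network during the Shuffle phase, which is what yields the reported reduction in Shuffle volume relative to moving the skewed base table rows themselves.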
