Review of Multi-table Join Order Selection in Databases Based on Machine Learning

doi:10.19678/j.issn.1000-3428.0068808

Abstract:

Multi-table join order selection is the task, during query optimization, of choosing the best order in which to join the tables referenced by a query so as to improve execution performance. In complex queries, different join orders can significantly affect query efficiency. In the era of big data, traditional join order selection algorithms, which are typically based on heuristic rules, are challenged by massive datasets, diverse application environments, and complex queries. Because they cannot adapt dynamically to their environment or improve through learning, they generalize poorly and often choose suboptimal join orders that can severely degrade query performance. With the rapid development of machine learning, Artificial Intelligence for Databases (AI4DB) has become a leading direction in query optimization. Machine learning techniques can address the limitations of traditional join order selection algorithms and perform well at self-learning and scenario adaptation. This survey first introduces the traditional join order selection algorithms and identifies their problems. It then summarizes the current mainstream machine learning models for multi-table joins, describes the core technical approach of each, and compares them in terms of effectiveness and applicable scenarios, providing a useful reference for future researchers in this field.

Key words: database, query optimization, machine learning, join order, Artificial Intelligence for Databases (AI4DB)
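To make concrete the claim that different join orders can significantly affect query efficiency, the following is a minimal sketch that enumerates all left-deep join orders for three tables and scores each one. The table names, cardinalities, selectivities, and the cost model (the sum of intermediate result sizes) are illustrative assumptions and are not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical base-table row counts (assumed for illustration only).
CARD = {"A": 1000, "B": 100, "C": 10}

# Hypothetical join selectivities; pairs without an entry have no join
# predicate, so joining them directly is a cross product (selectivity 1).
SEL = {
    frozenset({"A", "B"}): Fraction(1, 100),
    frozenset({"B", "C"}): Fraction(1, 10),
}


def selectivity(t1, t2):
    """Return the assumed selectivity of the predicate between two tables."""
    return SEL.get(frozenset({t1, t2}), Fraction(1))


def plan_cost(order):
    """Cost of a left-deep plan: the sum of all intermediate result sizes."""
    joined = [order[0]]
    rows = Fraction(CARD[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        # Estimated size after joining `table` with everything joined so far.
        rows *= CARD[table]
        for other in joined:
            rows *= selectivity(table, other)
        joined.append(table)
        cost += rows
    return cost


def best_order(tables):
    """Exhaustively pick the cheapest left-deep order (feasible for few tables)."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(CARD):
        print(" -> ".join(order), "cost =", plan_cost(order))
    print("cheapest:", " -> ".join(best_order(CARD)))
```

Under these assumed numbers, the order that starts with the cross product of A and C costs ten times as much as the cheapest order. Exhaustive enumeration grows factorially with the number of tables, which is why practical optimizers fall back on heuristics or, as this survey discusses, learned models.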

WANG Hao, GAO Jintao, WANG Jie. Review of Multi-table Join Order Selection in Databases Based on Machine Learning[J]. Computer Engineering, 2025, 51(7): 31-46.

https://www.ecice06.com/CN/Y2025/V51/I7/31

Figures/Tables: 9

Fig.1 Query processing flow

Fig.2 Research approach to database table join order

Fig.3 RTOS framework

Fig.4 Bao execution flow

Fig.5 ReJoin execution flow

Fig.6 Comparison of experimental results

