Text-to-SQL Approach Based on Hierarchical Entity Indexing

doi:10.19678/j.issn.1000-3428.02521022

Abstract

Text-to-SQL technology aims to lower the barrier to database querying by letting non-technical users interact with databases in natural language. Existing approaches face two major challenges. First, large language models have limited ability to generate complex SQL queries. Second, databases in real-world production environments are often large, and feeding the complete database schema to the model produces excessively long prompts, raises computational cost, and reduces generation accuracy. The gap between the simplicity of traditional benchmark datasets and the complexity of real-world scenarios makes this problem worse. To address these problems, this study proposes a Text-to-SQL method based on hierarchical entity indexing. The core idea is to improve retrieval-augmented generation by dynamically selecting the database information relevant to the user's query, so that the prompt carries more complete background knowledge. Experiments on open-source datasets and production data verify the effectiveness of the proposed approach. On the Spider leaderboard, the SQL generation accuracy of the method is only 0.4 percentage points lower than that of the top-ranked (unpublished) approach and 4.2 percentage points higher than that of the second-ranked method. Future research directions include refining the entity partitioning strategy and optimizing the index architecture to support real-time retrieval in ultra-large-scale databases. This work provides an efficient and scalable solution for practical Text-to-SQL systems.
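
The abstract names the core idea (select only the schema information relevant to the question, then build the prompt from it) without giving implementation details. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the two-level index (entity, then tables), the keyword-overlap scoring that stands in for embedding-based retrieval, and all names (`Entity`, `Table`, `retrieve_entities`, `filter_tables`, `build_prompt`, `EXAMPLE_INDEX`) are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: list[str]


@dataclass
class Entity:
    """Hypothetical top-level index node: a business entity grouping related tables."""
    name: str
    keywords: set[str]
    tables: list[Table] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; underscores split identifiers such as order_id."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_entities(question: str, index: list[Entity], top_k: int = 1) -> list[Entity]:
    """Level 1: rank entities by keyword overlap (a stand-in for embedding similarity)."""
    q = tokenize(question)
    scored = [(len(q & e.keywords), e) for e in index]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


def filter_tables(question: str, entity: Entity) -> list[Table]:
    """Level 2: keep tables whose name or columns share a token with the question."""
    q = tokenize(question)
    hits = []
    for table in entity.tables:
        vocab = tokenize(table.name)
        for column in table.columns:
            vocab |= tokenize(column)
        if q & vocab:
            hits.append(table)
    # Assumption: fall back to the whole entity rather than send an empty schema.
    return hits or entity.tables


def build_prompt(question: str, index: list[Entity], top_k: int = 1) -> str:
    """Assemble a prompt that contains only the retrieved part of the schema."""
    lines = ["Schema:"]
    for entity in retrieve_entities(question, index, top_k):
        for table in filter_tables(question, entity):
            lines.append(f"{table.name}({', '.join(table.columns)})")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


# Toy schema, invented for illustration only.
EXAMPLE_INDEX = [
    Entity(
        name="sales",
        keywords={"order", "orders", "revenue", "customer", "customers"},
        tables=[
            Table("orders", ["order_id", "customer_id", "total_amount", "order_date"]),
            Table("customers", ["customer_id", "name", "city"]),
        ],
    ),
    Entity(
        name="hr",
        keywords={"employee", "employees", "salary", "department"},
        tables=[Table("employees", ["employee_id", "name", "salary", "department_id"])],
    ),
]

if __name__ == "__main__":
    print(build_prompt("What is the total amount of orders per customer?", EXAMPLE_INDEX))
```

In this toy setup the question about orders matches only the `sales` entity, so the `employees` table never reaches the prompt. That is the intended effect of filtering on a large schema: a shorter prompt with less irrelevant context.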

Le Chen, Zhongliang Xiao, Jia Chen, Lihua Chen, Xiaolei Chen, Peng Wang, Wei Wang. Text-to-SQL Approach Based on Hierarchical Entity Indexing[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.02521022.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.02521022

