
Computer Engineering

   

Text-to-SQL Approach Based on Hierarchical Entity Indexing

  

  • Published: 2025-12-30


Abstract: Text-to-SQL technology aims to lower the barrier to database querying by letting non-technical users interact with databases in natural language. Existing approaches, however, face two major challenges. First, large language models have limited ability to generate complex SQL queries. Second, production databases are often large, and feeding the complete database schema into the model produces excessively long prompts, which raises computational cost and reduces generation accuracy. The gap between the simplicity of traditional benchmark datasets and the complexity of real-world scenarios further exacerbates these problems. To address them, this study proposes a Text-to-SQL method based on hierarchical entity indexing. Its core idea is to improve retrieval-augmented generation by dynamically selecting the database information relevant to each user query, thereby enriching the background knowledge supplied in the prompt. Experiments on open-source datasets and production data verify the approach: its SQL generation accuracy is only 0.4 percentage points below that of the top-ranked (undisclosed) method on the Spider leaderboard and 4.2 percentage points above that of the second-ranked method. Future research directions include refining the entity partitioning strategy and optimizing the index architecture to support real-time retrieval in ultra-large-scale databases. This work provides an efficient and scalable solution for practical Text-to-SQL systems.
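
The abstract does not describe the layout of the index, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a two-level hierarchy (hypothetical `Entity` groups that contain `Table` objects), simple keyword overlap as the relevance score, and illustrative names such as `retrieve_schema` and `build_prompt`. A real system would more likely use embedding-based retrieval at each level.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: list[str]


@dataclass
class Entity:
    # Hypothetical top level of the index: a group of related tables.
    name: str
    keywords: set[str]
    tables: list[Table] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    # Lowercase and split on anything that is not a letter or digit,
    # so "customer_id" yields {"customer", "id"}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def table_tokens(table: Table) -> set[str]:
    return tokenize(" ".join([table.name, *table.columns]))


def retrieve_schema(question, index, top_entities=1, top_tables=2):
    """Pick entities first, then tables inside them (assumed scoring: token overlap)."""
    q = tokenize(question)
    ranked = sorted(index, key=lambda e: len(q & e.keywords), reverse=True)
    selected = []
    for entity in ranked[:top_entities]:
        if not q & entity.keywords:
            continue  # skip entities that share nothing with the question
        tables = sorted(entity.tables,
                        key=lambda t: len(q & table_tokens(t)),
                        reverse=True)
        selected.extend(t for t in tables[:top_tables] if q & table_tokens(t))
    return selected


def build_prompt(question, tables):
    # Only the retrieved tables reach the prompt, not the full schema.
    schema = "\n".join(f"{t.name}({', '.join(t.columns)})" for t in tables)
    return f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"


# Illustrative data; none of these names come from the paper.
index = [
    Entity("sales", {"orders", "customer", "amount"}, [
        Table("orders", ["order_id", "customer_id", "amount"]),
        Table("customers", ["customer_id", "name"]),
        Table("coupons", ["coupon_id", "discount"]),
    ]),
    Entity("hr", {"employee", "salary", "department"}, [
        Table("employees", ["employee_id", "salary", "department_id"]),
    ]),
]
question = "What is the total amount of orders for each customer?"
prompt = build_prompt(question, retrieve_schema(question, index))
```

In this sketch the prompt contains only the `orders` and `customers` tables, which illustrates how filtering at the entity level and then at the table level keeps the prompt short even when the full schema is large.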
