
Computer Engineering

   

Fine-Tuning Large Language Models for Text-to-SQL Using Table Creation Information

  

  • Published: 2025-10-16


Abstract: The text-to-SQL task aims to automatically convert natural language queries into Structured Query Language (SQL). It is a key technology for enabling non-technical users to access databases conveniently and thus significantly improves data utilization. To address the problem that large language models insufficiently understand the database schema information in prompts for text-to-SQL tasks, this paper proposes a method for fine-tuning large language models based on table creation information. Existing approaches typically rely on complex, lengthy prompt templates or large amounts of fine-tuning data and face two major bottlenecks: (1) templates that include the complete prompt content dilute the few critical cues, causing attention dispersion during long-context understanding and reducing inference performance; (2) tens of thousands of samples must be manually collected and processed for large-scale fine-tuning before the model acquires a stable understanding of the text-to-SQL task. To mitigate these issues, we propose a hybrid text-to-SQL generation strategy that integrates prompt engineering with fine-tuning. The method selects the table creation information most semantically relevant to the question and combines it with a concise prompt template to construct a lightweight, manually curated fine-tuning dataset. Through supervised fine-tuning, this dataset guides large language models to better comprehend the table schema information in prompts and strengthens their grasp of the relationships between tables and questions, thereby generating more accurate SQL statements. Experimental results demonstrate that the proposed method effectively reduces the model's reliance on extraneous information in prompt templates and mitigates attention dispersion during reasoning. The generated SQL queries achieve an execution accuracy of 83.37%, a 0.49 percentage point improvement over the baseline approach.
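The abstract describes two concrete steps: selecting the table creation (CREATE TABLE) statements most relevant to a question by semantic similarity, and wrapping them in a concise prompt that also serves as a supervised fine-tuning record. The sketch below illustrates one plausible way to implement those steps; the encoder choice (a sentence-transformers model), the top_k value, the prompt wording, and the record format are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch (assumed names, not the paper's released code): pick the
# CREATE TABLE statements most relevant to a question via embedding similarity,
# then assemble a concise prompt and a supervised fine-tuning record.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def select_tables(question: str, create_stmts: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k table creation statements most similar to the question."""
    q_vec = encoder.encode(question)
    t_vecs = encoder.encode(create_stmts)
    sims = t_vecs @ q_vec / (np.linalg.norm(t_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    return [create_stmts[i] for i in np.argsort(-sims)[:top_k]]

def build_prompt(question: str, selected: list[str]) -> str:
    """Concise prompt template: only the relevant schema plus the question."""
    schema = "\n\n".join(selected)
    return (
        "Given the following table definitions, write one SQL query that answers the question.\n\n"
        f"{schema}\n\nQuestion: {question}\nSQL:"
    )

def build_sft_example(question: str, create_stmts: list[str], gold_sql: str) -> dict:
    """One record of the lightweight fine-tuning dataset: prompt plus gold SQL."""
    prompt = build_prompt(question, select_tables(question, create_stmts))
    return {"prompt": prompt, "completion": gold_sql}

# Toy usage with a made-up schema
tables = [
    "CREATE TABLE singer (singer_id INT, name TEXT, age INT);",
    "CREATE TABLE concert (concert_id INT, theme TEXT, year INT);",
    "CREATE TABLE stadium (stadium_id INT, name TEXT, capacity INT);",
]
print(build_sft_example("How many singers are older than 30?", tables,
                        "SELECT COUNT(*) FROM singer WHERE age > 30;"))
```

Restricting the prompt to the top-ranked table definitions is what keeps the template concise and limits the attention dispersion the abstract attributes to complete-schema prompts.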
