Chinese Text-to-SQL Model for Postgraduate Admissions Consultation

doi:10.19678/j.issn.1000-3428.0068504

Abstract

Postgraduate admissions consultation is a representative question-answering (Q&A) scenario in which a high volume of questions arrives within a short period. Existing admissions Q&A systems based on word vectors and similar methods return answers that are not precise enough, and their question banks must be updated every year. To address these problems, this paper introduces RESDSQL, a model based on Text-to-Structured Query Language (Text-to-SQL) technology, which converts a natural language question into an SQL statement, runs it against a structured database, and returns the answer. High-frequency consultation questions from postgraduate admissions scenarios are collected, question and SQL statement templates are built from the real admissions data of three universities, and a dataset is constructed by filling the templates, yielding 1 501 training examples and 386 test examples. Within RESDSQL, the RoBERTa model is replaced with XLM-RoBERTa, which has stronger multilingual capability, and the T5 model is replaced with mT5; both are fine-tuned on the target-domain dataset. The resulting model achieves high accuracy on admissions questions: with mT5-large, execution accuracy is 0.95 and exact match is 1. Compared with C3SQL, a zero-shot prompting method based on ChatGPT3.5, the proposed model is better in both performance and cost.

Keywords: Chinese Text-to-Structured Query Language (Text-to-SQL), natural language query, Chinese SQL statement generation, pre-trained models, Text-to-SQL datasets
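The template-filling and evaluation steps summarized in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `major` table, its rows, the single question/SQL template pair, and the helper names are hypothetical and are not taken from the paper's database (Fig.2) or dataset. The exact-match check here is a simplified whole-string comparison after whitespace and case normalization, not the paper's actual metric; Spider-style exact match, for example, compares SQL components rather than raw strings.

```python
import sqlite3

# Hypothetical template pair sharing the slots {school} and {major}.
# Gloss of the question: "How many students does {major} at {school} admit this year?"
QUESTION_TEMPLATE = "{school}{major}专业今年招多少人?"
SQL_TEMPLATE = "SELECT quota FROM major WHERE school = '{school}' AND name = '{major}'"


def build_db():
    """Create a tiny in-memory stand-in for an admissions database (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE major (name TEXT, school TEXT, quota INTEGER)")
    conn.executemany(
        "INSERT INTO major VALUES (?, ?, ?)",
        [("计算机科学与技术", "A大学", 60), ("软件工程", "A大学", 45)],
    )
    return conn


def fill_templates(conn):
    """Fill both templates with every (school, major) pair found in the database."""
    rows = conn.execute("SELECT school, name FROM major ORDER BY rowid").fetchall()
    return [
        {
            "question": QUESTION_TEMPLATE.format(school=school, major=major),
            "sql": SQL_TEMPLATE.format(school=school, major=major),
        }
        for school, major in rows
    ]


def normalize(sql):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(sql.lower().split())


def execution_match(conn, pred_sql, gold_sql):
    """True if the predicted SQL runs and returns the same rows as the gold SQL."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def evaluate(conn, predictions, examples):
    """Return (execution accuracy, simplified exact-match rate) over the examples."""
    pairs = list(zip(predictions, examples))
    ex = sum(execution_match(conn, p, e["sql"]) for p, e in pairs)
    em = sum(normalize(p) == normalize(e["sql"]) for p, e in pairs)
    return ex / len(pairs), em / len(pairs)


conn = build_db()
examples = fill_templates(conn)
# Stand-in "model outputs": the first differs from gold only in keyword case,
# the second selects the wrong column and drops a condition.
predictions = [
    examples[0]["sql"].replace("SELECT", "select"),
    "SELECT name FROM major WHERE school = 'A大学'",
]
ex_acc, em_acc = evaluate(conn, predictions, examples)
print(ex_acc, em_acc)
```

In a real pipeline the `predictions` list would come from the fine-tuned model rather than being written by hand, and slot values would need proper escaping before being formatted into SQL strings.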

WANG Qingfeng, LI Xu, YAO Chunlong, CHENG Tengteng. Chinese Text-to-SQL Model for Postgraduate Admissions Consultation[J]. Computer Engineering, 2025, 51(3): 362-368.

https://www.ecice06.com/CN/Y2025/V51/I3/362

Figures/Tables 7

Fig.1 Structure diagram of the RESDSQL model

Fig.2 Structure diagram of the postgraduate admissions database

Fig.3 Accuracy changes when fine-tuning the mT5-base model on the admissions dataset

Fig.4 Accuracy changes when fine-tuning the mT5-large model on the admissions dataset

