
Computer Engineering, 2025, Vol. 51, Issue (3): 362-368. doi: 10.19678/j.issn.1000-3428.0068504

• Development Research and Engineering Application •

Chinese Text-to-SQL Model for Postgraduate Admissions Consultation

WANG Qingfeng1, LI Xu2,*, YAO Chunlong1, CHENG Tengteng1

  1. School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, Liaoning, China
  2. Innovation and Entrepreneurship Center, Dalian Polytechnic University, Dalian 116034, Liaoning, China
  • Received: 2023-10-07 Online: 2025-03-15 Published: 2025-03-17
  • Contact: LI Xu
  • Supported by:
    Liaoning Provincial Department of Education Young Scientific and Technological Talents "Seedling" Program (J2020113); Liaoning Provincial Department of Education Scientific Research Project (LJKZ0537); 2024 Fundamental Research Funds for Liaoning Provincial Undergraduate Universities (LJ212410152070)

Abstract:

Postgraduate admissions consultation is a representative question-answering (Q&A) scenario characterized by a high frequency of queries within a short period. Existing admissions Q&A systems based on word vectors and similar methods return answers that are not precise enough, and their question banks must be updated every year. To address these problems, this paper introduces RESDSQL, a model based on Text-to-Structured Query Language (Text-to-SQL) technology that converts a natural language question into an SQL statement, queries a structured database, and returns the answer. High-frequency consultation questions from postgraduate admissions scenarios are collected, question and SQL statement templates are built from the real admissions data of three universities, and a dataset is constructed by filling the templates, yielding 1 501 training examples and 386 test examples. In RESDSQL, the RoBERTa model is replaced with XLM-RoBERTa, which has a stronger multilingual generative capability, and the T5 model is replaced with mT5; the models are then fine-tuned on the target-domain dataset. The resulting model achieves high accuracy on admissions questions: with mT5-large, the execution accuracy is 0.95 and the exact match is 1. Compared with the C3SQL method, which is based on the ChatGPT3.5 model and zero-shot prompting, the proposed model is better in both performance and cost.

Key words: Chinese Text-to-Structured Query Language (SQL), natural language query, SQL statement generation in Chinese, pre-trained models, Text-to-SQL datasets
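
The abstract describes building the dataset by filling paired question and SQL templates, and reports two metrics, exact match and execution accuracy. The Python sketch below illustrates how these pieces might fit together. It is not the authors' code: the `admission` table, its rows, the English templates, and the helper names are all hypothetical, and the exact-match check is a simplified string comparison rather than the component-level comparison used by standard Text-to-SQL benchmarks.

```python
import sqlite3
from itertools import product

# Hypothetical templates with shared slots; the paper's real templates are not shown here.
QUESTION_TEMPLATE = "How many students will {university} admit to {major} in {year}?"
SQL_TEMPLATE = (
    "SELECT quota FROM admission "
    "WHERE university = '{university}' AND major = '{major}' AND year = {year}"
)


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up admissions rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE admission (university TEXT, major TEXT, year INTEGER, quota INTEGER)"
    )
    conn.executemany(
        "INSERT INTO admission VALUES (?, ?, ?, ?)",
        [
            ("University A", "Computer Science", 2023, 40),
            ("University A", "Software Engineering", 2023, 25),
            ("University B", "Computer Science", 2023, 30),
        ],
    )
    return conn


def fill_templates(universities, majors, years):
    """Return one (question, sql) pair per combination of slot values."""
    return [
        (
            QUESTION_TEMPLATE.format(university=u, major=m, year=y),
            SQL_TEMPLATE.format(university=u, major=m, year=y),
        )
        for u, m, y in product(universities, majors, years)
    ]


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace (a crude stand-in for SQL parsing)."""
    return " ".join(sql.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    """Simplified exact match: compare normalized SQL strings."""
    return normalize(pred) == normalize(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """Execution match: both queries return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # an invalid predicted query counts as a miss
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def evaluate(conn, predictions, golds):
    """Return (exact match rate, execution accuracy) over paired queries."""
    n = len(golds)
    em = sum(exact_match(p, g) for p, g in zip(predictions, golds)) / n
    ex = sum(execution_match(conn, p, g) for p, g in zip(predictions, golds)) / n
    return em, ex


conn = build_db()
dataset = fill_templates(["University A", "University B"], ["Computer Science"], [2023])
golds = [sql for _, sql in dataset]

# Stand-in "model outputs": the first differs from the gold query only in
# keyword case and spacing; the second drops the major filter.
predictions = [
    golds[0].replace("SELECT", "select  "),
    "SELECT quota FROM admission WHERE university = 'University B' AND year = 2023",
]

em, ex = evaluate(conn, predictions, golds)
print(f"exact match = {em}, execution accuracy = {ex}")
```

In this toy setup the second prediction returns the right rows even though its SQL text differs from the gold query, so the two metrics diverge. That is why Text-to-SQL work usually reports both.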