Computer Engineering ›› 2024, Vol. 50 ›› Issue (3): 326-335. doi: 10.19678/j.issn.1000-3428.0067251

• Development Research and Engineering Application •

Combinatorial Generalization Ability Evaluation Method of SQL-to-text Model

Lin CHEN1,*, Yuankai FAN1, Zhenying HE1, Xiaoqing LIU1, Yang YANG2, Lumin TANG2

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  2. Transwarp Information Technology (Shanghai) Co., Ltd., Shanghai 200233, China
  • Received: 2023-03-23 Online: 2023-10-30 Published: 2024-03-15
  • Contact: Lin CHEN

Abstract:

Translating Structured Query Language (SQL) into natural language (SQL-to-text) can improve the usability of relational databases. Recent research in this area relies mainly on machine learning and has made some progress, but existing translation models are still not capable enough for practical use. Combinatorial (compositional) generalization is necessary for an SQL-to-text model to translate well in practice, yet this ability has rarely been studied for such models. A method for evaluating the combinatorial generalization ability of SQL-to-text models is therefore proposed. The method generates a large number of SQL queries with corresponding natural-language translations (SQL-natural language pairs) from an existing SQL-to-text dataset. It then divides the pairs into training and test data according to the number of SQL clauses each pair contains, so that every SQL clause in the test data also appears in the training data, but in different combinations. The result is a new dataset for evaluating a model's combinatorial generalization ability. The evaluation results show that the method makes relatively full use of query knowledge and divides the data more reasonably, and that the resulting dataset meets the requirements for evaluating combinatorial generalization, is close to the models' actual application scenarios, and is less constrained by the original dataset. The results also confirm that the combinatorial generalization ability of existing models still needs improvement. Among them, the relation-aware graph Transformer model, which was designed for the SQL-to-text task, has the weakest combinatorial generalization ability, indicating that the original SQL-to-text dataset does not adequately test this ability.
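The split by clause count can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which side of the split receives the shorter queries, so the sketch assumes that pairs with few clauses go to training and longer ones to testing. The `Pair` container, the `max_train_clauses` threshold, the representation of clauses as plain strings, and the toy data are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Pair(NamedTuple):
    """One SQL-natural language pair (hypothetical container)."""
    clauses: Tuple[str, ...]  # SQL clauses that make up the query
    text: str                 # natural-language translation


def split_by_clause_count(
    pairs: List[Pair], max_train_clauses: int
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (train, test) by how many SQL clauses each contains.

    Assumption: queries with at most `max_train_clauses` clauses are used for
    training and longer queries for testing. A longer query is kept for
    testing only if every one of its clauses already occurs in training, so
    the test set recombines known clauses instead of introducing new ones.
    """
    train = [p for p in pairs if len(p.clauses) <= max_train_clauses]
    seen_clauses = {c for p in train for c in p.clauses}

    test = [
        p for p in pairs
        if len(p.clauses) > max_train_clauses
        and set(p.clauses) <= seen_clauses
    ]
    return train, test


# Toy data, invented for illustration only.
pairs = [
    Pair(("SELECT name FROM singer",), "List singer names."),
    Pair(("SELECT name FROM singer", "WHERE age > 30"),
         "List names of singers older than 30."),
    Pair(("SELECT name FROM singer", "ORDER BY age"),
         "List singer names sorted by age."),
    Pair(("SELECT name FROM singer", "WHERE age > 30", "ORDER BY age"),
         "List names of singers older than 30, sorted by age."),
    Pair(("SELECT name FROM singer", "WHERE age > 30", "LIMIT 1"),
         "Give one singer older than 30."),
]

train, test = split_by_clause_count(pairs, max_train_clauses=2)
print(len(train), len(test))  # prints "3 1"
```

In this toy run the three-clause query with `LIMIT 1` is dropped from the test set because that clause never occurs in training. The remaining test query combines only clauses the model has already seen, which is the property the abstract describes.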

Key words: Structured Query Language (SQL), compositional generalization, machine translation, database, Long Short-Term Memory (LSTM) model