
计算机工程 ›› 2026, Vol. 52 ›› Issue (2): 404-412. doi: 10.19678/j.issn.1000-3428.0070118

• Large Models and Generative Artificial Intelligence •

基于大语言模型的语料库查询自动生成方法

张成辉1,2, 罗景1,2, 涂新辉3, 陈雨霖1,2   

  1. 武汉科技大学计算机科学与技术学院, 湖北 武汉 430065;
    2. 智能信息处理与实时工业系统湖北省重点实验室, 湖北 武汉 430065;
    3. 华中师范大学计算机学院, 湖北 武汉 430079
  • 收稿日期:2024-07-15 修回日期:2024-08-17 发布日期:2024-10-10
  • About the authors: ZHANG Chenghui, male, master's student; his research interests include natural language processing and information retrieval. LUO Jing (corresponding author, E-mail: luojing@wust.edu.cn) and TU Xinhui are associate professors; CHEN Yulin is a master's student.
  • Funding:
    Key Research Project of the State Language Commission (ZDI145-22); Philosophy and Social Sciences Research Project of Higher Education Institutions of Hubei Province (23Y025).

Automatic Corpus Query Generation Method Based on Large Language Model

ZHANG Chenghui1,2, LUO Jing1,2, TU Xinhui3, CHEN Yulin1,2   

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China;
    2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan 430065, Hubei, China;
    3. School of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
  • Received:2024-07-15 Revised:2024-08-17 Published:2024-10-10

摘要: 语料库查询语言(CQL)是一种用于在语料库中进行检索和分析的查询语言,自然语言自动生成CQL指将用户以自然语言表达的查询需求自动转换为标准的CQL语句,大大降低了用户使用语料库的门槛。虽然大语言模型(LLM)可以较好地完成自然语言生成任务,但是在CQL生成任务中效果还不是很理想。为此,提出一种基于LLM上下文学习的语料库查询自动生成方法T2CQL。首先,基于CQL的编写规则总结出一套简洁全面的文本到CQL(Text-to-CQL)语法知识,作为LLM实现Text-to-CQL自动转换的基础,以弥补LLM在此领域知识储备的不足。然后,基于选定的嵌入模型,选取与当前自然语言查询最接近的前k个文本-CQL样本对,以帮助LLM理解语法知识并作为参照。最后,采用生成结果校准策略来减轻LLM在生成CQL时的偏差,通过校准模型偏差提升模型生成CQL语句的性能。实验使用多个LLM在包含1 177条数据的测试集上进行测试。实验结果表明,T2CQL方法显著提升了LLM在完成Text-to-CQL自动转换任务时的性能,最优的执行准确率(EX)达到了85.13%。

关键词: 语料库查询语言, 大语言模型, 上下文学习, 自然语言处理, 提示工程

Abstract: Corpus Query Language (CQL) is a specialized language for searching and analyzing linguistic corpora. Automatically converting natural language queries into CQL statements significantly lowers the entry barrier for corpus users. Although Large Language Models (LLMs) excel at many natural language generation tasks, their performance on CQL generation has been suboptimal. To address this issue, T2CQL, an automatic corpus query generation method based on in-context learning with LLMs, is proposed. First, the method distills CQL writing rules into a concise yet comprehensive body of Text-to-CQL grammar knowledge, which serves as the basis for the LLMs to perform automatic Text-to-CQL conversion and compensates for their lack of domain-specific knowledge. Next, the top k text-CQL sample pairs most relevant to the current natural language query are selected using an embedding model; these samples act as references that help the LLMs understand the grammar rules. Finally, a calibration strategy is applied to the generated output to mitigate biases in the LLMs' CQL generation, thereby improving performance. The proposed method is evaluated with multiple LLMs on a test set of 1 177 samples. The results demonstrate that T2CQL significantly improves the performance of LLMs on the Text-to-CQL conversion task, achieving an optimal Execution Accuracy (EX) of 85.13%.
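The example-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a real system would use the trained embedding model mentioned in the abstract, whereas here a bag-of-words vector stands in for it, and the text-CQL pairs are hypothetical examples in CWB/Sketch Engine-style CQL syntax.

```python
# Sketch of top-k example selection for in-context learning: given a
# natural-language query, pick the k most similar (text, CQL) pairs to
# include in the LLM prompt. Bag-of-words vectors stand in for a real
# sentence embedding model; the example pairs are illustrative only.
from collections import Counter
from math import sqrt

EXAMPLES = [
    ("find the noun 'corpus'",           '[word="corpus" & pos="N.*"]'),
    ("an adjective followed by 'dog'",   '[pos="JJ"] [word="dog"]'),
    ("any verb immediately before 'up'", '[pos="V.*"] [word="up"]'),
]

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, k: int = 2):
    """Return the k example pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])),
                    reverse=True)
    return ranked[:k]
```

With these toy vectors, `top_k("an adjective before 'dog'", k=1)` ranks the second example pair first, since it shares the most terms with the query.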

Key words: Corpus Query Language (CQL), Large Language Model (LLM), in-context learning, Natural Language Processing (NLP), prompt engineering
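The three ingredients the abstract names, i.e. grammar knowledge, retrieved text-CQL demonstrations, and the user's query, would be assembled into a single prompt for the LLM. The sketch below shows one plausible assembly; the grammar notes and field labels are placeholders, not the paper's actual prompt text.

```python
# Sketch of prompt assembly for the in-context-learning setup: grammar
# knowledge, then retrieved (text, CQL) demonstrations, then the query
# with an empty CQL slot for the LLM to complete. Strings are illustrative.
GRAMMAR_NOTES = (
    'CQL basics: a token is [attr="value"]; adjacent tokens match a '
    'sequence; & combines constraints; attribute values are regexes.'
)

def build_prompt(examples, query: str) -> str:
    """Join grammar notes, demonstrations, and the query into one prompt."""
    demos = [f"Text: {text}\nCQL: {cql}" for text, cql in examples]
    return "\n\n".join([GRAMMAR_NOTES, *demos, f"Text: {query}\nCQL:"])

prompt = build_prompt(
    [("an adjective followed by 'dog'", '[pos="JJ"] [word="dog"]')],
    "a verb followed by 'up'",
)
```

The prompt ends with an unfilled `CQL:` field, so the model's continuation is the generated query, which the calibration step described in the abstract would then post-process.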
