
Computer Engineering ›› 2022, Vol. 48 ›› Issue (2): 125-131. doi: 10.19678/j.issn.1000-3428.0060501

• Artificial Intelligence and Pattern Recognition •

  • About the authors: YU Zunrui (born 1996), male, M.S. candidate; his main research interest is question generation. MAO Zhendong (corresponding author), research fellow and Ph.D. supervisor. WANG Quan, Ph.D. ZHANG Yongdong, professor and Ph.D. supervisor.
  • Funding:
    National Natural Science Foundation of China (U19A2057).

Keyword Aware Question Generation Based on Pre-Trained Language Model

YU Zunrui1, MAO Zhendong1, WANG Quan2, ZHANG Yongdong1   

  1. School of Information Science and Technology, University of Science and Technology of China, Hefei 230000, China;
    2. Beijing Baidu Netcom Science Technology Co., Ltd., Beijing 100000, China
  • Received:2021-01-06 Revised:2021-02-21 Published:2021-02-26



Abstract: The Question Generation (QG) task is to automatically generate a question corresponding to a given text paragraph and answer. Existing QG methods often suffer from error accumulation and from the inherent "one-to-many" nature of QG, where a single paragraph-answer pair admits multiple valid questions. To address these problems, this paper proposes a keyword-aware question generation method. Building on a pre-trained language model, we design the network structures of a keyword classification model and a QG model. To make the generated question contain the same keywords as the input paragraph, thereby ensuring semantic consistency between the question and the paragraph, the keyword classification model extracts the keywords in the paragraph, and a feature distinguishing keywords from non-keywords is integrated into the input of the QG model. This feature serves as global information during generation, reducing the QG model's reliance on locally optimal decoding decisions and thus mitigating both error accumulation and the one-to-many problem. Experimental results on the SQuAD dataset show that the method improves the quality of generated questions: its BLEU-4 score reaches 24, outperforming QG models with a copy mechanism or with semantic supervision. The method has been deployed in a large-scale industrial application on Baidu Encyclopedia, a data platform with tens of millions of entries.
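The fusion step described in the abstract — feeding a binary keyword/non-keyword feature into the QG model's input — can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the table names and sizes are invented for the example, and it simply mirrors how BERT-style models add segment embeddings to token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 100, 16

# Learned lookup tables (here random, for illustration): one row per
# vocabulary token, plus a 2-row table for the binary keyword indicator
# (row 0 = non-keyword token, row 1 = keyword token).
token_table = rng.normal(size=(vocab_size, hidden))
keyword_table = rng.normal(size=(2, hidden))

def keyword_aware_embed(token_ids, keyword_labels):
    # Sum the token embedding and the keyword-indicator embedding,
    # so the keyword feature travels with every input position.
    return token_table[token_ids] + keyword_table[keyword_labels]

ids = np.array([3, 17, 42, 5])   # token ids of a (toy) paragraph
kw = np.array([0, 1, 1, 0])      # output of the keyword classifier
out = keyword_aware_embed(ids, kw)
print(out.shape)  # (4, 16)
```

In a real model these tables would be trained jointly with the Transformer, and the keyword labels would come from the keyword classification model rather than being hand-written.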

Key words: Question Generation(QG), pre-trained language model, keyword classification, self-attention mask, embedding vector
