
Computer Engineering ›› 2024, Vol. 50 ›› Issue (3): 98-105. doi: 10.19678/j.issn.1000-3428.0067010

• Artificial Intelligence and Pattern Recognition •

Research on a Proper Noun-Enhanced Paraphrase Generation Method

Xue ZHANG1,*, Yufeng CHEN1, Jin'an XU1, Fengzhan TIAN2

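  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
    2. Beijing T&I Net Communication Co., Ltd., Beijing 100176
  • Received: 2023-02-22 Publication date: 2024-03-15 Release date: 2023-06-08
  • Corresponding author: Xue ZHANG
  • Funding:
    General Program of the National Natural Science Foundation of China (61976016); General Program of the National Natural Science Foundation of China (61976015); General Program of the National Natural Science Foundation of China (61876198); National Key Research and Development Program of China (2020AAA0108001)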

Research on Proper Noun-Enhanced Method for Paraphrase Generation

Xue ZHANG1,*, Yufeng CHEN1, Jin'an XU1, Fengzhan TIAN2

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
    2. Beijing T&I Net Communication Co., Ltd., Beijing 100176, China
  • Received: 2023-02-22 Online: 2024-03-15 Published: 2023-06-08
  • Contact: Xue ZHANG

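Abstract:

Existing Chinese paraphrase generation models frequently drop the proper nouns of the source sentence when paraphrasing sentences that contain them. This causes semantic drift in the paraphrase, reduces its usability, and in turn hurts its effectiveness in downstream tasks. To address this problem, a proper noun-enhanced paraphrase generation method is proposed. For source sentences containing a single proper noun, a placeholder-based paraphrase generation model is built: the proper noun in each training sentence pair is replaced with a placeholder, which trains the model to preserve the placeholder. For source sentences containing multiple proper nouns, a lexically constrained paraphrase generation model is built: the list of proper nouns is concatenated with the source sentence and marked as distinct from it, which trains the model to identify and copy multiple proper nouns and raises the proper noun retention rate of the paraphrases. In addition, a reference-free paraphrase quality metric that jointly considers semantic consistency and expression diversity is proposed to evaluate the generated paraphrases. Taking the intent recognition cold-start task of a real dialogue system business as the downstream task, the quality of the paraphrases generated by different models and the resulting intent recognition accuracy are compared. Experimental results show that the lexically constrained paraphrase generation model produces a high-quality paraphrase corpus that is semantically consistent with the source sentences and diverse in expression, and that the intent recognition model trained on this corpus achieves the highest accuracy; compared with a paraphrase model that does not consider proper nouns, the accuracy of the intent recognition model improves by 5.38%.

Keywords: paraphrase generation, semantic drift, placeholder, lexical constraint, intent recognition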

Abstract:

Existing Chinese paraphrase generation models often lose proper nouns in the original sentence when generating paraphrased sentences, which results in semantic deviation, reduces the usability of the paraphrased sentences, and degrades performance on downstream tasks. To solve these problems, this study proposes a proper noun-enhanced method for paraphrase generation. Specifically, a placeholder-based paraphrase generation model is proposed for original sentences containing a single proper noun. The proper noun in each training sentence pair is replaced with a placeholder to train the model's ability to retain the placeholder. A lexically constrained paraphrase generation model is proposed for original sentences containing multiple proper nouns. By concatenating the list of proper nouns with the original sentence and distinguishing the two, the model is trained to recognize and copy multiple proper nouns, improving the proper noun retention rate in the paraphrased sentences. In addition, a reference-free metric is proposed to evaluate the quality of the generated paraphrased sentences by considering both semantic consistency and expression diversity. This study takes the intent recognition cold-start task in a real dialogue system business as the downstream task and compares the quality of the paraphrased sentences generated by different models and the resulting accuracy on the intent recognition task. The experimental results show that the lexically constrained paraphrase generation model generates a high-quality paraphrase corpus that is semantically consistent with the original sentences and diverse in expression, and that the intent recognition model trained on this corpus achieves the highest accuracy. Compared to the paraphrase model that does not consider proper nouns, the accuracy of the intent recognition model is increased by 5.38%.

Key words: paraphrase generation, semantic deviation, placeholder, lexical constraint, intent recognition
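
The two preprocessing ideas described in the abstract, placeholder substitution for a single proper noun and concatenation of a proper noun list with the original sentence, can be illustrated with a minimal Python sketch. The abstract does not specify token formats or formulas, so everything here is an assumption made for illustration: the `<PN>` placeholder token, the `<sep>` separator, all function names, and the simple substring-based definition of the retention rate are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

PLACEHOLDER = "<PN>"  # hypothetical placeholder token (not from the paper)
SEP = "<sep>"         # hypothetical separator between noun list and sentence


def mask_proper_noun(source: str, target: str, noun: str) -> Tuple[str, str]:
    """Replace one proper noun with the placeholder on both sides of a training pair."""
    return source.replace(noun, PLACEHOLDER), target.replace(noun, PLACEHOLDER)


def restore_proper_noun(generated: str, noun: str) -> str:
    """Substitute the proper noun back into a generated paraphrase."""
    return generated.replace(PLACEHOLDER, noun)


def build_constrained_input(source: str, nouns: List[str]) -> str:
    """Prepend the proper noun list to the sentence, set apart by the separator."""
    return " ".join(nouns) + f" {SEP} " + source


def retention_rate(paraphrase: str, nouns: List[str]) -> float:
    """Fraction of the required proper nouns that appear in the paraphrase.

    This substring check is an assumed definition, used only for illustration.
    """
    if not nouns:
        return 1.0
    return sum(noun in paraphrase for noun in nouns) / len(nouns)
```

For example, `mask_proper_noun("How do I reset my Alipay password?", "What is the way to change the Alipay password?", "Alipay")` yields a training pair in which both sentences carry `<PN>` in place of the product name, and `restore_proper_noun` reverses the substitution on a model output. The constrained-input variant leaves the sentence untouched and relies on the model learning to copy the listed nouns, which is why it extends naturally to sentences with several proper nouns.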