
Computer Engineering ›› 2023, Vol. 49 ›› Issue (11): 77-84. doi: 10.19678/j.issn.1000-3428.0066089

• Artificial Intelligence and Pattern Recognition •

Chinese Grammatical Error Correction Based on Grammatical Knowledge Enhancement

Qian DENG, Shu CHEN, Junmin YE

  1. School of Computer Science, Central China Normal University, Wuhan 430079, China
  • Received: 2022-10-24 Online: 2023-11-15 Published: 2023-11-08
  • About the authors:

    DENG Qian (b. 1999), female, M.S. candidate; her main research interest is natural language processing.

    CHEN Shu, Ph.D.

    YE Junmin, professor, Ph.D.

  • Supported by:
    National Social Science Fund of China Post-Funded Project (20FTQB020)

Abstract:

Grammatical error correction aims to determine whether a natural language text contains grammatical errors and to correct the affected sentences. With the rapid development of pre-trained language models, methods based on such models have been widely applied to Chinese Grammatical Error Correction (CGEC). However, existing pre-trained language models lack the specific grammatical knowledge required for error correction, which limits their correction performance. To address this problem, this paper proposes a CGEC model built on a pre-trained model enhanced with a grammatical knowledge graph. First, structured knowledge encoding maps the structured knowledge in the grammatical knowledge graph into word entity embeddings. Next, a dedicated pre-training masking strategy jointly learns context and the grammatical knowledge between words by predicting both characters and words. Finally, the pre-trained model is fine-tuned with an error-detection network and a correction network to perform the CGEC task. Applied in sequence, these three components extract grammatical knowledge more fully and help the model better capture the grammatical relationships between words in a sentence. Experimental results on the NLPCC 2018 test dataset show that the grammatical knowledge enhancement method improves the model's F0.5 score by 4.83 percentage points, and the F0.5 score of the proposed model is 8.85 percentage points higher than that of the top-ranked model in the NLPCC 2018 shared task, demonstrating the effectiveness of the grammatical-knowledge-graph-based pre-trained model for CGEC.
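
As a concrete illustration of the final fine-tuning step, the following minimal Python (PyTorch) sketch shows one plausible shape for the error-detection and correction networks described above. This is only a sketch under stated assumptions: the abstract does not specify the actual architecture, so the class name DetectCorrectHead, the single-linear-layer heads, and all dimensions (including the BERT-style vocabulary size) are illustrative; a real implementation would attach such heads to the grammatical-knowledge-enhanced pre-trained encoder.

    import torch
    import torch.nn as nn

    class DetectCorrectHead(nn.Module):
        """Hypothetical detection + correction heads over encoder outputs."""

        def __init__(self, hidden_size: int, vocab_size: int):
            super().__init__()
            # Detection network: per-token probability that a token is erroneous.
            self.detector = nn.Linear(hidden_size, 1)
            # Correction network: per-token distribution over the vocabulary,
            # from which corrected characters/words are predicted.
            self.corrector = nn.Linear(hidden_size, vocab_size)

        def forward(self, hidden_states: torch.Tensor):
            # hidden_states: (batch, seq_len, hidden_size), produced by the
            # pre-trained encoder (not modeled here).
            error_prob = torch.sigmoid(self.detector(hidden_states)).squeeze(-1)
            correction_logits = self.corrector(hidden_states)
            return error_prob, correction_logits

    # Toy usage with random tensors standing in for encoder outputs.
    if __name__ == "__main__":
        batch, seq_len, hidden, vocab = 2, 16, 768, 21128  # illustrative sizes
        head = DetectCorrectHead(hidden, vocab)
        encoder_out = torch.randn(batch, seq_len, hidden)
        p_err, logits = head(encoder_out)
        print(p_err.shape, logits.shape)  # torch.Size([2, 16]) torch.Size([2, 16, 21128])

During fine-tuning, the detection head would typically be trained with a binary token-level loss and the correction head with cross-entropy against the reference tokens, jointly with the encoder; the paper's exact training objectives are not given in this abstract.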

Key words: grammatical error correction, pre-trained language model, heterogeneous knowledge encoding, knowledge graph, deep learning