Computer Engineering ›› 2023, Vol. 49 ›› Issue (3): 304-311. doi: 10.19678/j.issn.1000-3428.0064014

• Development Research and Engineering Application •

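  • About the authors: CHEN Bailin (born 1997), male, master's student, main research interest: natural language processing; WANG Tianji, master's student; REN Lina, doctoral student; HUANG Ruizhang, professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (62066007).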

Method for Chinese Grammar Error Detection Integrating ELECTRA and Text Local Information

CHEN Bailin1,2, WANG Tianji1,2, REN Lina1,2,3, HUANG Ruizhang1,2   

  1. State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China;
  2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China;
  3. Guizhou Institute of Light Industry, Guiyang 550025, China
  • Received: 2022-02-23 Revised: 2022-04-11 Published: 2022-08-08

Abstract: Grammar error detection is a fundamental task in natural language processing that aims to automatically identify errors in text such as wrongly written characters, grammatical errors, and word-order errors. Compared with other languages, Chinese grammar is flexible and lacks explicit markers such as tense and voice, so the local information of a text plays an important role in Chinese Grammar Error Detection (CGED). Conventional machine learning methods struggle to detect grammatical errors in text, and existing deep learning methods do not make full use of the local information of the text during error correction, which leads to poor detection performance. To address this problem, this study proposes a CGED model, ELECTRA-GCNN-CRF, that integrates ELECTRA with the local information of the text. Grammar error detection is treated as a sequence labeling task. First, the text is represented with the ELECTRA pre-trained language model. Second, a Convolutional Neural Network (CNN) extracts local positional and semantic information from the text, and a residual gating mechanism is introduced to reduce the influence of irrelevant information. Finally, a Conditional Random Field (CRF) model learns the dependencies between tags and outputs a grammar error tag sequence that conforms to the labeling rules. Experiments on the NLPTEA Chinese grammatical error detection dataset show that, compared with the baseline models, ELECTRA-GCNN-CRF improves the F1 score at the detection, identification, and position levels by an average of 0.94, 3.74, and 5.03 percentage points, respectively, indicating that the model effectively improves grammatical error detection.

Key words: ELECTRA pre-trained language model, local information, Chinese Grammar Error Detection (CGED), Convolutional Neural Network (CNN), residual gating mechanism
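
The residual gating step named in the abstract can be illustrated with a small sketch. The abstract does not give the model's equations, so the formulation below, out = (1 - σ(g)) · x + σ(g) · conv(x), is only one common form of a gated residual convolution and is an assumption, not the authors' definition. Random vectors stand in for ELECTRA token representations, the CRF layer is omitted, and all function names (`conv1d_same`, `gated_residual_conv`, `random_kernel`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def conv1d_same(x, kernel, bias):
    """1-D convolution over the token axis with zero padding (output length = input length).

    x:      list of T token vectors, each of dimension d
    kernel: list of k weight matrices (k odd), each d x d, one per window offset
    bias:   list of d floats
    """
    T, d, k = len(x), len(bias), len(kernel)
    half = k // 2
    out = []
    for t in range(T):
        y = list(bias)
        for j in range(k):
            s = t + j - half  # neighbouring token covered by window offset j
            if 0 <= s < T:    # positions outside the sentence count as zeros
                W = kernel[j]
                for o in range(d):
                    y[o] += sum(W[o][i] * x[s][i] for i in range(d))
        out.append(y)
    return out


def gated_residual_conv(x, w_feat, b_feat, w_gate, b_gate):
    """Assumed form: out = (1 - sigmoid(g)) * x + sigmoid(g) * conv(x).

    A gate near 0 passes the input through unchanged (residual path);
    a gate near 1 replaces it with the local convolutional feature.
    """
    feat = conv1d_same(x, w_feat, b_feat)
    gate = conv1d_same(x, w_gate, b_gate)
    out = []
    for x_t, f_t, g_t in zip(x, feat, gate):
        row = []
        for xi, fi, gi in zip(x_t, f_t, g_t):
            s = sigmoid(gi)
            row.append((1.0 - s) * xi + s * fi)
        out.append(row)
    return out


def random_kernel(k, d, rng):
    """Hypothetical helper: k random d x d matrices for demonstration only."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
            for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    T, d, k = 5, 4, 3  # 5 tokens, 4-dim vectors, window of 3 tokens
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
    mixed = gated_residual_conv(
        tokens,
        random_kernel(k, d, rng), [0.0] * d,
        random_kernel(k, d, rng), [0.0] * d,
    )
    print(len(mixed), len(mixed[0]))  # one output vector per input token
```

Because the gate is computed per token and per dimension, the layer can keep the pretrained representation where the local window adds nothing and overwrite it where neighbouring tokens are informative, which is one way to read the abstract's "reduce the influence of irrelevant information".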

CLC Number: